Refefer / fastxml

FastXML / PFastXML / PFastreXML - Implementation of Extreme Multi-label Classification

Trainer.fit "Requires list of csr_matrix" #3

Closed xiaohan2012 closed 7 years ago

xiaohan2012 commented 7 years ago

I tried to train a model with input:

X_train
<4768x31412 sparse matrix of type '<class 'numpy.float64'>'
    with 398434 stored elements in Compressed Sparse Row format>
Y_train  # of length 4768
[[52, 62, 33],
 [31],
 [71], ...]

Then I ran:

from fastxml import Trainer, Inferencer
trainer = Trainer(n_trees=32, n_jobs=-1)
trainer.fit(X_train, Y_train)

which gives:

AssertionError                            Traceback (most recent call last)
<ipython-input-15-f463a58ca9a3> in <module>()
      1 trainer = Trainer(n_trees=32, n_jobs=-1)
      2 
----> 3 trainer.fit(X_train, Y_train)
      4 
      5 

/usr/local/lib/python3.5/dist-packages/fastxml-2.0.0-py3.5-linux-x86_64.egg/fastxml/trainer.py in fit(self, X, y, weights)
    463 
    464     def fit(self, X, y, weights=None):
--> 465         self.roots = self._build_roots(X, y, weights)
    466         if self.leaf_classifiers:
    467             self.norms_, self.uxs_, self.xr_ = self._compute_leaf_probs(X, y)

/usr/local/lib/python3.5/dist-packages/fastxml-2.0.0-py3.5-linux-x86_64.egg/fastxml/trainer.py in _build_roots(self, X, y, weights)
    381 
    382     def _build_roots(self, X, y, weights):
--> 383         assert isinstance(X, list) and isinstance(X[0], sp.csr_matrix), "Requires list of csr_matrix"
    384         if self.n_jobs > 1:
    385             f = fork_call(self.grow_root)

AssertionError: Requires list of csr_matrix

Why does it require a list of csr_matrix? What does each csr_matrix represent?

Refefer commented 7 years ago

Hi there!

The problem you're running into is that fit expects X as a Python list with each example as a separate csr_matrix, with data type float32. We do this because of how FastXML splits the training set as it trains each level of the tree: after each split, we operate on a smaller subset of the data.

Why not do this automatically for the user? The main reason is that splitting CSR matrices is really expensive: we can easily double training time just by creating slices of the matrix. On some larger datasets, it also requires nearly double the memory, which makes those datasets harder to work with.
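To make the cost difference concrete, here is a minimal sketch of the idea. It is an illustration only, not FastXML's actual splitting code, and the index lists left_idx and right_idx are hypothetical. Partitioning a Python list of row matrices only copies references, while indexing a CSR matrix by a list of rows allocates a new matrix.

```python
import numpy as np
import scipy.sparse as sp

# Tiny stand-in dataset: 4 examples, 4 features, float32.
X = sp.csr_matrix(np.eye(4, dtype=np.float32))

# One 1 x n_features csr_matrix per example.
rows = [X[i] for i in range(X.shape[0])]

# Hypothetical split of the examples into two child nodes.
left_idx, right_idx = [0, 2], [1, 3]

# Splitting the list just builds new lists that point at the same row objects.
left = [rows[i] for i in left_idx]
right = [rows[i] for i in right_idx]
same_object = left[0] is rows[0]

# Splitting the matrix itself builds a new CSR matrix for the subset.
left_matrix = X[left_idx]
```

Here same_object is True, so repeated splits down the tree never duplicate the underlying row data, whereas each X[left_idx] produces a separate matrix.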

The same reasoning applies to why we use float32 instead of float64: float64 doubles the memory requirement.
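As a rough check on the float32 point, the sketch below compares the byte sizes of the three arrays that make up a CSR matrix. The helper csr_nbytes is a hypothetical name introduced for illustration. For the matrix in the question, the 398434 stored values alone would take 398434 * 8 = 3,187,472 bytes as float64 versus 1,593,736 bytes as float32.

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[0.0, 1.5, 0.0],
                  [2.0, 0.0, 3.0]])
X64 = sp.csr_matrix(dense)       # float64, inherited from the dense array
X32 = X64.astype(np.float32)     # same sparsity pattern, float32 values

def csr_nbytes(m):
    """Return the size in bytes of each array backing a CSR matrix."""
    return {
        "data": m.data.nbytes,        # the stored values
        "indices": m.indices.nbytes,  # column index of each stored value
        "indptr": m.indptr.nbytes,    # row start offsets
    }

sizes64 = csr_nbytes(X64)
sizes32 = csr_nbytes(X32)
print(sizes64)
print(sizes32)
```

Only the values array shrinks when the dtype changes; the index arrays stay the same size, so the total saving is somewhat less than half.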

As for how to solve your immediate problem:

X_train_new = [X_train[i].astype('float32') for i in range(X_train.shape[0])]
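The same conversion can be wrapped in a small helper. This is a sketch: to_row_list is a hypothetical name and not part of fastxml, and the tiny dense matrix only stands in for the real X_train. It casts the whole matrix to float32 once and then slices out one row per example, which satisfies the check in the assertion from the traceback.

```python
import numpy as np
import scipy.sparse as sp

def to_row_list(X, dtype=np.float32):
    """Split a 2-D sparse matrix into a list of 1 x n_features CSR rows."""
    X = X.tocsr().astype(dtype)  # cast once, so every row slice is already float32
    return [X[i] for i in range(X.shape[0])]

# Small stand-in for X_train (float64, CSR).
dense = np.array([[0.0, 1.5, 0.0, 0.0],
                  [2.0, 0.0, 0.0, 3.0],
                  [0.0, 0.0, 0.0, 0.5]])
X_train = sp.csr_matrix(dense)

X_train_new = to_row_list(X_train)

# Same condition the trainer asserts on.
ok = isinstance(X_train_new, list) and isinstance(X_train_new[0], sp.csr_matrix)

# With fastxml installed, training would then follow the call from the question:
# trainer = Trainer(n_trees=32, n_jobs=-1)
# trainer.fit(X_train_new, Y_train)
```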

Let me know if this helps :)

xiaohan2012 commented 7 years ago

Hi,

It works! Thank you!

Han

(Replying by email to Andrew Stanton's comment above, dated Thu, Aug 31, 2017 at 2:57 AM.)