Refefer / fastxml

FastXML / PFastXML / PFastreXML - Implementation of Extreme Multi-label Classification

help using from python #6

Closed dfalbel closed 6 years ago

dfalbel commented 6 years ago

First of all, thank you very much for your work! I still don't understand how to pass values from Python.

I have a scipy.sparse.csr_matrix with dimensions (3,000,000, 8,000), which I am passing to the fit method. But I get the message: AssertionError: Requires list of csr_matrix.

Do I need to input a list of 3,000,000 elements, each one a csr_matrix?

Thanks

Refefer commented 6 years ago

Hey there - that's correct. The reason boils down to performance: indexing into a sparse matrix requires Python to create a new sparse matrix, and creating them is expensive. Over the course of several trees, that added around 40-50% more time to training. I originally tried to directly index into the matrices, but that caused big memory issues when training with multiple threads.

Let me know if you have any issues!
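For anyone landing here with the same question, here is a minimal sketch of splitting one large csr_matrix into a list of single-row csr_matrix objects, which is the shape of input the assertion message asks for. It uses only scipy; the helper name `csr_to_row_list` is made up for illustration, and how labels are passed to fit is not covered in this thread, so check the README for that part.

```python
import scipy.sparse as sp


def csr_to_row_list(X):
    """Split an (n, d) sparse matrix into a list of n (1, d) CSR matrices.

    Hypothetical helper, not part of fastxml. Each X[i] builds a new
    sparse matrix, so this pays the indexing cost once, up front,
    instead of repeatedly during training.
    """
    X = sp.csr_matrix(X)  # make sure row slicing is cheap (CSR format)
    return [X[i] for i in range(X.shape[0])]


# Small random matrix standing in for the real (3,000,000, 8,000) data.
X = sp.random(5, 8, density=0.3, format="csr", random_state=0)
rows = csr_to_row_list(X)
```

Stacking the rows back with `sp.vstack(rows)` should reproduce the original matrix, which is a quick way to sanity-check the conversion on a small sample before running it on all 3,000,000 rows.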

dfalbel commented 6 years ago

Alright! Thanks for the fast answer! It's good to know that Python creates a new matrix when indexing; I never thought about that! I'll let you know if I have any more issues!



Refefer commented 6 years ago

It's pretty annoying how slow the scipy sparse classes are. A good portion of the Cython code implements optimized routines to make them faster to use, like computing dot products :)
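To illustrate the kind of routine meant here, below is a rough pure-Python sketch of a dot product between two single-row CSR matrices that walks the raw `indices` and `data` arrays directly instead of going through scipy's operators. This is an assumption about the general technique, not the project's actual Cython code, and the function name `sparse_row_dot` is hypothetical. In pure Python the loop is slow; the point is only that it allocates no intermediate sparse matrices, which is what a typed Cython version would exploit.

```python
import scipy.sparse as sp


def sparse_row_dot(a, b):
    """Dot product of two (1, d) CSR rows by merging sorted column indices.

    Hypothetical illustration, not taken from fastxml.
    """
    # sorted_indices() returns a copy whose column indices are ascending,
    # which the two-pointer merge below relies on.
    a, b = a.sorted_indices(), b.sorted_indices()
    i = j = 0
    total = 0.0
    while i < a.nnz and j < b.nnz:
        col_a, col_b = a.indices[i], b.indices[j]
        if col_a == col_b:
            # Both rows have a stored value in this column.
            total += a.data[i] * b.data[j]
            i += 1
            j += 1
        elif col_a < col_b:
            i += 1
        else:
            j += 1
    return total


M = sp.random(2, 50, density=0.4, format="csr", random_state=1)
result = sparse_row_dot(M[0], M[1])
```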