Refefer / fastxml

FastXML / PFastXML / PFastreXML - Implementation of Extreme Multi-label Classification

Why do we limit X to be a list of csr_matrix for training? #14

Closed yupbank closed 6 years ago

yupbank commented 6 years ago

https://github.com/Refefer/fastxml/blob/master/fastxml/trainer.py#L383

Refefer commented 6 years ago

Great question. Because we end up re-splitting the dataset at each stage of the tree, if you give it a single sparse matrix, Python will have to keep recreating each of the CSR rows individually. This is incredibly slow and uses several times more memory.

I require the data to be a list of sparse matrices so that we don't have to do a full memory copy to convert it from a csr_matrix to a list of CSR matrices.
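As a rough illustration of the point above, here is a minimal sketch, assuming SciPy's `csr_matrix`. The helper names `to_row_list` and `partition` are hypothetical and are not part of fastxml. The idea is that the matrix is split into rows once, and every later tree split only shuffles references to those existing row objects:

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack


def to_row_list(X):
    """Split a 2-D sparse matrix into a list of 1 x n_features csr_matrix rows.

    Hypothetical helper: this is the one-time conversion cost. Each X[i]
    allocates a new single-row CSR matrix.
    """
    X = csr_matrix(X)
    return [X[i] for i in range(X.shape[0])]


def partition(rows, go_left):
    """Hypothetical node split: divide rows into two lists by a boolean mask.

    Only references to the existing row objects are moved around, so no
    sparse data is copied or rebuilt here.
    """
    left = [r for r, flag in zip(rows, go_left) if flag]
    right = [r for r, flag in zip(rows, go_left) if not flag]
    return left, right


# Small toy dataset: 4 samples, 3 features.
X = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [4.0, 0.0, 5.0],
]))

rows = to_row_list(X)  # done once, before training
left, right = partition(rows, [True, False, True, False])

# The split lists hold the very same objects as the original list.
same_object = left[0] is rows[0]

# Stacking the left rows reproduces the corresponding slice of X.
left_matrix = vstack(left).toarray()
```

If the caller passed a single `csr_matrix` instead, something like `to_row_list` (or repeated row slicing) would have to run inside the trainer, which is the extra copy the comment above describes.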