Closed by yupbank 6 years ago
Great question. Because we re-split the dataset at each stage of the tree, if you give it a single sparse matrix, Python has to keep recreating each of the CSR rows individually. This is incredibly slow and uses several times more memory.
I require the data to be a list of sparse matrices so we don't have to do a full memory copy to convert it from a csr_matrix to a list of CSR matrices.
https://github.com/Refefer/fastxml/blob/master/fastxml/trainer.py#L383
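To illustrate the idea, here is a minimal sketch, not the actual fastxml code. It assumes SciPy and NumPy are installed; `to_row_list` and `split_rows` are hypothetical helper names made up for this example. The point is that the rows are sliced out of the matrix once, and every later split only moves references to those row objects around.

```python
import numpy as np
from scipy.sparse import csr_matrix


def to_row_list(X):
    """Slice a CSR matrix into a list of 1 x n_features CSR rows (done once)."""
    X = csr_matrix(X)
    return [X[i] for i in range(X.shape[0])]


def split_rows(rows, mask):
    """Partition a list of rows by a boolean mask without copying row data."""
    left = [r for r, keep in zip(rows, mask) if keep]
    right = [r for r, keep in zip(rows, mask) if not keep]
    return left, right


# Toy data: 4 samples, 3 features.
X = csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
]))

rows = to_row_list(X)  # the only place new CSR rows are built
left, right = split_rows(rows, [True, False, True, False])

# The split lists hold the same row objects, not fresh copies.
print(left[0] is rows[0])
```

If you already have a csr_matrix, a one-off conversion like `to_row_list(X)` before training is the kind of step the input requirement implies; the cost is paid once instead of at every node of the tree.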