Bioconductor / SparseArray

High-performance sparse data representation and manipulation in R

Better method than Matrix::readMM for reading MatrixMarket formats directly into a SparseArray object #5

Open hpages opened 3 years ago

hpages commented 3 years ago

I heard someone say:

SparseArray could also provide a better method than Matrix::readMM for reading MatrixMarket formats directly into a SparseArray object.

Sounds good to me.

LTLA commented 3 years ago

It was I!

While we're on this topic, you could also take scuttle::readSparseCounts() off my hands. This will read a matrix in a dense CSV file into a dgCMatrix by chunk-wise processing. Such dense matrices are quite common, especially in older scRNA-seq studies where no one had an idea about what to do with sparse matrices and just treated them in the same way as bulk RNA-seq.
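The chunk-wise idea described above can be illustrated with a minimal, language-agnostic sketch. It is written in Python using only the standard library and is not the scuttle or SparseArray implementation. The function name `read_dense_csv_sparse`, the `chunk_rows` parameter, and the assumed layout (one header row, row names in the first column, integer counts) are all hypothetical choices made for illustration.

```python
import csv
import io
from itertools import islice

def read_dense_csv_sparse(handle, chunk_rows=2):
    """Read a dense integer CSV chunk by chunk, keeping only nonzero entries.

    Assumes (for this sketch only) a header row and row names in column 1.
    Returns (row_names, col_names, triplets) with 0-based (row, col, value).
    """
    reader = csv.reader(handle)
    header = next(reader)
    col_names = header[1:]      # first header cell labels the row-name column
    row_names = []
    triplets = []
    while True:
        # Only `chunk_rows` dense rows are held in memory at any time.
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            break
        for fields in chunk:
            i = len(row_names)
            row_names.append(fields[0])
            for j, cell in enumerate(fields[1:]):
                value = int(cell)
                if value != 0:  # zeros are dropped as soon as they are parsed
                    triplets.append((i, j, value))
    return row_names, col_names, triplets

# Tiny in-memory example standing in for a real file.
demo = io.StringIO("gene,cell1,cell2,cell3\nA,0,3,0\nB,0,0,0\nC,5,0,1\n")
row_names, col_names, triplets = read_dense_csv_sparse(demo)
```

The point of the design is that peak memory scales with the chunk size plus the number of nonzero values, not with the full dense matrix.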

hpages commented 1 year ago

SparseArray has readSparseCSV() which is similar to scuttle::readSparseCounts() but returns an SVT_SparseArray object (of type() "integer") instead of a dgCMatrix object (the user can just coerce if they want the latter). If it does what you need, feel free to deprecate scuttle::readSparseCounts() in favor of that. If it doesn't, let me know how it should be improved.

LTLA commented 1 year ago

Thanks Herve. A couple of thoughts.

Comparing SparseArray::readSparseCSV() and scuttle::readSparseCounts(), there are quite a few options in the latter that are not (yet) in the former. This refers to many of the read.table-like options such as skip.*, *.names, etc. Inspecting some real-world usage, I can see some CSVs with, e.g., different quoting methods for the row/column names, sometimes column names aren't present, sometimes rows/columns need to be skipped. I remember the Zeisel datasets being particularly tedious, though that particular script was before I wrote readSparseCounts(). Anyway, my point is that it would be helpful to have a few of these options, given that there isn't a standard way of storing matrices in CSVs and users need to be able to adapt to whatever zany formatting was provided by the data generator.

The other thought is that my original comment actually refers to Matrix::readMM, which creates a dgTMatrix by default. There's an opportunity for decent optimization if we can read this directly into a sparse array. If this were available, DropletUtils::read10xCounts() would switch to it ASAP.
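To make the optimization opportunity concrete, here is a hedged sketch of reading a MatrixMarket coordinate file straight into per-column storage, without first building a full triplet (row, column, value) object. It is written in Python with the standard library only. The function name `read_mm_by_column` is hypothetical, the per-column list layout is an illustrative assumption and not SparseArray's actual internal representation, and the sketch handles only `integer` or `real` matrices with `general` symmetry.

```python
import io

def read_mm_by_column(handle):
    """Parse a MatrixMarket coordinate stream into per-column (row, value) lists."""
    banner = handle.readline().lower().split()
    if banner[:3] != ["%%matrixmarket", "matrix", "coordinate"]:
        raise ValueError("expected a MatrixMarket coordinate file")
    field, symmetry = banner[3], banner[4]
    if field not in ("integer", "real") or symmetry != "general":
        raise ValueError("this sketch handles only integer/real 'general' matrices")
    parse = int if field == "integer" else float

    # Skip comment and blank lines until the size line: nrow ncol nnz.
    line = handle.readline()
    while line and (line.startswith("%") or not line.strip()):
        line = handle.readline()
    nrow, ncol, nnz = (int(tok) for tok in line.split())

    # Each entry goes directly into its column's bucket (indices are 1-based on disk).
    columns = [[] for _ in range(ncol)]
    for _ in range(nnz):
        i, j, value = handle.readline().split()
        columns[int(j) - 1].append((int(i) - 1, parse(value)))
    for col in columns:
        col.sort()  # order each column by row index
    return (nrow, ncol), columns

demo = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "% a comment line\n"
    "3 2 3\n"
    "3 1 5\n"
    "3 2 1\n"
    "1 2 3\n"
)
dims, columns = read_mm_by_column(demo)
```

A real reader would also need to cover the `pattern` field, symmetric matrices, and compressed input, which this sketch deliberately leaves out.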

hpages commented 1 year ago

Makes sense. Thanks for the feedback. Added to the TODO list: https://github.com/Bioconductor/SparseArray/blob/872c617a8f856d3b86b33f2881086276cb0d506d/TODO#L88-L90

drighelli commented 1 year ago

I was about to open an issue about the new readSparseCSV function, but it seems to be related to Aaron's comment.

I found the function really fast (4 times faster than data.table::fread in my case) and helpful. However, I noticed that it seems to automatically use the first column of the file as the rownames of the returned SVT_SparseMatrix, which might not always be the desired behaviour.

I'm sure this will be easily solved by the improvements already mentioned because, as it currently stands, there is no argument for specifying which column to use for the rownames.

I hope this could be helpful :)

hpages commented 1 year ago

Thanks @drighelli. So many things on SparseArray's TODO but your input helps me prioritize things.

drighelli commented 1 week ago

Hi @hpages, sorry to ping you on this, but I was wondering if there are any updates on this issue.

It would be interesting to use the readSparseCSV function in functions that import spatial data.

Thanks, Dario