UCLouvain-CBIO / scp

Single cell proteomics data processing
https://uclouvain-cbio.github.io/scp/index.html

Sparse arrays with NAs #75

Open lgatto opened 1 month ago

lgatto commented 1 month ago

The example below speaks for itself:

> library(scp)
> data("leduc_minimal", package = "scp")
> library(SparseArray)
> NaArray(assay(leduc_minimal))
<200 x 73 NaMatrix> of type "double" [nnacount=7182 (49%)]:
                  eAL00219RI5  eAL00219RI6 ... wAL00286RI17 wAL00286RI18
       SAVEDEGLK           NA           NA   .     11.01346          NaN
        APNVVVTR     13.56997     13.85389   .     11.80130     11.45907
       IVVVTAGVR           NA           NA   .     12.68021     12.82554
  GFQEVVTPNIFNSR     10.78733      9.36700   .          NaN          NaN
  QLNNLALLCQNQGK           NA           NA   .           NA           NA
             ...            .            .   .            .            .
      LTDQVMQNPR    10.292322     9.723388   .           NA           NA
      LGAEVYHTLK     9.423704     8.523444   .           NA           NA
 AANSLEAFIFETQDK          NaN          NaN   .          NaN          NaN
       FLLAVSRDR           NA           NA   .     10.34729     10.51096
EASMVITESPAALQLR           NA           NA   .           NA           NA

Ping @cvanderaa

lgatto commented 1 month ago

NaArray is still a work in progress.

A comment on the above from Hervé:

You seem to have quite a few NaN's too. You'll improve sparsity, and hence reduce memory footprint, if you replace them with NA's.
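Hervé's suggestion can be sketched in base R as below. This is only an illustration: the toy matrix `m` and its peptide/cell names are made up and stand in for `assay(leduc_minimal)`. It relies on the fact that `is.nan()` is `TRUE` only for `NaN`, whereas `is.na()` is `TRUE` for both `NA` and `NaN`.

```r
# Toy matrix standing in for assay(leduc_minimal); values and names are made up.
m <- matrix(c(NA, 13.57, NaN, 10.79, NaN, NA), nrow = 3,
            dimnames = list(c("pepA", "pepB", "pepC"), c("cell1", "cell2")))

# Count the NaN entries before cleaning (is.nan() ignores plain NA).
n_nan_before <- sum(is.nan(m))

# Replace every NaN with NA so that all missing values share one encoding.
m[is.nan(m)] <- NA_real_

# After cleaning there are no NaN left; is.na() now counts all missing entries.
n_nan_after <- sum(is.nan(m))
n_na_after  <- sum(is.na(m))
```

The cleaned matrix can then be passed to `NaArray()` as in the example at the top of the issue; presumably only the `NA` entries are treated as the implicit background value, which is why the remaining `NaN`s reduce sparsity.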

For a full dataset without any NaNs:

> x <- getWithColData(leduc2022_plexDIA(), 47)
see ?scpdata and browseVignettes('scpdata') for documentation
loading from cache
Warning message:
'experiments' dropped; see 'drops()' 
> object.size(assay(x))
3283096 bytes
> object.size(NaArray(assay(x)))
1571688 bytes
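
To put the two sizes in perspective, the ratio can be computed directly from the numbers reported above (nothing here is measured anew; the variable names are just for illustration):

```r
dense_bytes  <- 3283096  # object.size(assay(x)), as reported above
sparse_bytes <- 1571688  # object.size(NaArray(assay(x))), as reported above

# Fraction of the dense footprint that the NaArray needs, and the saving.
ratio  <- sparse_bytes / dense_bytes
saving <- 1 - ratio

cat(sprintf("NaArray uses %.0f%% of the dense size (%.0f%% saved)\n",
            100 * ratio, 100 * saving))
```

So for this dataset the NaArray takes just under half the memory of the dense matrix.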

cvanderaa commented 1 month ago

Yeeees, finally!!! I had lost hope for sparse NA matrices ever since I read this thread: https://stackoverflow.com/questions/1274171/creating-and-accessing-a-sparse-matrix-with-na-default-entries

How would you see this implemented? For SCP, it makes sense to always store the assay data as sparse arrays, so this could be integrated into readSCP().

I would, however, hold off on the implementation until the NaArray functionality has matured. For instance, I see that matrix algebra is not yet available.

lgatto commented 1 month ago

The testing/benchmarking will be part of @leopoldguyot's thesis work.