alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

Load STARsolo EM output into Seurat: element type 'float' not recognized #1397

Open MarcusLCC opened 3 years ago

MarcusLCC commented 3 years ago

Hi Alex,

Thanks for developing such an amazing tool. I run STARsolo on the single cell RNAseq data. Since I want to have more insights into two similar genes, I added --soloMultiMappers Uniform PropUnique EM Rescue to see how it goes.

However, when I tried to load the result into Seurat, I found out that the 'float' in the UniqueAndMult-EM.mtx.gz output by STARsolo might not be supported, since the error showed

> seurat = Read10X("../starsolo_out/ControlASolo.out/Gene/raw/EM/")
Error in readMM(file = matrix.loc) : element type 'float' not recognized

I tried with the new ReadSTARsolo function, but still get the same result.

> control_em = ReadSTARsolo("../starsolo_out/ControlASolo.out/Gene/raw/EM")
Error in readMM(file = all.files[["expression matrix"]]) : 
  element type 'float' not recognized

And here is the top rows in the file UniqueAndMult-EM.mtx.gz

%%MatrixMarket matrix coordinate float general
%
25110 6794880 18644153
9793 8 1
9979 16 1
5029 23 1
5031 23 1
6323 23 1
9780 23 1
11862 23 1
25101 23 1
21038 33 1
22510 50 1
5029 55 1
6308 55 1
5027 62 1
5028 71 0.5
5031 71 0.5
5029 73 1.99219
5032 73 0.0078125
13923 73 1
22727 73 1.99219
22729 73 0.0078125

The Barcodes.stat is:

                                        nNoAdapter              0
                                            nNoUMI              0
                                             nNoCB              0
                                            nNinCB             41
                                           nNinUMI        1408113
                                   nUMIhomopolymer          93279
                                          nTooMany              0
                                          nNoMatch        7936486
                               nMismatchesInMultCB              0
                                       nExactMatch      272044289
                                    nMismatchOneWL        3242653
                                 nMismatchToMultWL       13643072

And the command I use to run STARsolo is:

        STAR \
        --genomeDir ${star_ref} \
        --readFilesIn ${fastq_input_dir}/${s}_*_R2_*.fastq.gz ${fastq_input_dir}/${s}_*_R1_*.fastq.gz \
        --readFilesCommand zcat \
        --soloType CB_UMI_Simple \
        --soloCBwhitelist ${barcode_whitelist_3Pv3} \
        --soloUMIlen 12 \
        --clipAdapterType CellRanger4 --outFilterScoreMin 30 \
        --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \
        --soloUMIfiltering - \
        --soloUMIdedup 1MM_CR \
        --soloMultiMappers Uniform PropUnique EM Rescue \
        --outFileNamePrefix ${out_dir}/${s} \
        --runThreadN ${thread}

I wonder if the STARsolo result with --soloMultiMappers parameters is compatible with Seurat? Is there a way I can get through this error?

Much appreciated,

Marcus

alexdobin commented 3 years ago

Hi Marcus,

the multimappers are included probabilistically, that's why the "counts" become non-integer. Ideally, the downstream software should import real numbers, since at some point the counts are normalized and become real numbers. It would be good to ask Seurat developers for such a feature. In the meantime you can simply round the values to integers, e.g. awk '{if (NR>3) $3=int($3+0.5); print}' matrix.mtx > matrix_rounded.mtx Note, that you will also need to perform cell filtering (calling) on the rounded matrix, since at the moment multimap options only output raw matrix.

Cheers Alex

davhum commented 3 years ago

Timely thread as had similar issue today. Small adjustment to Alex's short term solution which also amends mtx header:

awk '{if (NR>3) $3=int($3+0.5); print}' matrix.mtx | sed 's/float/integer/' > matrix_rounded.mtx

Cheers, D

bugneville commented 3 years ago

I brought this up in #1369 -- you appear to be able to simply change the header at the top of the file from float to real with no obvious negative impact?: change %%MatrixMarket matrix coordinate float general to %%MatrixMarket matrix coordinate real general

This way, you do not need to round the inputs.

alexdobin commented 3 years ago

Hi David, @bugneville

thanks for the correction. The matrix market format indeed specifies "real" not "float" label. I will fix it in the next release.

Cheers Alex