alexdobin / STAR

RNA-seq aligner

Results from EM method: which output matrix should be used? #1752

Open ColeWunderlich opened 1 year ago

ColeWunderlich commented 1 year ago

Hello,

I was wondering which matrix should be used for final quantification results when running STARsolo with the EM mode enabled.

I noticed in the raw directory there is a UniqueAndMult-EM.mtx file that has fractional counts which one would expect from EM. In the filtered directory, however, there is only one matrix and all of the counts appear to be integer.

Does the final filter matrix reflect the incorporation of the EM results with some sort of rounding applied? Or is the EM output only reflected in the raw/UniqueAndMult-EM.mtx and the filter results are unique counts only?

alexdobin commented 1 year ago

Hi Cole,

Cell filtering (a.k.a. cell calling) is not done for the multi-gene outputs. You can do it with a separate STAR command, as explained here: https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md#cell-filtering-of-previously-generated-raw-matrix

Another option is to simply use the cell barcodes that were called for unique mappers, which should not make a big difference.

ColeWunderlich commented 1 year ago

Hey Alex,

Thanks for the reply! I am not necessarily worried about cell calling but rather (final) quantification. The fact you are pointing me toward cell calling, however, may have revealed my misunderstanding. Would you say the following is true (when full EM has been run)?

Solo.out/<method>/raw

Solo.out/<method>/filtered

If the above is true, is there a way to get STARsolo to do the filtering on the UniqueAndMult-EM.mtx? The command you linked to just has the option to supply the raw directory but not to specify which matrix to use.

I tried making a new folder and using symlinks so that UniqueAndMult-EM.mtx was renamed to matrix.mtx but the re-calling didn't work. For some reason it returned only 140 cells despite the normal matrix.mtx returning ~5k cells.

alexdobin commented 1 year ago

Hi Cole,

These statements are correct.

CBs for each matrix have been corrected? (not sure on this one)

Yes by default, controlled by the --soloCBmatchWLtype option.

Uncorrected CB (or ones that failed to correct) are not reported? (also not sure)

They can be reported in the BAM output, but not in the count matrix.

The filtering works in my examples - however, I realized that it outputs the matrix rounded to integers, unlike the original EM matrix that contains non-integer values. I will need to fix this, but at the moment the simplest way is to use the filtered cells based on unique counts, and extract them from the EM matrix.
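The subsetting workaround described above could be sketched roughly as follows. This is only an illustration, not part of STAR: the function name `subset_em_matrix` is hypothetical, and it assumes uncompressed files, a `barcodes.tsv` in both the raw and filtered directories, and matrix columns that follow the order of `raw/barcodes.tsv`. Values are copied as text, so the fractional EM counts are not rounded.

```python
from pathlib import Path


def subset_em_matrix(raw_dir, filtered_dir, out_mtx, em_name="UniqueAndMult-EM.mtx"):
    """Keep only the columns of the raw EM matrix whose barcodes were called as cells.

    Hypothetical helper (not a STAR feature). Assumes MatrixMarket coordinate
    format with features as rows and barcodes as columns.
    """
    raw_dir, filtered_dir = Path(raw_dir), Path(filtered_dir)
    raw_barcodes = (raw_dir / "barcodes.tsv").read_text().split()
    kept_barcodes = (filtered_dir / "barcodes.tsv").read_text().split()

    # Map each kept barcode's 1-based raw column to its 1-based column in the subset.
    raw_col = {bc: i for i, bc in enumerate(raw_barcodes, start=1)}
    new_col = {raw_col[bc]: j for j, bc in enumerate(kept_barcodes, start=1)}

    entries = []
    n_rows = None
    with open(raw_dir / em_name) as fh:
        header = fh.readline().rstrip("\n")  # "%%MatrixMarket ..." banner line
        for line in fh:
            if line.startswith("%") or not line.strip():
                continue  # skip comment and blank lines
            fields = line.split()
            if n_rows is None:
                n_rows = int(fields[0])  # size line: rows cols nnz
                continue
            row, col, value = int(fields[0]), int(fields[1]), fields[2]
            if col in new_col:
                entries.append((row, new_col[col], value))

    with open(out_mtx, "w") as out:
        out.write(header + "\n")
        out.write(f"{n_rows} {len(kept_barcodes)} {len(entries)}\n")
        for row, col, value in entries:
            out.write(f"{row} {col} {value}\n")
    return len(entries)
```

It would be called with the two directories from the same run, for example `subset_em_matrix("Solo.out/<method>/raw", "Solo.out/<method>/filtered", "UniqueAndMult-EM.filtered.mtx")`. Rows are left untouched, so the existing features file still applies, and the column order of the output follows `filtered/barcodes.tsv`.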

ColeWunderlich commented 1 year ago

Hey Alex,

Thanks for getting back to me. I will have to double check how I was trying to filter the EM matrix, but will go with subsetting for now.

Also, just to make sure, the <raw|filtered>/matrix.mtx contains only unique counts (i.e., derived from only unique reads), right?

alexdobin commented 1 year ago

Hi Cole,

Also, just to make sure, the <raw|filtered>/matrix.mtx contains only unique counts (i.e., derived from only unique reads), right?

That's correct.
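For anyone wanting to confirm which kind of matrix they have on disk, a quick check along these lines may help. It is a minimal sketch with a hypothetical function name, assuming an uncompressed MatrixMarket coordinate file: a matrix of unique counts should report zero fractional entries, while the EM matrix would be expected to report some.

```python
def count_fractional_entries(mtx_path):
    """Return how many stored values in a MatrixMarket coordinate file are non-integer.

    Hypothetical helper for inspection only; assumes an uncompressed file.
    """
    fractional = 0
    size_seen = False
    with open(mtx_path) as fh:
        for line in fh:
            if line.startswith("%") or not line.strip():
                continue  # skip banner, comments, and blank lines
            if not size_seen:
                size_seen = True  # first data line is the "rows cols nnz" size line
                continue
            value = float(line.split()[2])
            if not value.is_integer():
                fractional += 1
    return fractional
```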