New method for EM quantification

osvaldoreisss commented 1 year ago

Dear Alex Dobin.

First of all, thanks for this fantastic tool and all your work during these years to update it.

Recently, some colleagues attended the Biological Data Science 2022 conference and commented on the new EM algorithm for counting multi-gene reads in the STAR aligner, which should work well for 3'-biased sequencing such as 10x single-cell. I have some questions that I hope you could address:

Is this new algorithm already released in some new version of STAR?

Does it only work with 3'-biased sequencing, or should it work for full-coverage RNA-seq as well?

As I understand it, the method uses the reads uniquely assigned to a transcript to calculate the distribution for a given dataset, and the EM algorithm then assigns the multi-reads to specific transcripts according to this distribution. But in the end we have quantification at the gene level. Is there any plan for a method for transcript-level quantification?

Is there any public documentation about this method?
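To make the understanding described above concrete, here is a minimal Python sketch of a generic EM that splits multi-gene reads in proportion to current abundance estimates, seeded from the uniquely assigned reads. This is not STAR's implementation, which is not described in this thread; the function name `em_gene_counts`, the input layout, and the pseudocount are all assumptions made for illustration. It works at the gene level and ignores gene length, which is only a reasonable simplification for 3'-biased data.

```python
from collections import Counter


def em_gene_counts(unique_counts, multi_reads, n_iter=100, tol=1e-9):
    """Distribute multi-gene reads among genes with a simple EM.

    unique_counts: dict mapping gene -> number of uniquely assigned reads.
    multi_reads:   iterable of gene collections, one per multi-gene read.
    Returns a dict mapping gene -> expected count (unique + fractional share).
    Hypothetical helper for illustration; not part of STAR.
    """
    genes = set(unique_counts)
    for read_genes in multi_reads:
        genes.update(read_genes)
    if not genes:
        return {}

    # Collapse reads with identical gene sets into classes with a multiplicity.
    classes = Counter(tuple(sorted(set(r))) for r in multi_reads)

    # Initialise from the unique counts; the small pseudocount (an assumption)
    # lets genes without unique reads still receive a share.
    est = {g: unique_counts.get(g, 0) + 1e-3 for g in genes}

    for _ in range(n_iter):
        # Unique reads are fixed; only multi-gene reads are redistributed.
        new = {g: float(unique_counts.get(g, 0)) for g in genes}
        for gene_set, n_reads in classes.items():
            total = sum(est[g] for g in gene_set)
            for g in gene_set:
                # E-step share of this class, accumulated into the M-step counts.
                new[g] += n_reads * est[g] / total
        delta = max(abs(new[g] - est[g]) for g in genes)
        est = new
        if delta < tol:
            break
    return est
```

For example, with 90 unique reads on gene A, 10 on gene B, and 100 reads compatible with both, the estimates settle at a 9:1 split of the shared reads, following the ratio of the unique counts.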

Thanks a lot.

Best regards,

alexdobin commented 1 year ago

Hi Osvaldo, the new method is not released publicly yet; we hope to release it by the end of January. Initially, it will only work with the 3'-biased technologies, but we will likely extend it to other cases. This method will produce gene expression directly without calculating transcript expression. We are also working on another algorithm that will produce transcript expression, with an ETA of ~6 months.

cnk113 commented 1 year ago

Hey Alex,

I was wondering if this module/update will be out soon? I have a dataset with a really high MM rate, and I wonder what improvements I would see with this new method.

Thanks, Chang

alexdobin commented 1 year ago

Hi Chang,

We are actively working on it and hope to release it within 1-2 months. In simulations, we see a substantial reduction of error (~2-fold) for multi-gene reads.

DarioS commented 1 year ago

Hopefully, the user interface will make it easy to quantify highly polymorphic regions of the genome, such as the HLA and KIR gene families. I have been using my own workaround: masking those regions in the reference genome sequence with Ns, then doing a second pass with the unmapped FASTQ files against the IMGT/HLA sequence database, followed by RSEM. The hack I devised has perfect concordance with Sanger sequencing for samples with a ground truth, so something similar in STAR would be great.
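The masking step of the workaround above could look roughly like the Python sketch below. It is only an illustration of replacing intervals with Ns in one sequence, not the actual script used; the function name `mask_regions` and the 0-based, half-open coordinate convention are assumptions, and real HLA/KIR coordinates would have to come from the annotation for the genome build in use.

```python
def mask_regions(sequence, regions):
    """Return `sequence` with each 0-based, half-open [start, end) interval set to N.

    Hypothetical helper for illustration; coordinates must be supplied by the caller.
    """
    masked = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        # Replace the interval with the same number of N characters,
        # so downstream coordinates are unchanged.
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy example on a 10-base sequence.
print(mask_regions("ACGTACGTAC", [(2, 5)]))  # ACNNNCGTAC
```

Keeping the sequence length unchanged means existing annotation coordinates remain valid for the masked reference.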

alexdobin / STAR

New method for EM quantification #1721