bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Identical miRNA read counts versus multimapping #2893

Closed mxhp75 closed 4 years ago

mxhp75 commented 5 years ago

Hi @lpantano

I'm trying to understand a little better how the bcbio-nextgen smallRNA pipeline deals with multi-mapping sequences. I have a dataset of 96 smallRNA-seq samples (QIAGEN), processed through the bcbio-nextgen smallRNA pipeline from fastq.gz files through to counts tables using default parameters, including adapter trimming. In each of my 96 samples, a number of mature miRNAs (from the C19MC, with identical mature sequences) have identical counts. For example, my raw counts per million table looks something like this:

,sample1,sample2,sample3,sample...n
hsa-miR-519b-5p,1125,1395,1728,1525
hsa-miR-519c-5p,1125,1395,1728,1525
hsa-miR-522-5p,1125,1395,1728,1525

I have checked the miRBase database and can see that the mature sequence is identical for the miRNAs in question, and I understand that such short reads would easily map to all of these locations. However, from reading the documentation, I thought the bcbio-nextgen smallRNA pipeline would avoid counting these transcripts in multiple locations.

Are you able to shed any light on this? My main concern is the inflation of the raw library size and its effect on normalisation, as well as simply reporting miRNAs that are not actually there, especially because these are among the most highly expressed miRNAs in my samples.

Any support in this matter is appreciated.

Kind regards

Melanie

lpantano commented 5 years ago

Hi Melanie,

This is expected behaviour, although I have to say very few miRNAs end up like this.

The tool that does not count sequences multiple times is seqcluster, but it is intended as a general characterization tool, so it is better suited to studying small RNAs other than miRNAs.

This is one of the few cases where miRNAs are named differently even though they have the same mature sequence. In other cases, different precursors generate the same mature miRNA, but the name is then the same.

I am aware of this, and the miRTop group is working on it. In the meantime, if several miRNAs have exactly the same values, you can keep only one of them when you do the DE analysis.
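A minimal sketch of that workaround, in case it helps. It is not part of bcbio or seqcluster: the function name `collapse_identical_rows` is hypothetical, it assumes the counts table has already been read into a dict of miRNA name to per-sample counts, and the fourth row is made up purely to show a miRNA that is left alone.

```python
def collapse_identical_rows(rows):
    """Merge miRNAs whose counts are identical in every sample.

    rows: dict mapping miRNA name -> sequence of counts (one per sample).
    Returns a dict mapping a "/"-joined name -> counts, in first-seen order.
    """
    groups = {}
    for name, counts in rows.items():
        counts = tuple(counts)
        # All-zero rows match each other trivially, so key them by name
        # to keep unrelated, unexpressed miRNAs separate.
        key = counts if any(counts) else (name,)
        groups.setdefault(key, []).append((name, counts))

    collapsed = {}
    for members in groups.values():
        names = [name for name, _ in members]
        # Keep one copy of the counts and record every name that shared them.
        collapsed["/".join(names)] = members[0][1]
    return collapsed


# First three rows are the values from the question above.
rows = {
    "hsa-miR-519b-5p": (1125, 1395, 1728, 1525),
    "hsa-miR-519c-5p": (1125, 1395, 1728, 1525),
    "hsa-miR-522-5p": (1125, 1395, 1728, 1525),
    "hypothetical-miR-A": (10, 20, 30, 40),  # made-up row, illustration only
}

collapsed = collapse_identical_rows(rows)
for name, counts in collapsed.items():
    print(name, counts)
```

Joining the names keeps the report honest about which annotations share the reads. Identical counts can also occur by coincidence, so it is worth confirming in miRBase that the merged miRNAs really do have the same mature sequence before dropping the duplicates.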

Otherwise, I hope we can fix this during the summer.

Thanks for the email!

mxhp75 commented 5 years ago

Hi Lorena

Thank you for the prompt reply, and my apologies for conflating the two separate tools in the pipeline. Your help is always appreciated.

Kindest regards

Melanie