broadinstitute / gtex-pipeline

GTEx & TOPMed data production and analysis pipelines
BSD 3-Clause "New" or "Revised" License
342 stars 174 forks source link

Multi mapped reads #42

Closed gouthamatla closed 2 months ago

gouthamatla commented 4 years ago

Hi,

I found that the outFilterMultimapNmax is set to 20 in STAR alignments. I could not find anywhere in the code that only uniquely mapped reads are considered for downstream analysis of splicing. May I know which step actually retains uniquely mapped reads only ? For example, to generate leafcutter junction files, I could not find if only uniquely mapped reads are considered ? Is is the WASP sam tag vW:1 that does extract only uniquely mapped reads ?

For rnaseqc, the default value of --mapping-quality is 255, so presumably, for gene expression quantifications, rnaseqc is considering only uniquely mapped reads through --mapping-quality set to 255.

francois-a commented 4 years ago

Hi,

Only unique mapping reads were used for analyses (based on MAPQ, as you pointed out). The WASP tags are only added to variant-containing reads.

For LeafCutter, see https://davidaknowles.github.io/leafcutter/articles/Usage.html (note that the initial step was replaced with regtools: https://github.com/davidaknowles/leafcutter/blob/master/example_data/worked_example.sh)

gouthamatla commented 4 years ago

Thanks. From the script, I understood that bam2junc from leafcutter used for junctions, not the regtools. https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/leafcutter/leafcutter_bam_to_junc.wdl

Sorry if I’m mistaken but I don’t find anywhere how uniquely mapped reads were used as the initial STAR alignments used —outFilterMultimapNmax 20. Which step of the pipeline filters out multimapped reads ?

I want to analyze my data together with GTEX, so I’m trying to be as consistent as possible with GTEX processing, to make it comparable.

The command used to generate junction files, doesn't seem to filter for uniquely mapped reads.

https://github.com/broadinstitute/gtex-pipeline/blob/fe76269675895beadeac212e807d72ed7e2277d0/qtl/leafcutter/leafcutter_bam_to_junc.wdl#L19

Leafcutter scripts doesn't use MAPQ filter intrinsically while calculating junction counts. So I am curious if multimapped reads were removed or not for sQTL analysis.

RNASEQC on the other hand, has a default filter for MAPQ 255, so gene expression calculations might not have been effected by multimapped reads.