broadinstitute / gtex-pipeline

GTEx & TOPMed data production and analysis pipelines
BSD 3-Clause "New" or "Revised" License

What's the difference between gencode.vXX.GRCh38.genes.gtf and gencode.vXX.GRCh38.genes.collapsed_only.gtf? #57

Closed by xiyasong 3 years ago

xiyasong commented 3 years ago

Hi! I am a bit confused about which genes_gtf file should be used for gene-level quantification with RNA-SeQC and for eQTL analysis with FastQTL. Is genes.gtf different from genes.collapsed_only.gtf?

Is it correct to use gencode.v26.GRCh38.genes.collapsed_only.gtf when running RNA-SeQC and gencode.v26.GRCh38.genes.gtf when running the eQTL pipeline (I want to use the same GENCODE v26 annotation that GTEx V8 used)? It seems I should use the collapse_only mode for RNA-SeQC, as described in TOPMed_RNAseq_pipeline.md, but at the end of that file, the section "Appendix: wrapper scripts from the GTEx pipeline" says: "genes_gtf: path to the collapsed, gene-level GTF (gencode.v30.GRCh38.ERCC.genes.gtf as described above)".

Also, does including ERCC or not affect the results? Thank you for your help!

francois-a commented 3 years ago

Hi, this depends on whether your RNA-seq data were generated with a stranded or unstranded protocol. The "collapsed_only" version should be used for stranded data only (see https://github.com/broadinstitute/gtex-pipeline/blob/master/gene_model/collapse_annotation.py for how it is generated). In general, I recommend using the same annotation and version for all analyses. ERCC are spike-in controls; if an annotation that includes them is used with data generated without these controls, the ERCC counts should be zero and won't affect the results.
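To see concretely how the two annotation files differ, one option is to compare the exon intervals each file assigns to every gene. The following is a minimal sketch, not part of gtex-pipeline: the function names `exon_intervals_by_gene` and `genes_with_different_exons` are hypothetical, and it assumes standard tab-separated GTF lines with a `gene_id "..."` attribute. It does not reproduce what collapse_annotation.py does; it only reports which genes have different exon coordinates in the two files.

```python
import re
from collections import defaultdict

# Matches the gene_id attribute in the 9th GTF column.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def exon_intervals_by_gene(gtf_lines):
    """Map gene_id -> sorted list of (start, end) exon intervals."""
    intervals = defaultdict(set)
    for line in gtf_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Only exon features are compared here.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_ID_RE.search(fields[8])
        if match:
            intervals[match.group(1)].add((int(fields[3]), int(fields[4])))
    return {gene: sorted(ivs) for gene, ivs in intervals.items()}


def genes_with_different_exons(gtf_a_lines, gtf_b_lines):
    """Return sorted gene IDs whose exon intervals differ between two GTFs."""
    a = exon_intervals_by_gene(gtf_a_lines)
    b = exon_intervals_by_gene(gtf_b_lines)
    return sorted(g for g in a.keys() | b.keys() if a.get(g) != b.get(g))
```

Since open file handles iterate over lines, the two files can be passed directly, for example `genes_with_different_exons(open("gencode.v26.GRCh38.genes.gtf"), open("gencode.v26.GRCh38.genes.collapsed_only.gtf"))`, assuming both files are in the working directory.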

xiyasong commented 3 years ago

Hi! Thank you for your explanation! So if I understand correctly, I should use gencode.vXX.GRCh38.genes.gtf for unstranded RNA-seq data, right?

francois-a commented 3 years ago

Yes, that's correct.
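The rule confirmed in this exchange can be written down as a small helper, which may be handy when scripting a pipeline. This is only a sketch: the function name `select_genes_gtf` and its parameters are hypothetical and not part of gtex-pipeline, and the file name patterns are the ones quoted in this thread.

```python
def select_genes_gtf(stranded: bool, version: str = "v26") -> str:
    """Pick the gene-level GTF file name following the rule from this thread.

    Stranded data   -> genes.collapsed_only.gtf
    Unstranded data -> genes.gtf
    """
    suffix = "genes.collapsed_only.gtf" if stranded else "genes.gtf"
    return f"gencode.{version}.GRCh38.{suffix}"
```

Whichever file is selected, the same annotation version should then be passed to every downstream step, in line with the recommendation above.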

xiyasong commented 3 years ago

Thank you very much!