Misspecification gtf in STAR_outline MWE of RNA_calling,

cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium

https://cumc.github.io/xqtl-protocol/

MIT License

38 stars 42 forks source link

Misspecification gtf in STAR_outline MWE of RNA_calling, #390

Open hsun3163 opened 2 years ago

hsun3163 commented 2 years ago

Originally, to generate the STAR_Index and RSEM_index, the same gtf, the gtf prior to collapse by gene should be used. However, as we decided to do STAR_indexing on the fly, it slip my mind and gave a wrong example in the RNA_calling pipeline test MWE reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf.

Therefore, all the Aligned.toTranscript.bam that were generated by Traves and Shrithee can not be used to conduct RSEM analysis.

Luckily, the main expression matrix we used is the one from rnaseqc, based on the Aligned.byCoordinated.bam, which should not be impacted by the gene vs transcriptome.

To fix it, STAR_align (Not the star_output since that will overwrite the existing md.bam which will take double the time) needs to be rerun to regenerate the Aligned.toTranscript.bam file (edited)

The consequence is that the RSEM bed table will not be generated on time

hsun3163 commented 2 years ago

The error message this will produce is: Warning: The SAM/BAM file declares less reference sequences (58678) than RSEM knows (234381)! and RSEM can not recognize reference sequence name ENSG00000223972!

hsun3163 commented 2 years ago

Beside the toTranscriptome.bam, another file that will be impacted is the .SJ.out.tab which will be needed by psichomics to generate the sQTL.

hs3163@node65:/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing$ wc output/rnaseq_arch/22.SJ.out.tab
 12643 113787 477595 output/rnaseq_arch/22.SJ.out.tab
hs3163@node65:/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing$ wc output/rnaseq/Sample_22.SJ.out.tab
 13191 118719 497187 output/rnaseq/Sample_22.SJ.out.tab

in principle however, the alternative to psichomics, leafcutter, shall not be impacted for it used the Aligned.byCoordinate.bam. Which should be independent of the gtf.

@Rhopala could you update the ticket with supporting evidence while u have it?

Shrishtee-kandoi commented 2 years ago

So, toTranscriptome and SJ.out.tab are not to be used at the moment?

hsun3163 commented 2 years ago

So, toTranscriptome and SJ.out.tab are not to be used at the moment?

Unfortunately, yes

gaow commented 2 years ago

@Shrishtee-kandoi let's focus on generating results for leafcutter for the upcoming release.

Shrishtee-kandoi commented 2 years ago

Sure, yes !

hsun3163 commented 2 years ago

As it turns out, the output of the leafcutter is indeed impacted. Am verifying whether the result for RNAseq will be. It really should not because rnaseqc uses the same gene.gtf anyway.

Rhopala commented 2 years ago

After testing we found the Aligned.byCoordinated.bam files are actually impacted. Only tiny differences are observed in the bamfiles such as: The intron usage ratio could have a ±1 for numerator and denominators: and cause extensive value change & row number change in final phenotype table:

hsun3163 commented 2 years ago

Let's call the what we have before using the collapsed gtf the impacted bam and then the good bam

The difference in the gene_count table for gene expression. 1298/1372 from the impacted bam and 1304/1382 from the good bam are correctly mapped to chrom 21 and 22. 14 out of 1294 genes have a correlation < 0.9. In summary, there are not huge differences but there are differences.