Open hsun3163 opened 2 years ago
The error message this will produce is:
Warning: The SAM/BAM file declares less reference sequences (58678) than RSEM knows (234381)!
and
RSEM can not recognize reference sequence name ENSG00000223972!
Beside the toTranscriptome.bam, another file that will be impacted is the .SJ.out.tab
which will be needed by psichomics to generate the sQTL.
hs3163@node65:/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing$ wc output/rnaseq_arch/22.SJ.out.tab
12643 113787 477595 output/rnaseq_arch/22.SJ.out.tab
hs3163@node65:/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing$ wc output/rnaseq/Sample_22.SJ.out.tab
13191 118719 497187 output/rnaseq/Sample_22.SJ.out.tab
in principle however, the alternative to psichomics, leafcutter, shall not be impacted for it used the Aligned.byCoordinate.bam. Which should be independent of the gtf.
@Rhopala could you update the ticket with supporting evidence while u have it?
So, toTranscriptome and SJ.out.tab are not to be used at the moment?
So, toTranscriptome and SJ.out.tab are not to be used at the moment?
Unfortunately, yes
@Shrishtee-kandoi let's focus on generating results for leafcutter for the upcoming release.
Sure, yes !
As it turns out, the output of the leafcutter is indeed impacted. Am verifying whether the result for RNAseq will be. It really should not because rnaseqc uses the same gene.gtf anyway.
After testing we found the Aligned.byCoordinated.bam
files are actually impacted. Only tiny differences are observed in the bamfiles such as:
The intron usage ratio could have a ±1 for numerator and denominators:
and cause extensive value change & row number change in final phenotype table:
Let's call the what we have before using the collapsed gtf the impacted bam and then the good bam
The difference in the gene_count table for gene expression. 1298/1372 from the impacted bam and 1304/1382 from the good bam are correctly mapped to chrom 21 and 22. 14 out of 1294 genes have a correlation < 0.9. In summary, there are not huge differences but there are differences.
Originally, to generate the STAR_Index and RSEM_index, the same gtf, the gtf prior to collapse by gene should be used. However, as we decided to do STAR_indexing on the fly, it slip my mind and gave a wrong example in the RNA_calling pipeline test MWE
reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf
.Therefore, all the
Aligned.toTranscript.bam
that were generated by Traves and Shrithee can not be used to conduct RSEM analysis.Luckily, the main expression matrix we used is the one from rnaseqc, based on the
Aligned.byCoordinated.bam
, which should not be impacted by the gene vs transcriptome.To fix it, STAR_align (Not the star_output since that will overwrite the existing md.bam which will take double the time) needs to be rerun to regenerate the Aligned.toTranscript.bam file (edited)
The consequence is that the RSEM bed table will not be generated on time