Xinglab / rmats-turbo

Other
208 stars 48 forks source link

the sequencing library is vague #412

Open xiaobearxiaobear opened 1 week ago

xiaobearxiaobear commented 1 week ago

My RNA-seq data were downloaded from the Internet, and the authors did not elaborate on the type of library built. for --libtype parameter, i try two ways. one: python3 /public/home/xiaoxiong/miniconda3/envs/libgsl/rMATS/rmats.py --b1 b1.txt --b2 b2.txt --gtf /public/home/xiaoxiong/hTSC_STB96_RNA_seq/hTSC_STB96_rMATs_Ensembol/STB72_STB96/genes.gtf -t paired --libType fr-firststrand --readLength 100 --nthread 4 --od AS_libgsl_firststrand --tmp AS_libgsl_firststrand --variable-read-length gtf: 153.78015661239624 There are 36601 distinct gene ID in the gtf file There are 199138 distinct transcript ID in the gtf file There are 14315 one-transcript genes in the gtf file There are 1305354 exons in the gtf file There are 5603 one-exon transcripts in the gtf file There are 3566 one-transcript genes with only one exon in the transcript Average number of transcripts per gene is 5.440780 Average number of exons per transcript is 6.555022 Average number of exons per transcript excluding one-exon tx is 6.715845 Average number of gene per geneGroup is 7.854318 statistic: 0.42818236351013184

read outcome totals across all BAMs USED: 583925661 NOT_PAIRED: 5064400 NOT_NH_1: 270852260 NOT_EXPECTED_CIGAR: 6728411 NOT_EXPECTED_READ_LENGTH: 0 NOT_EXPECTED_STRAND: 0 EXON_NOT_MATCHED_TO_ANNOTATION: 45292289 JUNCTION_NOT_MATCHED_TO_ANNOTATION: 1269436 CLIPPED: 65970785 total: 979103242 outcomes by BAM written to: AS_libgsl_firststrand/2024-06-19-13_12_09_367167_read_outcomes_by_bam.txt

novel: 2397.6510169506073 The splicing graph and candidate read have been saved into AS_libgsl_firststrand/2024-06-19-13_12_09367167*.rmats save: 28.220320224761963 loadsg: 17.720765352249146

========== Done processing each gene from dictionary to compile AS events Found 89553 exon skipping events Found 11671 exon MX events Found 18472 alt SS events There are 11225 alt 3 SS events and 7247 alt 5 SS events. Found 6930 RI events

ase: 5.3955464363098145 count: 153.7568280696869 Processing count files. Done processing count files. 4773f9cf07b10c90bd82196e3074a97 4

another python3 /public/home/xiaoxiong/miniconda3/envs/libgsl/rMATS/rmats.py --b1 b1.txt --b2 b2.txt --gtf /public/home/xiaoxiong/hTSC_STB96_RNA_seq/hTSC_STB96_rMATs_Ensembol/hTSC_STB72/genes.gtf -t paired --libType fr-secondstrand --readLength 100 --nthread 4 --od AS_libgsl_second --tmp AS_libgsl_second --variable-read-length gtf: 1452.7515552043915 There are 36601 distinct gene ID in the gtf file There are 199138 distinct transcript ID in the gtf file There are 14315 one-transcript genes in the gtf file There are 1305354 exons in the gtf file There are 5603 one-exon transcripts in the gtf file There are 3566 one-transcript genes with only one exon in the transcript Average number of transcripts per gene is 5.440780 Average number of exons per transcript is 6.555022 Average number of exons per transcript excluding one-exon tx is 6.715845 Average number of gene per geneGroup is 7.854318 statistic: 0.08252692222595215

read outcome totals across all BAMs USED: 54658238 NOT_PAIRED: 5238074 NOT_NH_1: 266573095 NOT_EXPECTED_CIGAR: 6915951 NOT_EXPECTED_READ_LENGTH: 0 NOT_EXPECTED_STRAND: 0 EXON_NOT_MATCHED_TO_ANNOTATION: 443321716 JUNCTION_NOT_MATCHED_TO_ANNOTATION: 131377399 CLIPPED: 64958367 total: 973042840 outcomes by BAM written to: AS_libgsl_second/2024-06-19-13_19_27_769310_read_outcomes_by_bam.txt

novel: 2378.783494949341 The splicing graph and candidate read have been saved into AS_libgsl_second/2024-06-19-13_19_27769310*.rmats save: 3.096289873123169 loadsg: 1.7018141746520996

========== Done processing each gene from dictionary to compile AS events Found 47791 exon skipping events Found 3302 exon MX events Found 16193 alt SS events There are 9849 alt 3 SS events and 6344 alt 5 SS events. Found 6783 RI events

ase: -3.3512372970581055 count: 5.232464551925659 Processing count files. Done processing count files. 12 3

so, I don't know if that is the right method, or if I should try it again -libType fr-unstranded. do you have a good suggestion?

EricKutschera commented 1 week ago

When you ran with --libType fr-firststrand 583 million alignments were used:

USED: 583925661
NOT_PAIRED: 5064400
NOT_NH_1: 270852260
NOT_EXPECTED_CIGAR: 6728411
NOT_EXPECTED_READ_LENGTH: 0
NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 45292289
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 1269436
CLIPPED: 65970785
total: 979103242

When you ran with --libType fr-secondstrand only 54 million alignments were used:

USED: 54658238
NOT_PAIRED: 5238074
NOT_NH_1: 266573095
NOT_EXPECTED_CIGAR: 6915951
NOT_EXPECTED_READ_LENGTH: 0
NOT_EXPECTED_STRAND: 0
EXON_NOT_MATCHED_TO_ANNOTATION: 443321716
JUNCTION_NOT_MATCHED_TO_ANNOTATION: 131377399
CLIPPED: 64958367
total: 973042840

The main differences are about 400 million EXON_NOT_MATCHED_TO_ANNOTATION and 130 million JUNCTION_NOT_MATCHED_TO_ANNOTATION when run with --libType fr-secondstrand. If the library was actually unstranded then I would expect to see similar numbers of NOT_MATCHED_TO_ANNOTATION for fr-firststrand and fr-secondstrand. Based on the output it seems that your data should be run with --libType fr-firststrand