Xinglab / rmats-turbo

Other
228 stars 55 forks source link

Error running rmats -using STAR coordinate-sorted bam files #424

Open desmodus1984 opened 3 months ago

desmodus1984 commented 3 months ago

Hi,

I am interested in assessing differential splicing between two populations. I got PE-50 stranded mRNA libraries, which were mapped with STAR, and I thought on using the output bam files to run rmats. As stated I created the txt files with the bam files;

ctrl.Low.txt

123.V00796.L1Aligned.sortedByCoord.out.bam 45.V00001Aligned.sortedByCoord.out.bam 5.V00304.L1Aligned.sortedByCoord.out.bam 66.V01425Aligned.sortedByCoord.out.bam 68.V0356.R1Aligned.sortedByCoord.out.bam 86.V00750.L1Aligned.sortedByCoord.out.bam

my txt files are ctrl.H.txt and ctrl.Low.txt, and when I ran rmats:

rmats.py --b1 ctrl.H.txt --b2 ctrl.Low.txt --gtf ../GCF_011952255.1_Bvos_JDL3184-5_v1.1_genomic.gtf -t paired --readLength 50 --nthread 12 --od rmats-test --tmp temp

I got an error:

gtf: 3.5072505474090576 There are 12973 distinct gene ID in the gtf file There are 27041 distinct transcript ID in the gtf file There are 7807 one-transcript genes in the gtf file There are 236878 exons in the gtf file There are 859 one-exon transcripts in the gtf file There are 850 one-transcript genes with only one exon in the transcript Average number of transcripts per gene is 2.084406 Average number of exons per transcript is 8.759957 Average number of exons per transcript excluding one-exon tx is 9.014552 Average number of gene per geneGroup is 17.933534 statistic: 0.0022149085998535156 Fail to open 101.V01347Aligned.sortedByCoord.out.bam 120.V02191.L1Aligned.sortedByCoord.out.bam 20.V02055.L1Aligned.sortedByCoord.out.bam 44.V01947.L1Aligned.sortedByCoord.out.bam 61.V01712.L1Aligned.sortedByCoord.out.bam 89.V01926.R1Aligned.sortedByCoord.out.bam: BamReader::Open: could not open file: 101.V01347Aligned.sortedByCoord.out.bam 120.V02191.L1Aligned.sortedByCoord.out.bam 20.V02055.L1Aligned.sortedByCoord.out.bam 44.V01947.L1Aligned.sortedByCoord.out.bam 61.V01712.L1Aligned.sortedByCoord.out.bam 89.V01926.R1Aligned.sortedByCoord.out.bam BgzfStream::Open: could not open BGZF stream: BamFile::Open: could not open file handle for 101.V01347Aligned.sortedByCoord.out.bam 120.V02191.L1Aligned.sortedByCoord.out.bam 20.V02055.L1Aligned.sortedByCoord.out.bam 44.V01947.L1Aligned.sortedByCoord.out.bam 61.V01712.L1Aligned.sortedByCoord.out.bam 89.V01926.R1Aligned.sortedByCoord.out.bam Fail to open 123.V00796.L1Aligned.sortedByCoord.out.bam 45.V00001Aligned.sortedByCoord.out.bam 5.V00304.L1Aligned.sortedByCoord.out.bam 66.V01425Aligned.sortedByCoord.out.bam 68.V0356.R1Aligned.sortedByCoord.out.bam 86.V00750.L1Aligned.sortedByCoord.out.bam: BamReader::Open: could not open file: 123.V00796.L1Aligned.sortedByCoord.out.bam 45.V00001Aligned.sortedByCoord.out.bam 5.V00304.L1Aligned.sortedByCoord.out.bam 66.V01425Aligned.sortedByCoord.out.bam 68.V0356.R1Aligned.sortedByCoord.out.bam 86.V00750.L1Aligned.sortedByCoord.out.bam BgzfStream::Open: could not open BGZF stream: BamFile::Open: could not open file handle for 123.V00796.L1Aligned.sortedByCoord.out.bam 45.V00001Aligned.sortedByCoord.out.bam 5.V00304.L1Aligned.sortedByCoord.out.bam 66.V01425Aligned.sortedByCoord.out.bam 68.V0356.R1Aligned.sortedByCoord.out.bam 86.V00750.L1Aligned.sortedByCoord.out.bam

read outcome totals across all BAMs USED: 0 NOT_PAIRED: 0 NOT_NH_1: 0 NOT_EXPECTED_CIGAR: 0 NOT_EXPECTED_READ_LENGTH: 0 NOT_EXPECTED_STRAND: 0 EXON_NOT_MATCHED_TO_ANNOTATION: 0 JUNCTION_NOT_MATCHED_TO_ANNOTATION: 0 CLIPPED: 0 total: 0 outcomes by BAM written to: temp/2024-08-08-17_49_03_391496_read_outcomes_by_bam.txt

novel: 0.0009527206420898438 The splicing graph and candidate read have been saved into temp/2024-08-08-17_49_03391496*.rmats save: 0.0002033710479736328 Traceback (most recent call last): File "/home/juaguila/miniconda3/envs/rmats/bin/rmats.py", line 979, in main() File "/home/juaguila/miniconda3/envs/rmats/bin/rmats.py", line 945, in main run_pipe(args) File "rmatspipeline/rmatspipeline.pyx", line 4006, in rmats.rmatspipeline.run_pipe File "rmatspipeline/rmatspipeline.pyx", line 3869, in rmats.rmatspipeline.split_sg_files_by_bam File "rmatspipeline/rmatspipeline.pyx", line 3877, in rmats.rmatspipeline.split_sg_files_by_bam ValueError: invalid literal for int() with base 10: '120.V02191.L1Aligned.sortedByCoord.out.bam'

I literally created a list and then copied and pasted the names to create the txt files. I don't understand why is the error.

Thank;

EricKutschera commented 3 months ago

The files for --b1 and --b2 should each be a single line with , separated file paths: https://github.com/Xinglab/rmats-turbo/tree/v4.3.0?tab=readme-ov-file#starting-with-bam-files

It looks like your files have 1 file per line. Here's a similar issue: https://github.com/Xinglab/rmats-turbo/issues/191

desmodus1984 commented 3 months ago

Hi, I did fix the txt files, and I ran rmats again, and now I have this weird error:

rmats.py --b1 ctrl.H.txt --b2 ctrl.Low.txt --gtf ../GCF_011952255.1_Bvos_JDL3184-5_v1.1_genomic.gtf -t paired --readLength 50 --nthread 12 --od rmats-test --tmp temp gtf: 2.459548234939575 There are 12973 distinct gene ID in the gtf file There are 27041 distinct transcript ID in the gtf file There are 7807 one-transcript genes in the gtf file There are 236878 exons in the gtf file There are 859 one-exon transcripts in the gtf file There are 850 one-transcript genes with only one exon in the transcript Average number of transcripts per gene is 2.084406 Average number of exons per transcript is 8.759957 Average number of exons per transcript excluding one-exon tx is 9.014552 Average number of gene per geneGroup is 17.933534 statistic: 0.0020248889923095703

read outcome totals across all BAMs USED: 179222103 NOT_PAIRED: 64236921 NOT_NH_1: 111524948 NOT_EXPECTED_CIGAR: 1536457 NOT_EXPECTED_READ_LENGTH: 137936793 NOT_EXPECTED_STRAND: 0 EXON_NOT_MATCHED_TO_ANNOTATION: 13366380 JUNCTION_NOT_MATCHED_TO_ANNOTATION: 603563 CLIPPED: 29666879 total: 538094044 outcomes by BAM written to: temp/2024-08-12-19_17_37_342811_read_outcomes_by_bam.txt

novel: 151.39687323570251 The splicing graph and candidate read have been saved into temp/2024-08-12-19_17_37342811*.rmats save: 2.6668777465820312 Traceback (most recent call last): File "/home/juaguila/miniconda3/envs/rmats/bin/rmats.py", line 979, in main() File "/home/juaguila/miniconda3/envs/rmats/bin/rmats.py", line 945, in main run_pipe(args) File "rmatspipeline/rmatspipeline.pyx", line 4006, in rmats.rmatspipeline.run_pipe File "rmatspipeline/rmatspipeline.pyx", line 3869, in rmats.rmatspipeline.split_sg_files_by_bam File "rmatspipeline/rmatspipeline.pyx", line 3877, in rmats.rmatspipeline.split_sg_files_by_bam ValueError: invalid literal for int() with base 10: '120.V02191.L1Aligned.sortedByCoord.out.bam'

Could you suggest what can I do to check the file or perhaps fix it? Those files are the output of STAR, I'd prefer not to reprocess the files, unless you absolute thing that using the fastq is better.

Thanks;

EricKutschera commented 3 months ago

It looks like you're using the same --tmp temp as the previous run. That directory is used for some intermediate files and the error is happening when it reads one of those files from the run with the newline separated bams. You could either remove the files in temp or use a new directory for --tmp

desmodus1984 commented 3 months ago

Hi,

I would like to analyze the RNA-seq that I have, but the design was kinda awkward; We have low and high-elevation individuals, that were then tested for temperature response -cold, heat, and normal temp, but each treatment had individuals from both elevations. Can I reuse the bam files created with rmats to then reanalyze them?

Any suggestion you could say about my code: rmats.py --b1 ctrl.H.txt --b2 ctrl.Low.txt --gtf ../GCF_011952255.1_Bvos_JDL3184-5_v1.1_genomic.gtf -t paired --readLength 50 --nthread 12 --od rmats-test --tmp temp

Lastly, can I use the \ symbol in the text files? The file names are very large, and there is like 30 samples for each group. So it is a huge line to read, and I was hoping to use the "\" to edit in case any name was misstyped.

Thanks;

EricKutschera commented 2 months ago

Can I reuse the bam files created with rmats to then reanalyze them?

If you ran with fastq files then rMATS would put the bam files from STAR in the --tmp folder numbered based on the order in your --s1 and --s2 files: https://github.com/Xinglab/rmats-turbo/blob/v4.3.0/rmats.py#L57

can I use the \ symbol in the text files?

rMATS doesn't handle \ for splitting a line. The code expects the --b1 and --b2 files to be a single line with , separated file paths