alexdobin / STAR

RNA-seq aligner
MIT License
1.85k stars 506 forks source link

Chimeric.out.Junction file is the command used instead of actual junctions #1183

Open jgoldst7 opened 3 years ago

jgoldst7 commented 3 years ago

Hello, I am running into an issue trying to produce the Chimeric.out.Junction file for large files. I have a pipeline using STAR 2.7.7a with the command

STAR --genomeDir `pwd`/{params.index} \
        --runThreadN 12 \
        --readFilesIn {input.R1} {input.R2} \
        --outFileNamePrefix {params.outPrefix} \
        --outSAMtype BAM Unsorted \
        --readFilesCommand zcat \
        --quantMode TranscriptomeSAM \
        --outSAMunmapped Within \
        --chimSegmentMin 12 \
        --chimOutJunctionFormat 1 \
        --alignSJstitchMismatchNmax 5 -1 5 5 \
        --alignMatesGapMax 100000 \
        --twopassMode Basic &> {log}

This has worked in the past, with the Chimeric.out.Junction file having valid chimeric reads that were used with STAR Fusion 1.9.0 to find fusions. However, when I try to run the exact commands with very large fastq files (~10Gb for each of gzipped 2 paired end files instead of ~2 GB like I usually use, I am having issues. STAR completes fine. This is the log file

                                 Started job on |   Mar 20 15:52:11
                             Started mapping on |   Mar 20 17:38:10
                                    Finished on |   Mar 20 21:50:04
       Mapping speed, Million of reads per hour |   38.49

                          Number of input reads |   161591557
                      Average input read length |   279
                                    UNIQUE READS:
                   Uniquely mapped reads number |   153788119
                        Uniquely mapped reads % |   95.17%
                          Average mapped length |   278.95
                       Number of splices: Total |   172417208
            Number of splices: Annotated (sjdb) |   172128626
                       Number of splices: GT/AG |   170456671
                       Number of splices: GC/AG |   1466742
                       Number of splices: AT/AC |   153618
               Number of splices: Non-canonical |   340177
                      Mismatch rate per base, % |   0.24%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.93
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.66
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   4775002
             % of reads mapped to multiple loci |   2.95%
        Number of reads mapped to too many loci |   61939
             % of reads mapped to too many loci |   0.04%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   2820621
                 % of reads unmapped: too short |   1.75%
                Number of reads unmapped: other |   145876
                     % of reads unmapped: other |   0.09%
                                  CHIMERIC READS:
                       Number of chimeric reads |   902812
                            % of chimeric reads |   0.56%

There is a Chimeric.out.Junction file produced. However, what it actually contains is this instead of chimeric reads.

# 2.7.7a   STAR --genomeDir /data1/snakemake/rnaseq.raw/indexes/star_rsem --runThreadN 12 --readFilesIn rnaseq.raw/30-446545408/results/trimming/UM-HACC-2A-1_R1_val_1.fq.gz rnaseq.raw/30-446545408/results/trimming/UM-HACC-2A-1_R2_val_2.fq.gz --outFileNamePrefix rnaseq.raw/30-446545408/results/alignment/UM-HACC-2A-1/ --outSAMtype BAM Unsorted --readFilesCommand zcat --quantMode TranscriptomeSAM --outSAMunmapped Within --chimSegmentMin 12 --chimOutJunctionFormat 1 --alignSJstitchMismatchNmax 5 -1 5 5 --alignMatesGapMax 100000 --twopassMode Basic
# Nreads 161591557  NreadsUnique 153788119  NreadsMulti 4775002

Any ideas why this might be happening, possibly related to the fastq files being large? When I took a subsample of these files of just 100,000 reads (using seqtk) everything worked fine, and the Chimeric.out.Junction file was normal.

alexdobin commented 3 years ago

Hi Jeremy,

I have run a test case with 2.7.7a and your parameters, and could not reproduce this problem. A couple of things to try:

  1. Please check that you have enough disk space
  2. Run it with --outTmpKeep All option, and then check whether the Chimeric* files in _STARtmp have the junctions in them.
  3. Send me the Log.out file.

Cheers Alex

jgoldst7 commented 3 years ago

Hi Alex,

I reran with the -outTmpKeep All option on an x1e.xlarge EC2 instance with 300 Gb of disk and it worked!

It used 130Gb of the disk space and 33% of the memory so I probably did not need to go as big, but the problem before must have been not enough disk space and the job was still completing "successfully" even though it was not actually successful.

The output bam files from the failed run also had "EOF marker is absent" errors, another sign that it must have run out of space before it actually finished. Below is the log file of the successful run to see side by side. Interestingly, the successful run (with the exact same input files) has more multi-mapped reads and chimeric reads, and less uniquely mapped reads.

                      Started job on |  Mar 31 01:34:47
                             Started mapping on |   Mar 31 04:30:22
                                    Finished on |   Mar 31 11:14:33
       Mapping speed, Million of reads per hour |   23.99

                          Number of input reads |   161591557
                      Average input read length |   279
                                    UNIQUE READS:
                   Uniquely mapped reads number |   146975608
                        Uniquely mapped reads % |   90.96%
                          Average mapped length |   279.04
                       Number of splices: Total |   164914867
            Number of splices: Annotated (sjdb) |   164643795
                       Number of splices: GT/AG |   163069848
                       Number of splices: GC/AG |   1376287
                       Number of splices: AT/AC |   145863
               Number of splices: Non-canonical |   322869
                      Mismatch rate per base, % |   0.24%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.92
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.64
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   11543699
             % of reads mapped to multiple loci |   7.14%
        Number of reads mapped to too many loci |   101016
             % of reads mapped to too many loci |   0.06%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   2810593
                 % of reads unmapped: too short |   1.74%
                Number of reads unmapped: other |   160641
                     % of reads unmapped: other |   0.10%
                                  CHIMERIC READS:
                       Number of chimeric reads |   1242113
                            % of chimeric reads |   0.77%

Thanks for your help. Best, Jeremy

alexdobin commented 3 years ago

Hi Jeremy,

the problem is STAR is not checking success of writing for some of the temporary files, so if disk is full, it will silently continue and produce wrong results. I will have to fix this.

Thanks! Alex

jgoldst7 commented 3 years ago

Hi Alex,

Thanks for the information, would definitely be a good update!

Best, Jeremy