alexdobin / STAR

RNA-seq aligner
MIT License
1.85k stars 505 forks source link

2nd pass takes hours when 1st pass only a few minutes #733

Closed hermidalc closed 2 years ago

hermidalc commented 5 years ago

I'm using STAR 2.7.2b and have been aligning SRA study SRP166889. Unlike others I've done it seems like the reads from this study cause STAR to have a very slow 2nd pass. Take for example from the study SRR8112647:

STAR \
--runThreadN 12 \
--readFilesIn SRR8112647_pass_1.fastq.gz SRR8112647_pass_2.fastq.gz \
--alignIntronMax 1000000 \
--alignIntronMin 20 \
--alignMatesGapMax 1000000 \
--alignTranscriptsPerReadNmax 100000 \
--alignSJDBoverhangMin 1 \
--alignSJoverhangMin 8 \
--alignSoftClipAtReferenceEnds Yes \
--chimJunctionOverhangMin 15 \
--chimMainSegmentMultNmax 1 \
--chimOutType Junctions SeparateSAMold WithinBAM SoftClip \
--chimSegmentMin 15 \
--genomeDir 'star_grch38p2_d1_vd1_gtfv22' \
--genomeLoad NoSharedMemory \
--limitSjdbInsertNsj 1200000 \
--outSAMattrRGline "ID:SRR8112647" "SM:SRS3982919" "PL:ILLUMINA" "PM:Illumina HiSeq 3000" \
--outFilterIntronMotifs None \
--outFilterMatchNminOverLread 0.33 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.1 \
--outFilterMultimapNmax 20 \
--outFilterScoreMinOverLread 0.33 \
--outFilterType BySJout \
--outSAMattributes NH HI AS nM NM ch \
--outSAMstrandField intronMotif \
--outSAMtype BAM Unsorted \
--outSAMunmapped Within \
--quantMode GeneCounts \
--readFilesCommand zcat \
--twopassMode Basic
Sep 07 01:40:01 ..... started STAR run
Sep 07 01:40:01 ..... loading genome
Sep 07 01:40:44 ..... started 1st pass mapping
Sep 07 01:59:22 ..... finished 1st pass mapping
Sep 07 01:59:23 ..... inserting junctions into the genome indices
Sep 07 02:01:18 ..... started mapping
Sep 07 05:12:01 ..... finished mapping
Sep 07 05:12:03 ..... finished successfully

When looking at Log.progress.out 2nd pass M/hr is ~30 M/hr, almost an order of magnitude slower than 1st pass ~250 M/hr. What could be the cause of this?

GRCh38 genome index created using NCI GDC source files and standard params:

STAR \
--runThreadN 12 \
--runMode genomeGenerate \
--genomeDir star_grch38p2_d1_vd1_gtfv22 \
--genomeFastaFiles GRCh38.d1.vd1.fa \
--sjdbGTFfile gencode.v22.annotation.gtf \
--sjdbOverhang 100

SRA paired-end FASTQs created using recommended standard params:

parallel-fastq-dump \
--threads 12 \
--sra-id SRR8112647.sra \
--gzip \
--skip-technical \
--readids \
--read-filter pass \
--dumpbase \
--split-3 \
--clip
alexdobin commented 5 years ago

Hi Leandro,

how many junctions were detected in the 1st pass - you can check it in the Log.out file or by counting lines of _STARpass1/SJ.out.tab ? Large number of junctions (>500k) will slow down the 2nd pass mapping.

Another possibility that there is a highly expressed locus with many novel junctions, typically false positive junctions - e.g. in mitochondrion genome, or in rRNA loci.

In any case, you would need to filter the SJ.out.tab file before the 2nd pass mapping, so you would need to do "manual" 2-pass mapping (without --twopassMode option):

  1. Map 1st pass with standard parameters, without BAM/SAM output (--outSAMtype None), without --chim* options and without --outFilterType BySJout.
  2. Filter the resulting SJ.out.tab file. I would start with removing all junctions mapping to chrM and non-chromosomal contigs.
  3. Run second pass (in a separate directory) with all the output options and --sjdbFileChrStartEnd /path/to/SJ.out.tab.filtered.

Cheers Alex

hermidalc commented 5 years ago

Dear @alexdobin - for this study the read length is 2x150bp, does it matter significantly that the genome index was created with default --sjdbOverhang 100? Also, I noticed in the mapping of runs from this study that compared to other studies where unique mapping is > 90% this one gets ~70% with ~30% multi-mapped. Is there something I'm not aware of that is happening or is this normal for human data?

hermidalc commented 5 years ago

@alexdobin

how many junctions were detected in the 1st pass - you can check it in the Log.out file or by counting lines of _STARpass1/SJ.out.tab ? Large number of junctions (>500k) will slow down the 2nd pass mapping.

Sorry I'm confused about something here. The number of lines in the SJ.out.tab file:

$ wc -l SRR8112647/_STARpass1/SJ.out.tab 
365671 SRR8112647/_STARpass1/SJ.out.tab

is much less than the number of junctions reported in Log.out, which is either 713112, or 146928 + 347441 = 494369?

Sep 11 18:00:03   Loaded database junctions from the generated genome star_grch38p2_d1_vd1_gtfv22/sjdbList.out.tab: 347441 total junctions

Sep 11 18:00:03   Loaded database junctions from the 1st pass file: SRR8112647/_STARpass1//SJ.out.tab: 713112 total junctions

WARNING: long repeat for junction # 250346 : chr15 88855522 88855578; left shift = 14; right shift = 255
WARNING: long repeat for junction # 292573 : chr18 46969610 47029116; left shift = 183; right shift = 255
WARNING: long repeat for junction # 592560 : chr15 88855522 88855578; left shift = 14; right shift = 255
Sep 11 18:00:03   Finished preparing junctions
Sep 11 18:00:03 ..... inserting junctions into the genome indices
Sep 11 18:00:37   Finished SA search: number of new junctions=146928, old junctions=347441
Sep 11 18:00:51   Finished sorting SA indicesL nInd=58762556
Genome size with junctions=3915398377  3816030208   99368169
GstrandBit1=32   GstrandBit=32
Sep 11 18:01:49   Finished inserting junction indices
Sep 11 18:01:59   Finished SAi
Sep 11 18:02:00 ..... finished inserting junctions into genome
hermidalc commented 5 years ago

Dear @alexdobin -

In any case, you would need to filter the SJ.out.tab file before the 2nd pass mapping, so you would need to do "manual" 2-pass mapping (without --twopassMode option):

I've done the manual 2-pass with SJ filtering as you suggested, though it hasn't helped really with the 2nd pass mapping speed, it's still ~30-40 M/hr compared to ~250 M/hr in the 1st pass.

Here are the steps I did following your guidance, please tell me if any options are missing or shouldn't be there:

  1. Map 1st pass with standard parameters, without BAM/SAM output (--outSAMtype None), without --chim* options and without --outFilterType BySJout.
STAR \
--runThreadN 12 \
--readFilesIn 'SRR8112647/SRR8112647_pass_1.fastq.gz' 'SRR8112647/SRR8112647_pass_2.fastq.gz' \
--alignIntronMax 1000000 \
--alignIntronMin 20 \
--alignMatesGapMax 1000000 \
--alignSJDBoverhangMin 1 \
--alignSJoverhangMin 8 \
--alignSoftClipAtReferenceEnds Yes \
--genomeDir 'star_grch38p2_d1_vd1_gtfv22_rl150' \
--genomeLoad NoSharedMemory \
--limitSjdbInsertNsj 1200000 \
--outFileNamePrefix 'SRR8112647/_STARpass1/' \
--outFilterIntronMotifs None \
--outFilterMatchNminOverLread 0.33 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.1 \
--outFilterMultimapNmax 20 \
--outFilterScoreMinOverLread 0.33 \
--outSAMtype None \
--readFilesCommand zcat
  1. Filter the resulting SJ.out.tab file. I would start with removing all junctions mapping to chrM and non-chromosomal contigs.
awk '{ if ($1 ~ /^chr([1-9][0-9]?|X|Y)$/) { print } }' 'SRR8112647/_STARpass1/SJ.out.tab' > 'SRR8112647/_STARpass1/SJ.filtered.out.tab'

Only a few thousand junctions were removed:

$ wc -l SRR8112647/_STARpass1/SJ.out.tab
365607 SRR8112647/_STARpass1/SJ.out.tab
$ wc -l SRR8112647/_STARpass1/SJ.filtered.out.tab 
357141 SRR8112647/_STARpass1/SJ.filtered.out.tab
  1. Run second pass (in a separate directory) with all the output options and --sjdbFileChrStartEnd /path/to/SJ.out.tab.filtered.
STAR \
--runThreadN 12 \
--readFilesIn 'SRR8112647/SRR8112647_pass_1.fastq.gz' 'SRR8112647/SRR8112647_pass_2.fastq.gz' \
--alignIntronMax 1000000 \
--alignIntronMin 20 \
--alignMatesGapMax 1000000 \
--alignSJDBoverhangMin 1 \
--alignSJoverhangMin 8 \
--alignSoftClipAtReferenceEnds Yes \
--chimJunctionOverhangMin 15 \
--chimMainSegmentMultNmax 1 \
--chimOutType Junctions SeparateSAMold WithinBAM SoftClip \
--chimSegmentMin 15 \
--genomeDir 'star_grch38p2_d1_vd1_gtfv22_rl150' \
--genomeLoad NoSharedMemory \
--limitSjdbInsertNsj 1200000 \
--outFileNamePrefix 'SRR8112647/_STARpass2/' \
--outFilterIntronMotifs None \
--outFilterMatchNminOverLread 0.33 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.1 \
--outFilterMultimapNmax 20 \
--outFilterScoreMinOverLread 0.33 \
--outFilterType BySJout \
--outSAMattrRGline "ID:SRR8112647" "SM:SRS3982919" "PL:ILLUMINA" "PM:Illumina HiSeq 3000" \
--outSAMattributes NH HI AS nM NM ch \
--outSAMstrandField intronMotif \
--outSAMtype BAM Unsorted \
--outSAMunmapped Within \
--quantMode GeneCounts \
--readFilesCommand zcat \
--sjdbFileChrStartEnd 'SRR8112647/_STARpass1/SJ.filtered.out.tab'

Again in the Log.out file it's confusing what is the total number of junctions after inserting:

Sep 11 20:42:47   Loaded database junctions from the generated genome star_grch38p2_d1_vd1_gtfv22_rl150/sjdbList.out.tab: 347441 total junctions

Loaded database junctions from file: SRR8112647/_STARpass1/SJ.filtered.out.tab, total number of junctions: 704582 junctions

Sep 11 20:42:47   Loaded database junctions from the pGe.sjdbFileChrStartEnd file(s), 704582 total junctions

WARNING: long repeat for junction # 250346 : chr15 88855522 88855578; left shift = 14; right shift = 255
WARNING: long repeat for junction # 292573 : chr18 46969610 47029116; left shift = 183; right shift = 255
WARNING: long repeat for junction # 593204 : chr15 88855522 88855578; left shift = 14; right shift = 255
Sep 11 20:42:48   Finished preparing junctions
Sep 11 20:42:48 ..... inserting junctions into the genome indices
Sep 11 20:43:39   Finished SA search: number of new junctions=138464, old junctions=347441
Sep 11 20:43:58   Finished sorting SA indicesL nInd=82521930
Genome size with junctions=3961315803  3816030208   145285595
GstrandBit1=32   GstrandBit=32
Sep 11 20:44:59   Finished inserting junction indices
Sep 11 20:45:11   Finished SAi
Sep 11 20:45:12 ..... finished inserting junctions into genome
hermidalc commented 5 years ago

I've looked at whether runs which have slow 2nd pass mapping rate have any rRNA contamination with bbduk but they do not. I've also found two more example runs from another study SRP070710 that have slow 2nd pass mapping even though they do not have a large number of new junctions SRR3184294 and SRR3184295. Worth looking at if you ever have time to see what STAR is doing that causes a huge slow down in mapping rate between the 1st and 2nd pass.

alexdobin commented 5 years ago

Hi Leandro,

using --sjdbOverhang 100 for 2x150 reads is fine, it changes only a very small portion of alignments.

30% multimapping reads is on the high side for normal human data. You need to compare other mapping statistics in the Log.final.out file to see if anything else is different between the samples. What is the mapping rate for the 1st pass?

365,607 is the total number of detected junctions (novel and annotated). In the 2nd pass, Log.out reports number of new junctions=138464, old junctions=347441 The old junctions are the annotated (from GTF) and new 138464 junctions are (only) novel junctions from the SJ.out.tab.filtered. Note that you can filter out all the annotated junctions from the SJ.out.tab (i.e. keep only col6==0) since all annotated junctions are taken from the GTF file.

There are further filtering steps to increase the mapping speed of the 2nd pass, if you are willing to try them. I am not sure if it's worse the time though. :) First, I would try --outFilterMatchNminOverLread 0.66 (default). It is possible that the 0.33 value allows a number of false junctions which slows down the 2nd pass. The next one is to take out all non-canonical junctions (i.e. keep col5>0 in SJ.out.tab) and junction supported by only multimappers (i.e. keep col7>0). If this does not help, you can try to filter out junctions supported by few reads, i.e. do a series of runs keeping col7>=2,3,4...

If none of the above helps, I will try to look into it when I have time, since this is public data. I have encountered a few cases where the slowdown was caused by an abnormally highly expressed locus in the genome with a few splice junctions.

Cheers Alex

hermidalc commented 5 years ago

@alexdobin thanks for the advice.

30% multimapping reads is on the high side for normal human data. You need to compare other mapping statistics in the Log.final.out file to see if anything else is different between the samples. What is the mapping rate for the 1st pass?

All the runs from the public study SRP166889 have a 1st pass mapping rate is ~250 M/hr, 2nd pass ~30 M/hr, and a multi-mapping of ~30%. I found no significant rRNA contamination so not sure why.

For your own personal interest a single good example is from another study SRP094781. I would look at is SRR5088841. It has >250 M/hr 1st pass but then on the 2nd pass it had the slowest speed I've seen 15-20 M/hr. It has a good unique mapping rate ~90% and the 1st pass didn't add many junctions. So very interesting to know what in this run is causing STAR to go much slower.

hermidalc commented 5 years ago

Sorry edited above the good example run should be SRR5088841

hermidalc commented 5 years ago

365,607 is the total number of detected junctions (novel and annotated). In the 2nd pass, Log.out reports number of new junctions=138464, old junctions=347441 The old junctions are the annotated (from GTF) and new 138464 junctions are (only) novel junctions from the SJ.out.tab.filtered.

Need to ask you here, could you explain how these numbers are related? Since you said 365,607 = novel + annotated, and annotated = 347441 and novel = 138464 then the numbers don't add up.

The next one is to take out all non-canonical junctions (i.e. keep col5>0 in SJ.out.tab)

I'm not familiar with what the difference is between canonical and non-canonical junctions, but from a quick background read it seems that non-canonical junctions are important? Sorry I'm just trying to make sure that whatever I filter out isn't changing the end results (which currently in my case are the gene read count quants)

alexdobin commented 5 years ago

Hi Leandro,

old junctions=347441 is the total number of annotated junctions from GTF, not all of them are detected. new junctions=138464 are novel detected junctions 365607 = wc -l SJ.out.tab = total number of detected junctions = new junctions + detected annotated junctions detected annotated junctions= 365607-138464 = subset of 347441 all annotated junctions.

What I call non-canonical junctions are non GT/AG, GC/AG, AT/AC - sometimes the latter two (which are much rarer) are also called non-canonical, or semi-canonical. These are motifs for which the splicing mechanism is well established. For all other motifs there is no known biochemical mechanism and most splicing experts do not believe in them. Some of them occur close to canonical junctions and are clearly small spliceosome errors. In the simulations, the false discovery rate for non-canonical junction detection is much higher than for canonical. Filtering non-canonical junctions is safe and commonly accepted - unless you want to study them specifically.

Cheers Alex

alexdobin commented 5 years ago

Hi Leandro I have run the SRR5088841 case with --outFilterMultimapNmax 20 --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --twopassMode Basic

h38, Gencode26

These are shortish reads (~2x50) and have high multimapping rate ~30%, and low non-mapped rate ~7%.

The 1st pass speed (10 threads) was ~630M/hr, and the 2nd ~160M/hr, so about ~4-fold slowdown.

Filtering of junctions: $ awk '$6==0' _STARpass1/SJ.out.tab | wc Number of novel junctions in the 1st pass=10960

awk '$1~/chr[1-2XY]/ && $6==0' _STARpass1/SJ.out.tab Novel junctions for chr1-22,X,Y = 8060

awk '$1~/chr[1-2XY]/ && $6==0 && $5>0 && $7>0' _STARpass1/SJ.out.tab > SJ.filt Novel junctions for chr1-22,X,Y , canonical and supported by a unique mapper = 5041

So this filtering removes ~50% of the novel junctions. Then I run the 2nd pass with --sjdbFileChrStartEnd SJ.filt and the speed got up to ~500M/hr which is decently close to the 1st pass speed of 630M/hr.

You can try to tease out precisely which filtering has the largest effect. My bet is on unique mapper support, as this sample has a high rate of multimappers.

Cheers Alex

hermidalc commented 5 years ago

I have run the SRR5088841 case with --outFilterMultimapNmax 20 --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --twopassMode Basic

h38, Gencode26

These are shortish reads (~2x50) and have high multimapping rate ~30%, and low non-mapped rate ~7%.

The 1st pass speed (10 threads) was ~630M/hr, and the 2nd ~160M/hr, so about ~4-fold slowdown.

Filtering of junctions: $ awk '$6==0' _STARpass1/SJ.out.tab | wc Number of novel junctions in the 1st pass=10960

awk '$1~/chr[1-2XY]/ && $6==0' _STARpass1/SJ.out.tab Novel junctions for chr1-22,X,Y = 8060

awk '$1~/chr[1-2XY]/ && $6==0 && $5>0 && $7>0' _STARpass1/SJ.out.tab > SJ.filt Novel junctions for chr1-22,X,Y , canonical and supported by a unique mapper = 5041

So this filtering removes ~50% of the novel junctions. Then I run the 2nd pass with --sjdbFileChrStartEnd SJ.filt and the speed got up to ~500M/hr which is decently close to the 1st pass speed of 630M/hr.

I sincerely apologize Alex, I gave the wrong run because I'd been looking at a few different ones over the last week that were slow on the 2nd pass and got it mixed up. I believe the correct run I should've referred to was SRR5088824. Either way I think I will try your steps with this run and also runs from study SRP166889.

I would try --outFilterMatchNminOverLread 0.66 (default). It is possible that the 0.33 value allows a number of false junctions which slows down the 2nd pass.

Did the default in a previous version of STAR used to be 0.33? Not sure why the NCI GDC chose to alter from the STAR default to 0.33 for both --outFilterMatchNminOverLread and --outFilterScoreMinOverLread (see on page here click on tab "Dr15plus" <a href=https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#rna-seq-alignment-command-line-parameters" target="_blank>NCI GDC RNA-Seq Alignment Command Line Parameters).

hermidalc commented 5 years ago

I have a separate question regarding the "manual" two-pass with the intermediate SJ.out filtering. My updated STAR workflow uses a generated genome index without GTF file and then I include the GTF file on the fly during the two-pass mapping command so that I can specify the exact --sjdbOverhang for the study or group of runs I'm aligning. I know you said it makes not much difference but did this before.

Anyway, with the manual two-pass do I need to specify the GTF file in both 1st pass and 2nd pass STAR commands?

alexdobin commented 5 years ago

Anyway, with the manual two-pass do I need to specify the GTF file in both 1st pass and 2nd pass STAR commands?

Yes!

Either way I think I will try your steps with this run and also runs from study SRP166889.

Great - let me know how this goes.

Did the default in a previous version of STAR used to be 0.33? Not sure why the NCI GDC chose to alter from the STAR default to 0.33 for both --outFilterMatchNminOverLread and --outFilterScoreMinOverLread

No! The default and recommended values have always been 0.66. Reducing them below 0.5 allows for unpaired single-end alignments and increases the mapping rate - but at the cost of higher false alignments.

hermidalc commented 5 years ago

Regarding your recommended SJ.out filters after the 1st pass. A possible nice feature request would be to have an option where you could specify one or more 1st pass filter types therefore being able to still run automatic two-pass mode. For example:

--twopass1SJfilter Chromosomal Canonical Unique

I know the Chromosomal would be a slight more pain to do since it's more difficult to know what species and what naming convention is used. But there are ways for sure, maybe with another option, to have users specify what they want.

hermidalc commented 5 years ago

Have a question about an important difference I noticed between manual and auto two-pass modes. The 1st pass SJ.out.tab files aren't the same between both modes.

Maybe I'm not running the manual 1st pass with the correct "standard parameters", though I followed your advice of what parameters to exclude. Here's my manual 1st pass:

STAR \
--runThreadN 10 \
--readFilesIn 'SRR5088824/SRR5088824_pass_1.fastq.gz' 'SRR5088824/SRR5088824_pass_2.fastq.gz' \
--alignIntronMax 1000000 \
--alignIntronMin 20 \
--alignMatesGapMax 1000000 \
--alignSJDBoverhangMin 1 \
--alignSJoverhangMin 8 \
--alignSoftClipAtReferenceEnds Yes \
--genomeDir 'star_genome_grch38p2_d1_vd1_gtfv22' \
--genomeLoad NoSharedMemory \
--limitSjdbInsertNsj 1200000 \
--outFileNamePrefix 'SRR5088824/_STARpass1/' \
--outFilterIntronMotifs None \
--outFilterMatchNminOverLread 0.33 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.1 \
--outFilterMultimapNmax 20 \
--outFilterScoreMinOverLread 0.33 \
--outSAMtype None \
--readFilesCommand zcat \
--sjdbGTFfile 'gencode.v22.annotation.gtf' \
--sjdbOverhang 50

The resulting SJ.out.tab file has:

$ wc -l SRR5088824/_STARpass1/SJ.out.tab 
210377 SRR5088824/_STARpass1/SJ.out.tab

When I run the equivalent two-pass mode:

STAR \
--runThreadN 10 \
--readFilesIn 'SRR5088824/SRR5088824_pass_1.fastq.gz' 'SRR5088824/SRR5088824_pass_2.fastq.gz' \
--alignIntronMax 1000000 \
--alignIntronMin 20 \
--alignMatesGapMax 1000000 \
--alignSJDBoverhangMin 1 \
--alignSJoverhangMin 8 \
--alignSoftClipAtReferenceEnds Yes \
--chimJunctionOverhangMin 15 \
--chimMainSegmentMultNmax 1 \
--chimOutType Junctions SeparateSAMold WithinBAM SoftClip \
--chimSegmentMin 15 \
--genomeDir 'star_genome_grch38p2_d1_vd1_gtfv22' \
--genomeLoad NoSharedMemory \
--limitSjdbInsertNsj 1200000 \
--outFileNamePrefix 'SRR5088824/' \
--outFilterIntronMotifs None \
--outFilterMatchNminOverLread 0.33 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.1 \
--outFilterMultimapNmax 20 \
--outFilterScoreMinOverLread 0.33 \
--outFilterType BySJout \
--outSAMattrRGline "ID:SRR5088824" "SM:SRS1844540" "PL:ILLUMINA" "PM:Illumina Genome Analyzer" \
--outSAMattributes NH HI AS nM NM ch \
--outSAMstrandField intronMotif \
--outSAMtype BAM Unsorted \
--outSAMunmapped Within \
--quantMode GeneCounts \
--readFilesCommand zcat \
--sjdbGTFfile 'gencode.v22.annotation.gtf' \
--sjdbOverhang 50 \
--twopassMode Basic

The resulting 1st pass SJ.out.tab file has:

$ wc -l SRR5088824/_STARpass1/SJ.out.tab 
210433 SRR5088824/_STARpass1/SJ.out.tab

So a difference of 56 junctions in this example.

hermidalc commented 5 years ago

Dear @alexdobin - running the "manual" two-pass with filtering has sped up all of the slow sample runs. Thank you very much. If I may ask you for help on my last question above, I would appreciate it. Basically what parameter settings should be made to generate the same 1st pass SJ.out.tab in manual mode as the one generated by auto two-pass mode.

I just saw now that there is a Google group for general method questions about STAR, so after this I will ask any future questions now related to possible bug etc there.

alexdobin commented 5 years ago

Hi Leandro,

sorry, missed your question... bumping it up is a good idea (every 3 days or so). :) Not sure why this is happening, must be some combination of parameters, need to check it.

Cheers Alex

alexdobin commented 5 years ago

Hi Leandro,

I have run 1- and 2-pass with your parameter on my dataset, and got exactly the same SJ.out.tab... Could you please send me the Log.out files from two runs?

Cheers Alex

hermidalc commented 5 years ago

Manual 1st pass Log.out Auto 1st pass Log.out

I double-checked the 1st pass SJ.out.tabs and same as I wrote above. Remember though I'm using the NCI GDC genome FASTA and GTF files https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files. I built genome index with:

STAR \
--runThreadN 10 \
--runMode genomeGenerate \
--genomeDir star_genome_grch38p2_d1_vd1_gtfv22 \
--genomeFastaFiles GRCh38.d1.vd1.fa
scseekers commented 4 years ago

Hi @hermidalc

I was also planning for running STAR in 2-pass mode and I was trying to follow your steps. However, I had this weird observation When I use --outFilterType BySJout option, I see a higher number of junctions reported rather than when I remove this filter (which is using the default Normal filter). I think BySJout is a more stringent filter and therefore must report lesser junctions. This is from the 1-pass alignment

**After removing --outFilterType BySJout**
(base) [abc@hpc 1st_pass]$ wc -l *SJ.out.tab
  36991 16DAF_MM17_CC.SJ.out.tab
  38672 16DAF_MM18_CC.SJ.out.tab
  55117 4DAF_MM05_CC.SJ.out.tab
  45731 4DAF_MM13_CC.SJ.out.tab
  10860 8DAF_MM21_CC.SJ.out.tab
 187371 total
**Using --outFilterType BySJout**
(base) [abc@hpc 1st_pass]$ wc -l ../sj/*SJ.out.tab
  37286 ../sj/16DAF_MM17_CC.SJ.out.tab
  38964 ../sj/16DAF_MM18_CC.SJ.out.tab
  55333 ../sj/4DAF_MM05_CC.SJ.out.tab
  46116 ../sj/4DAF_MM13_CC.SJ.out.tab
  10919 ../sj/8DAF_MM21_CC.SJ.out.tab
 188618 total

I am using 2.7.6a STAR version and the following options. I need to output BAM to see how much improvement in unique alignment do I see post-second pass

STAR --genomeDir $index --runThreadN 20 \
--readFilesIn $raw_data/${i}_trim.fastq.gz \
--outFileNamePrefix $out/$i. \
--outSAMstrandField intronMotif \
--sjdbGTFfile $gtf \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 20 \
--readFilesCommand zcat \
--quantMode GeneCounts \
--outSAMunmapped Within \
--outSAMattributes NH HI AS NM MD \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outFilterMismatchNmax 999 \
--sjdbScore 1
alexdobin commented 4 years ago

Hi @scseekers

this is actually normal, I believe. The --outFilterType BySJout does not change the filtering of junctions that go into SJ.out.tab. However, it prevent some alignments - those with junctions not passing the filters, and instead of these alignments it may record new ones - those that can contribute new junctions to SJ.out.tab.

Cheers Alex

ArashDepp commented 1 year ago

Hi. I have a related query.

I detected ~ 300,000 new junctions after the suggested filtering when aligning reads for ~ 100 samples (each aligned separately) after the first pass of alignment. Then I generated a new index by using these new junctions in addition to GTF file and perform the second pass alignment afterwards. I also observed a slow down of nearly 4 fold.

Is the slow down a bad sign. Should I worry and try to fix it? The number of uniquely mapped reads are also reduced. Here are the tables showing the proportion of uniquely mapped reads and the corresponding number of samples:

Stats of uniquely mapping reads in first pass:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

% of unique mapping reads | number of samples -- | -- 65 | 1 73 | 1 74 | 1 75 | 2 76 | 3 77 | 8 78 | 6 79 | 11 80 | 8 81 | 14 82 | 11 83 | 9 84 | 8 85 | 4 86 | 1

Stats of uniquely mapping reads after second pass:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

**% of unique mapping reads | number of samples** -- | -- 62 | 1 70 | 2 71 | 1 72 | 3 73 | 4 74 | 9 75 | 11 76 | 9 77 | 7 78 | 11 79 | 12 80 | 3

As you could see, the. number of samples having more than 80% unique mapping are more in the first pass as compared to the second pass. Does that mean I need to fix the second step of alignment in your opinion? Appreciate your response..

alexdobin commented 1 year ago

Hi @ArashDepp

significant reduction of the unique mappers may indicate that too many junctions were detected in the first pass, and you need to filter them down. Did you detect 300,000 junctions per sample, or the total for all samples?

ArashDepp commented 1 year ago

Hi Alex, Thanks for responding. 300K junctions is what I collected after filtering from ~100 samples. My filtering scheme was as follows: cat $firstRunDir/*/*.tab | awk '($5 > 0 && $7 > 4 && $6 == 0)' | cut -f1-6 | sort -k1,1 -k2,2n | uniq > $junctionFile

This was based on suggestions in some of the threads in google groups.

alexdobin commented 1 year ago

Hi @ArashDepp

This is a good filtering scheme, and the collapsed number of junctions from all samples is reasonable. One additional filter I would recommend is to remove all mitochondrial (chrM) junctions, but it may not help. It's possible that both the slowdown and reduction in unique mappers are caused by just a few junctions/loci.

ArashDepp commented 1 year ago

hmm I see. Thanks a lot for discussing this issue. I will try with removing the chrM junctions.

But I have one question though. In the junction file outputted by the first pass alignment, is it possible to convert the number of reads supporting the junction into FPKM/TPM value? That will enable more uniform filtering when processing the datasets of varying library sizes. Might be a good add on.

In most datasets, i retained ~150k novel junctions after filtering, but one datasets had ~300k, possibly due to much larger library size.

alexdobin commented 1 year ago

Calculating FPKM/TPM is not straightforward for splice junctions, since they do not have a length to normalize by. I would simply use CPM, counts per million, i.e. divide the counts by the number of mapped reads in the sample.

ArashDepp commented 1 year ago

Okay. Makes sense. Thank you so much for your help and appreciate your inputs 👍