Closed hermidalc closed 2 years ago
Hi Leandro,
how many junctions were detected in the 1st pass - you can check it in the Log.out file or by counting lines of _STARpass1/SJ.out.tab ? Large number of junctions (>500k) will slow down the 2nd pass mapping.
Another possibility that there is a highly expressed locus with many novel junctions, typically false positive junctions - e.g. in mitochondrion genome, or in rRNA loci.
In any case, you would need to filter the SJ.out.tab file before the 2nd pass mapping, so you would need to do "manual" 2-pass mapping (without --twopassMode option):
Cheers Alex
Dear @alexdobin - for this study the read length is 2x150bp, does it matter significantly that the genome index was created with default --sjdbOverhang 100
? Also, I noticed in the mapping of runs from this study that compared to other studies where unique mapping is > 90% this one gets ~70% with ~30% multi-mapped. Is there something I'm not aware of that is happening or is this normal for human data?
@alexdobin
how many junctions were detected in the 1st pass - you can check it in the Log.out file or by counting lines of _STARpass1/SJ.out.tab ? Large number of junctions (>500k) will slow down the 2nd pass mapping.
Sorry I'm confused about something here. The number of lines in the SJ.out.tab file:
$ wc -l SRR8112647/_STARpass1/SJ.out.tab
365671 SRR8112647/_STARpass1/SJ.out.tab
is much less than the number of junctions reported in Log.out, which is either 713112, or 146928 + 347441 = 494369?
Sep 11 18:00:03 Loaded database junctions from the generated genome star_grch38p2_d1_vd1_gtfv22/sjdbList.out.tab: 347441 total junctions
Sep 11 18:00:03 Loaded database junctions from the 1st pass file: SRR8112647/_STARpass1//SJ.out.tab: 713112 total junctions
WARNING: long repeat for junction # 250346 : chr15 88855522 88855578; left shift = 14; right shift = 255
WARNING: long repeat for junction # 292573 : chr18 46969610 47029116; left shift = 183; right shift = 255
WARNING: long repeat for junction # 592560 : chr15 88855522 88855578; left shift = 14; right shift = 255
Sep 11 18:00:03 Finished preparing junctions
Sep 11 18:00:03 ..... inserting junctions into the genome indices
Sep 11 18:00:37 Finished SA search: number of new junctions=146928, old junctions=347441
Sep 11 18:00:51 Finished sorting SA indicesL nInd=58762556
Genome size with junctions=3915398377 3816030208 99368169
GstrandBit1=32 GstrandBit=32
Sep 11 18:01:49 Finished inserting junction indices
Sep 11 18:01:59 Finished SAi
Sep 11 18:02:00 ..... finished inserting junctions into genome
Dear @alexdobin -
In any case, you would need to filter the SJ.out.tab file before the 2nd pass mapping, so you would need to do "manual" 2-pass mapping (without --twopassMode option):
I've done the manual 2-pass with SJ filtering as you suggested, though it hasn't helped really with the 2nd pass mapping speed, it's still ~30-40 M/hr compared to ~250 M/hr in the 1st pass.
Here are the steps I did following your guidance, please tell me if any options are missing or shouldn't be there:
- Map 1st pass with standard parameters, without BAM/SAM output (--outSAMtype None), without --chim* options and without --outFilterType BySJout.
STAR \
--runThreadN 12 \
--readFilesIn 'SRR8112647/SRR8112647_pass_1.fastq.gz' 'SRR8112647/SRR8112647_pass_2.fastq.gz' \
--alignIntronMax 1000000 \
--alignIntronMin 20 \
--alignMatesGapMax 1000000 \
--alignSJDBoverhangMin 1 \
--alignSJoverhangMin 8 \
--alignSoftClipAtReferenceEnds Yes \
--genomeDir 'star_grch38p2_d1_vd1_gtfv22_rl150' \
--genomeLoad NoSharedMemory \
--limitSjdbInsertNsj 1200000 \
--outFileNamePrefix 'SRR8112647/_STARpass1/' \
--outFilterIntronMotifs None \
--outFilterMatchNminOverLread 0.33 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.1 \
--outFilterMultimapNmax 20 \
--outFilterScoreMinOverLread 0.33 \
--outSAMtype None \
--readFilesCommand zcat
- Filter the resulting SJ.out.tab file. I would start with removing all junctions mapping to chrM and non-chromosomal contigs.
awk '{ if ($1 ~ /^chr([1-9][0-9]?|X|Y)$/) { print } }' 'SRR8112647/_STARpass1/SJ.out.tab' > 'SRR8112647/_STARpass1/SJ.filtered.out.tab'
Only a few thousand junctions were removed:
$ wc -l SRR8112647/_STARpass1/SJ.out.tab
365607 SRR8112647/_STARpass1/SJ.out.tab
$ wc -l SRR8112647/_STARpass1/SJ.filtered.out.tab
357141 SRR8112647/_STARpass1/SJ.filtered.out.tab
- Run second pass (in a separate directory) with all the output options and --sjdbFileChrStartEnd /path/to/SJ.out.tab.filtered.
STAR \
--runThreadN 12 \
--readFilesIn 'SRR8112647/SRR8112647_pass_1.fastq.gz' 'SRR8112647/SRR8112647_pass_2.fastq.gz' \
--alignIntronMax 1000000 \
--alignIntronMin 20 \
--alignMatesGapMax 1000000 \
--alignSJDBoverhangMin 1 \
--alignSJoverhangMin 8 \
--alignSoftClipAtReferenceEnds Yes \
--chimJunctionOverhangMin 15 \
--chimMainSegmentMultNmax 1 \
--chimOutType Junctions SeparateSAMold WithinBAM SoftClip \
--chimSegmentMin 15 \
--genomeDir 'star_grch38p2_d1_vd1_gtfv22_rl150' \
--genomeLoad NoSharedMemory \
--limitSjdbInsertNsj 1200000 \
--outFileNamePrefix 'SRR8112647/_STARpass2/' \
--outFilterIntronMotifs None \
--outFilterMatchNminOverLread 0.33 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.1 \
--outFilterMultimapNmax 20 \
--outFilterScoreMinOverLread 0.33 \
--outFilterType BySJout \
--outSAMattrRGline "ID:SRR8112647" "SM:SRS3982919" "PL:ILLUMINA" "PM:Illumina HiSeq 3000" \
--outSAMattributes NH HI AS nM NM ch \
--outSAMstrandField intronMotif \
--outSAMtype BAM Unsorted \
--outSAMunmapped Within \
--quantMode GeneCounts \
--readFilesCommand zcat \
--sjdbFileChrStartEnd 'SRR8112647/_STARpass1/SJ.filtered.out.tab'
Again in the Log.out file it's confusing what is the total number of junctions after inserting:
Sep 11 20:42:47 Loaded database junctions from the generated genome star_grch38p2_d1_vd1_gtfv22_rl150/sjdbList.out.tab: 347441 total junctions
Loaded database junctions from file: SRR8112647/_STARpass1/SJ.filtered.out.tab, total number of junctions: 704582 junctions
Sep 11 20:42:47 Loaded database junctions from the pGe.sjdbFileChrStartEnd file(s), 704582 total junctions
WARNING: long repeat for junction # 250346 : chr15 88855522 88855578; left shift = 14; right shift = 255
WARNING: long repeat for junction # 292573 : chr18 46969610 47029116; left shift = 183; right shift = 255
WARNING: long repeat for junction # 593204 : chr15 88855522 88855578; left shift = 14; right shift = 255
Sep 11 20:42:48 Finished preparing junctions
Sep 11 20:42:48 ..... inserting junctions into the genome indices
Sep 11 20:43:39 Finished SA search: number of new junctions=138464, old junctions=347441
Sep 11 20:43:58 Finished sorting SA indicesL nInd=82521930
Genome size with junctions=3961315803 3816030208 145285595
GstrandBit1=32 GstrandBit=32
Sep 11 20:44:59 Finished inserting junction indices
Sep 11 20:45:11 Finished SAi
Sep 11 20:45:12 ..... finished inserting junctions into genome
I've looked at whether runs which have slow 2nd pass mapping rate have any rRNA contamination with bbduk
but they do not. I've also found two more example runs from another study SRP070710 that have slow 2nd pass mapping even though they do not have a large number of new junctions SRR3184294 and SRR3184295. Worth looking at if you ever have time to see what STAR is doing that causes a huge slow down in mapping rate between the 1st and 2nd pass.
Hi Leandro,
using --sjdbOverhang 100 for 2x150 reads is fine, it changes only a very small portion of alignments.
30% multimapping reads is on the high side for normal human data. You need to compare other mapping statistics in the Log.final.out file to see if anything else is different between the samples. What is the mapping rate for the 1st pass?
365,607 is the total number of detected junctions (novel and annotated). In the 2nd pass, Log.out reports number of new junctions=138464, old junctions=347441 The old junctions are the annotated (from GTF) and new 138464 junctions are (only) novel junctions from the SJ.out.tab.filtered. Note that you can filter out all the annotated junctions from the SJ.out.tab (i.e. keep only col6==0) since all annotated junctions are taken from the GTF file.
There are further filtering steps to increase the mapping speed of the 2nd pass, if you are willing to try them. I am not sure if it's worse the time though. :) First, I would try --outFilterMatchNminOverLread 0.66 (default). It is possible that the 0.33 value allows a number of false junctions which slows down the 2nd pass. The next one is to take out all non-canonical junctions (i.e. keep col5>0 in SJ.out.tab) and junction supported by only multimappers (i.e. keep col7>0). If this does not help, you can try to filter out junctions supported by few reads, i.e. do a series of runs keeping col7>=2,3,4...
If none of the above helps, I will try to look into it when I have time, since this is public data. I have encountered a few cases where the slowdown was caused by an abnormally highly expressed locus in the genome with a few splice junctions.
Cheers Alex
@alexdobin thanks for the advice.
30% multimapping reads is on the high side for normal human data. You need to compare other mapping statistics in the Log.final.out file to see if anything else is different between the samples. What is the mapping rate for the 1st pass?
All the runs from the public study SRP166889 have a 1st pass mapping rate is ~250 M/hr, 2nd pass ~30 M/hr, and a multi-mapping of ~30%. I found no significant rRNA contamination so not sure why.
For your own personal interest a single good example is from another study SRP094781. I would look at is SRR5088841. It has >250 M/hr 1st pass but then on the 2nd pass it had the slowest speed I've seen 15-20 M/hr. It has a good unique mapping rate ~90% and the 1st pass didn't add many junctions. So very interesting to know what in this run is causing STAR to go much slower.
Sorry edited above the good example run should be SRR5088841
365,607 is the total number of detected junctions (novel and annotated). In the 2nd pass, Log.out reports number of new junctions=138464, old junctions=347441 The old junctions are the annotated (from GTF) and new 138464 junctions are (only) novel junctions from the SJ.out.tab.filtered.
Need to ask you here, could you explain how these numbers are related? Since you said 365,607 = novel + annotated, and annotated = 347441 and novel = 138464 then the numbers don't add up.
The next one is to take out all non-canonical junctions (i.e. keep col5>0 in SJ.out.tab)
I'm not familiar with what the difference is between canonical and non-canonical junctions, but from a quick background read it seems that non-canonical junctions are important? Sorry I'm just trying to make sure that whatever I filter out isn't changing the end results (which currently in my case are the gene read count quants)
Hi Leandro,
old junctions=347441 is the total number of annotated junctions from GTF, not all of them are detected. new junctions=138464 are novel detected junctions 365607 = wc -l SJ.out.tab = total number of detected junctions = new junctions + detected annotated junctions detected annotated junctions= 365607-138464 = subset of 347441 all annotated junctions.
What I call non-canonical junctions are non GT/AG, GC/AG, AT/AC - sometimes the latter two (which are much rarer) are also called non-canonical, or semi-canonical. These are motifs for which the splicing mechanism is well established. For all other motifs there is no known biochemical mechanism and most splicing experts do not believe in them. Some of them occur close to canonical junctions and are clearly small spliceosome errors. In the simulations, the false discovery rate for non-canonical junction detection is much higher than for canonical. Filtering non-canonical junctions is safe and commonly accepted - unless you want to study them specifically.
Cheers Alex
Hi Leandro I have run the SRR5088841 case with --outFilterMultimapNmax 20 --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --twopassMode Basic
h38, Gencode26
These are shortish reads (~2x50) and have high multimapping rate ~30%, and low non-mapped rate ~7%.
The 1st pass speed (10 threads) was ~630M/hr, and the 2nd ~160M/hr, so about ~4-fold slowdown.
Filtering of junctions: $ awk '$6==0' _STARpass1/SJ.out.tab | wc Number of novel junctions in the 1st pass=10960
awk '$1~/chr[1-2XY]/ && $6==0' _STARpass1/SJ.out.tab Novel junctions for chr1-22,X,Y = 8060
awk '$1~/chr[1-2XY]/ && $6==0 && $5>0 && $7>0' _STARpass1/SJ.out.tab > SJ.filt Novel junctions for chr1-22,X,Y , canonical and supported by a unique mapper = 5041
So this filtering removes ~50% of the novel junctions. Then I run the 2nd pass with --sjdbFileChrStartEnd SJ.filt and the speed got up to ~500M/hr which is decently close to the 1st pass speed of 630M/hr.
You can try to tease out precisely which filtering has the largest effect. My bet is on unique mapper support, as this sample has a high rate of multimappers.
Cheers Alex
I have run the SRR5088841 case with --outFilterMultimapNmax 20 --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --twopassMode Basic
h38, Gencode26
These are shortish reads (~2x50) and have high multimapping rate ~30%, and low non-mapped rate ~7%.
The 1st pass speed (10 threads) was ~630M/hr, and the 2nd ~160M/hr, so about ~4-fold slowdown.
Filtering of junctions: $ awk '$6==0' _STARpass1/SJ.out.tab | wc Number of novel junctions in the 1st pass=10960
awk '$1~/chr[1-2XY]/ && $6==0' _STARpass1/SJ.out.tab Novel junctions for chr1-22,X,Y = 8060
awk '$1~/chr[1-2XY]/ && $6==0 && $5>0 && $7>0' _STARpass1/SJ.out.tab > SJ.filt Novel junctions for chr1-22,X,Y , canonical and supported by a unique mapper = 5041
So this filtering removes ~50% of the novel junctions. Then I run the 2nd pass with --sjdbFileChrStartEnd SJ.filt and the speed got up to ~500M/hr which is decently close to the 1st pass speed of 630M/hr.
I sincerely apologize Alex, I gave the wrong run because I'd been looking at a few different ones over the last week that were slow on the 2nd pass and got it mixed up. I believe the correct run I should've referred to was SRR5088824. Either way I think I will try your steps with this run and also runs from study SRP166889.
I would try --outFilterMatchNminOverLread 0.66 (default). It is possible that the 0.33 value allows a number of false junctions which slows down the 2nd pass.
Did the default in a previous version of STAR used to be 0.33? Not sure why the NCI GDC chose to alter from the STAR default to 0.33 for both --outFilterMatchNminOverLread
and --outFilterScoreMinOverLread
(see on page here click on tab "Dr15plus" <a href=https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#rna-seq-alignment-command-line-parameters" target="_blank>NCI GDC RNA-Seq Alignment Command Line Parameters).
I have a separate question regarding the "manual" two-pass with the intermediate SJ.out filtering. My updated STAR workflow uses a generated genome index without GTF file and then I include the GTF file on the fly during the two-pass mapping command so that I can specify the exact --sjdbOverhang
for the study or group of runs I'm aligning. I know you said it makes not much difference but did this before.
Anyway, with the manual two-pass do I need to specify the GTF file in both 1st pass and 2nd pass STAR commands?
Anyway, with the manual two-pass do I need to specify the GTF file in both 1st pass and 2nd pass STAR commands?
Yes!
Either way I think I will try your steps with this run and also runs from study SRP166889.
Great - let me know how this goes.
Did the default in a previous version of STAR used to be 0.33? Not sure why the NCI GDC chose to alter from the STAR default to 0.33 for both --outFilterMatchNminOverLread and --outFilterScoreMinOverLread
No! The default and recommended values have always been 0.66. Reducing them below 0.5 allows for unpaired single-end alignments and increases the mapping rate - but at the cost of higher false alignments.
Regarding your recommended SJ.out filters after the 1st pass. A possible nice feature request would be to have an option where you could specify one or more 1st pass filter types therefore being able to still run automatic two-pass mode. For example:
--twopass1SJfilter Chromosomal Canonical Unique
I know the Chromosomal
would be a slight more pain to do since it's more difficult to know what species and what naming convention is used. But there are ways for sure, maybe with another option, to have users specify what they want.
Have a question about an important difference I noticed between manual and auto two-pass modes. The 1st pass SJ.out.tab files aren't the same between both modes.
Maybe I'm not running the manual 1st pass with the correct "standard parameters", though I followed your advice of what parameters to exclude. Here's my manual 1st pass:
STAR \
--runThreadN 10 \
--readFilesIn 'SRR5088824/SRR5088824_pass_1.fastq.gz' 'SRR5088824/SRR5088824_pass_2.fastq.gz' \
--alignIntronMax 1000000 \
--alignIntronMin 20 \
--alignMatesGapMax 1000000 \
--alignSJDBoverhangMin 1 \
--alignSJoverhangMin 8 \
--alignSoftClipAtReferenceEnds Yes \
--genomeDir 'star_genome_grch38p2_d1_vd1_gtfv22' \
--genomeLoad NoSharedMemory \
--limitSjdbInsertNsj 1200000 \
--outFileNamePrefix 'SRR5088824/_STARpass1/' \
--outFilterIntronMotifs None \
--outFilterMatchNminOverLread 0.33 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.1 \
--outFilterMultimapNmax 20 \
--outFilterScoreMinOverLread 0.33 \
--outSAMtype None \
--readFilesCommand zcat \
--sjdbGTFfile 'gencode.v22.annotation.gtf' \
--sjdbOverhang 50
The resulting SJ.out.tab file has:
$ wc -l SRR5088824/_STARpass1/SJ.out.tab
210377 SRR5088824/_STARpass1/SJ.out.tab
When I run the equivalent two-pass mode:
STAR \
--runThreadN 10 \
--readFilesIn 'SRR5088824/SRR5088824_pass_1.fastq.gz' 'SRR5088824/SRR5088824_pass_2.fastq.gz' \
--alignIntronMax 1000000 \
--alignIntronMin 20 \
--alignMatesGapMax 1000000 \
--alignSJDBoverhangMin 1 \
--alignSJoverhangMin 8 \
--alignSoftClipAtReferenceEnds Yes \
--chimJunctionOverhangMin 15 \
--chimMainSegmentMultNmax 1 \
--chimOutType Junctions SeparateSAMold WithinBAM SoftClip \
--chimSegmentMin 15 \
--genomeDir 'star_genome_grch38p2_d1_vd1_gtfv22' \
--genomeLoad NoSharedMemory \
--limitSjdbInsertNsj 1200000 \
--outFileNamePrefix 'SRR5088824/' \
--outFilterIntronMotifs None \
--outFilterMatchNminOverLread 0.33 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.1 \
--outFilterMultimapNmax 20 \
--outFilterScoreMinOverLread 0.33 \
--outFilterType BySJout \
--outSAMattrRGline "ID:SRR5088824" "SM:SRS1844540" "PL:ILLUMINA" "PM:Illumina Genome Analyzer" \
--outSAMattributes NH HI AS nM NM ch \
--outSAMstrandField intronMotif \
--outSAMtype BAM Unsorted \
--outSAMunmapped Within \
--quantMode GeneCounts \
--readFilesCommand zcat \
--sjdbGTFfile 'gencode.v22.annotation.gtf' \
--sjdbOverhang 50 \
--twopassMode Basic
The resulting 1st pass SJ.out.tab file has:
$ wc -l SRR5088824/_STARpass1/SJ.out.tab
210433 SRR5088824/_STARpass1/SJ.out.tab
So a difference of 56 junctions in this example.
Dear @alexdobin - running the "manual" two-pass with filtering has sped up all of the slow sample runs. Thank you very much. If I may ask you for help on my last question above, I would appreciate it. Basically what parameter settings should be made to generate the same 1st pass SJ.out.tab in manual mode as the one generated by auto two-pass mode.
I just saw now that there is a Google group for general method questions about STAR, so after this I will ask any future questions now related to possible bug etc there.
Hi Leandro,
sorry, missed your question... bumping it up is a good idea (every 3 days or so). :) Not sure why this is happening, must be some combination of parameters, need to check it.
Cheers Alex
Hi Leandro,
I have run 1- and 2-pass with your parameter on my dataset, and got exactly the same SJ.out.tab... Could you please send me the Log.out files from two runs?
Cheers Alex
Manual 1st pass Log.out Auto 1st pass Log.out
I double-checked the 1st pass SJ.out.tabs and same as I wrote above. Remember though I'm using the NCI GDC genome FASTA and GTF files https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files. I built genome index with:
STAR \
--runThreadN 10 \
--runMode genomeGenerate \
--genomeDir star_genome_grch38p2_d1_vd1_gtfv22 \
--genomeFastaFiles GRCh38.d1.vd1.fa
Hi @hermidalc
I was also planning for running STAR in 2-pass mode and I was trying to follow your steps. However, I had this weird observation
When I use --outFilterType BySJout
option, I see a higher number of junctions reported rather than when I remove this filter (which is using the default Normal filter). I think BySJout is a more stringent filter and therefore must report lesser junctions. This is from the 1-pass alignment
**After removing --outFilterType BySJout**
(base) [abc@hpc 1st_pass]$ wc -l *SJ.out.tab
36991 16DAF_MM17_CC.SJ.out.tab
38672 16DAF_MM18_CC.SJ.out.tab
55117 4DAF_MM05_CC.SJ.out.tab
45731 4DAF_MM13_CC.SJ.out.tab
10860 8DAF_MM21_CC.SJ.out.tab
187371 total
**Using --outFilterType BySJout**
(base) [abc@hpc 1st_pass]$ wc -l ../sj/*SJ.out.tab
37286 ../sj/16DAF_MM17_CC.SJ.out.tab
38964 ../sj/16DAF_MM18_CC.SJ.out.tab
55333 ../sj/4DAF_MM05_CC.SJ.out.tab
46116 ../sj/4DAF_MM13_CC.SJ.out.tab
10919 ../sj/8DAF_MM21_CC.SJ.out.tab
188618 total
I am using 2.7.6a
STAR version and the following options. I need to output BAM to see how much improvement in unique alignment do I see post-second pass
STAR --genomeDir $index --runThreadN 20 \
--readFilesIn $raw_data/${i}_trim.fastq.gz \
--outFileNamePrefix $out/$i. \
--outSAMstrandField intronMotif \
--sjdbGTFfile $gtf \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 20 \
--readFilesCommand zcat \
--quantMode GeneCounts \
--outSAMunmapped Within \
--outSAMattributes NH HI AS NM MD \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outFilterMismatchNmax 999 \
--sjdbScore 1
Hi @scseekers
this is actually normal, I believe. The --outFilterType BySJout does not change the filtering of junctions that go into SJ.out.tab. However, it prevent some alignments - those with junctions not passing the filters, and instead of these alignments it may record new ones - those that can contribute new junctions to SJ.out.tab.
Cheers Alex
Hi. I have a related query.
I detected ~ 300,000 new junctions after the suggested filtering when aligning reads for ~ 100 samples (each aligned separately) after the first pass of alignment. Then I generated a new index by using these new junctions in addition to GTF file and perform the second pass alignment afterwards. I also observed a slow down of nearly 4 fold.
Is the slow down a bad sign. Should I worry and try to fix it? The number of uniquely mapped reads are also reduced. Here are the tables showing the proportion of uniquely mapped reads and the corresponding number of samples:
Stats of uniquely mapping reads in first pass:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">
% of unique mapping reads | number of samples -- | -- 65 | 1 73 | 1 74 | 1 75 | 2 76 | 3 77 | 8 78 | 6 79 | 11 80 | 8 81 | 14 82 | 11 83 | 9 84 | 8 85 | 4 86 | 1
I'm using STAR 2.7.2b and have been aligning SRA study SRP166889. Unlike others I've done it seems like the reads from this study cause STAR to have a very slow 2nd pass. Take for example from the study SRR8112647:
When looking at
Log.progress.out
2nd pass M/hr is ~30 M/hr, almost an order of magnitude slower than 1st pass ~250 M/hr. What could be the cause of this?GRCh38 genome index created using NCI GDC source files and standard params:
SRA paired-end FASTQs created using recommended standard params: