Open colindaven opened 2 years ago
@irosenboom are there any public PE data which this has been observed on ? Or any data which you can upload after removing the human reads ? Thanks
Hi @colindaven , I uploaded three anonymized PE samples after removal of human reads to my academic cloud. Please feel free to use them for testing: https://sync.academiccloud.de/index.php/s/n6eDexgWjq3M75R
Where exactly did you find information on the clipping penalty (clipping penalty # was 5, try 20)? The only options set for bwa in run_Wochenende.py
are "mem"
(line 716), "-t"
(line 717) and "-R"
(line 719). Is 5 the default clipping penalty?
I can try adding a clipping penalty with "-L 20"
for my data.
Adding a "-L 20"
almost halved the number of species for the samples with abnormally high numbers. However, this means that there are still up to 800 species detected.
Thanks for the data Ilona. Will have a look when I get a chance.
Half is good news - but 800 species is still bad news ! From what I remember, Kraken, metaphlan and centrifuge also found high numbers of alignments with these data - correct ? So it is unlikely we will ever get to perfection.
Because I believe you never saw this with single read data, can you try to just use the R1 reads in single ended alignment mode for your test sample ?
Is was thinking about this, and it cannot be the case that bwa mem just aligns 18bp segments of 100bp single reads, and soft masks the rest, and treats all this as a unique alignment.
It must be that one of the paired reads eg R1 is aligned (with high score, eg low number of mismatches?), and the other eg R2 is then fit with extensive soft clipping.
Yes, I think 5 is the default clipping penalty according to the manual above. You could try "-L 40" or "-L 60" as well to see what happens.
Hi, I have just repeated the analysis with the R1 reads in single ended alignment mode and wow, this also solved the problem!
Nice, still need to solve this though as PE reads are the most common entry point I'd guess.
Can you please repeat your single ended alignments with the R2 - if this works. Maybe rename them to R1 in a separate directory to trick the tool (it always expects R1 reads is SE alignment mode is selected). I wonder if there is a quality problem with the R2 reads, or if it really a problem with the alignment criteria for PE reads only.
It could be a good hit from a poor R2 read, and a very very poor 18bp hit (with the rest soft masked) from a good R1 read.
Thanks!
The files are too big, I can't transfer them reliably. Can you please just supply one sample with R1 and R2 with max size about 50 MB each ?
To do this please gunzip the files, then
head -n 100000 x_R1.fastq > tmp_R1.fastq
head -n 100000 x_R2.fastq > tmp_R2.fastq
Then check file size with
ls -lh *.fastq
Thanks and apologies.
Smaller files are now available in the cloud :)
Wochenende is also running with the R2 files in SE mode, I'll let you know when it's done.
Thanks, I had some weird issues today caused by too many nested conda envs. My bad.
Anyway, thanks for the new data.
These are judged by haybaler output. Keep in mind I am using the user settings in the nextflow.config, not the permissive dev config. eg
// Recommended settings for most users
haybaler_readcount_limit = 10
haybaler_rpmm_limit = 300
Results
output_ilona_SE/haybaler/haybaler_output$ less read_count_haybaler_short.csv
species chr_length gc_ref tmp_sample1_R1 tmp_sample3_R1
Rothia_mucilaginosa 2264603.0 59.62 1560.0 3292.0
Prevotella_melaninogenica 1796408.0 40.91 1106.0 3229.0
Veillonella_atypica 2071952.0 39.11 1120.0 2031.0
Prevotella_jejuni 2204592.0 41.25 2646.0 354.0
Streptococcus_parasanguinis 2153652.0 41.71 527.0 1218.0
output_ilona_PE_mm2/haybaler/haybaler_output$ head -n6 read_count_haybaler_short.csv
species chr_length gc_ref tmp_sample1_R1 tmp_sample2_R1 tmp_sample3_R1
Prevotella_melaninogenica 1796408.0 40.91 1837.0 2433.0 5022.0
Rothia_mucilaginosa 2264603.0 59.62 2661.0 514.0 5635.0
Prevotella_jejuni 2204592.0 41.25 4149.0 866.0 626.0
Veillonella_atypica 2071952.0 39.11 1628.0 425.0 2983.0
Fusobacterium_periodonticum 2222370.0 28.2 644.0 2954.0 475.0
minimap2 params:
aligner = "minimap2short" // Aligner software to use (bwamem, minimap2short, minimap2long, ngmlr)
mismatches = "5" // Exclude reads with x of more mismatches. Suggest 3 for short reads, 10000+ for long reads (integer)
readType = "PE" // SE single ended or PE paired end (SE, PE)
mapping_quality = "mq30" // Mapping quality filter. (mq20, mq30)
cheers
I downloaded alternative PE metagenome data from EBI and can confirm the soft-clipping problem again.
Data : first 200k reads , 800k lines from
SRR13594152_200k_R1.fastq
SRR13594152_200k_R2.fastq
CIGAR string show 126 soft clipped bases, 21 matches and 3 soft clipped bases.
126S21M3S
See https://www.drive5.com/usearch/manual/cigar.html
I'll try to optimize using this dataset since the tiny ones you gave me lead to 0 alignments with bwa-mem. I even get no alignments in the bam - they might have been eliminated by the non-supplementary non-secondary code I just recently introduced. I'll try to check this.
I set the bwa mem -L parameter to -L 60
. Few changes in soft clipping observed.
I then set the -L parameter to -L 1000,1000
. Still soft clipped reads and poor alignments are found.
eg this read pair - alignment built on 41 matches and 19 matches. This has a MQ of 40 so will likely be output ...
SRR13594152.65811 99 1_AP012320_1_Rubrivivax_gelatinosus_IL144_DNA__complete_genome_BAC 2545085 40 43S41M66S = 2545404 338 CTCGGCCTCGACGGCGCA
SRR13594152.65811 147 1_AP012320_1_Rubrivivax_gelatinosus_IL144_DNA__complete_genome_BAC 2545404 40 124S19M7S = 2545085 -338 CGCGC
Potentially we need to set the scoring differently for PE reads -T 60
?, which is just weird. I thought bwa was a) the most tested aligner in the world and b) set up for PE reads. I guess it's mainly appropriate for genomes, not metagenomes, and tuned accordingly.
-T INT minimum score to output [30]
@irosenboom please avoid using the fastp
read trimmer with PE
data at present, it is not feeding the data through to the aligner properly. Which explains why using minimap2short with PE and SE data led to identical results for me.
After not seeing much improvement despite setting -T 60 (default is 30) for PE reads, I would like to disallow bwa mem for PE reads due to the strong variation in results. Is this ok with you Ilona ?
Details here:
https://github.com/MHH-RCUG/nf_wochenende/wiki/Troubleshooting#bwa-mem-with-paired-end-pe-reads
Also - this is a major issue - minimap2short attributes reads to common airway species from the datasets you gave me in PE mode, but bwa mem does not.
here mm2
species SRR11207337_mock_sm_R1 chr_length gc_ref tmp_sample1_R1
1_AE004091_2_Pseudomonas_aeruginosa_PAO1__complete_genome_BAC 86206.0 6264404.0 66.57 0.0
1_AE006468_2_Salmonella_enterica_subsp__enterica_serovar_Typhimurium_str__LT2__complete_genome_BAC 52056.0 4857450.0 52.21 0.0
1_U00096_3_Escherichia_coli_str__K_12_substr__MG1655__complete_genome_BAC 50895.0 4641652.0 50.77 0.0
1_CP011051_1_Bacillus_intestinalis_strain_T30__complete_genome_BAC 48550.0 4031727.0 43.89 0.0
1_AE017262_2_Listeria_monocytogenes_str__4b_F2365__complete_genome_BAC 38025.0 2905187.0 38.04 0.0
1_AJ938182_1_Staphylococcus_aureus_RF122_complete_genome_BAC 34099.0 2742531.0 32.77 0.0
1_AE016830_1_Enterococcus_faecalis_V583_chromosome__complete_genome_BAC 31505.0 3218031.0 37.52 0.0
1_AP008937_1_Lactobacillus_fermentum_IFO_3956_DNA__complete_genome_BAC 17107.0 2098685.0 51.46 0.0
1_CP023863_1_Prevotella_jejuni_strain_CD3_33_chromosome_I__complete_sequence_BAC 0.0 2204592.0 41.25 4149.0
1_AP011540_1_Rothia_mucilaginosa_DY_18_DNA__complete_genome_BAC 0.0 2264603.0 59.62 2661.0
1_CP014787_1_Lactobacillus_oris_strain_J_1_chromosome__complete_genome_BAC 2164.0 3171194.0 51.36 0.0
1_CP002122_1_Prevotella_melaninogenica_ATCC_25845_chromosome_I__complete_sequence_BAC 0.0 1796408.0 40.91 1837.0
1_CP020566_1_Veillonella_atypica_strain_OK5_chromosome__complete_genome_BAC 0.0 2071952.0 39.11 1628.0
here bwa mem - different datasets, no mock comm, but illustrates my point - tmp_sample should have some reads attributed.
species SRR13594152_200k_R1 chr_length gc_ref tmp_sample1_R1 tmp_sample2_R1 tmp_sample3_R1
Alicycliphilus_denitrificans 2207.0 4637013.0 68.3 0.0 0.0 0.0
Ignavibacterium_album 1302.0 3658997.0 33.93 0.0 0.0 0.0
Methylomonas_koyamae 397.0 4894002.0 56.43 0.0 0.0 0.0
Rubrivivax_gelatinosus 356.0 5043253.0 71.27 0.0 0.0 0.0
Methylosinus_trichosporium 332.0 4508832.0 66.05 0.0 0.0 0.0
Dechlorosoma_suillum 330.0 3806980.0 65.38 0.0 0.0 0.0
Sulfuritalea_hydrogenivorans 284.0 3802648.0 63.28 0.0 0.0 0.0
Mesorhizobium_oceanicum 283.0 5158483.0 65.21 0.0 0.0 0.0
Bug report by @irosenboom
It was reported for 2x150bp Novaseq reads that bwa mem performed alignments with extensive softclipping.
For example, some alignments were of 18bp with 82 bp simply softclipped off.
This led to serious problems with the species detected. We never noticed this before with single ended 75bp reads (confirmed here by test with the same datasets).
For these datasets, minimap2short led to much better alignments (provided the supplementary and secondary alignments were removed).
Solutions
bwa mem
@irosenboom can you please try this and see if your results improve ? I don't have comparable datasetsBWA mem options to test:
-M
-P
-L
clipping penalty # was 5, try 20 ? This is the most likely option to solve the issueFrom https://www.mankier.com/1/bwa