MHH-RCUG / nf_wochenende

A nextflow version of the Wochenende reference metagenome binning and visualization pipeline
MIT License
13 stars 2 forks source link

PE reads - bwa mem alignment and softclipping issues #88

Open colindaven opened 2 years ago

colindaven commented 2 years ago

Bug report by @irosenboom

It was reported for 2x150bp Novaseq reads that bwa mem performed alignments with extensive softclipping.

For example, some alignments were of 18bp with 82 bp simply softclipped off.

This led to serious problems with the species detected. We never noticed this before with single ended 75bp reads (confirmed here by test with the same datasets).

For these datasets, minimap2short led to much better alignments (provided the supplementary and secondary alignments were removed).

Solutions

BWA mem options to test:

From https://www.mankier.com/1/bwa


ALGORITHM OPTIONS:
-t INT

    Number of threads [1]
-k INT

    Minimum seed length. Matches shorter than INT will be missed. The alignment speed is usually insensitive to this value unless it significantly deviates from 20. [19]
-w INT

    Band width. Essentially, gaps longer than INT will not be found. Note that the maximum gap length is also affected by the scoring matrix and the hit length, not solely determined by this option. [100]
-d INT

    Off-diagonal X-dropoff (Z-dropoff). Stop extension when the difference between the best and the current extension score is above |i-j|*A+INT, where i and j are the current positions of the query and reference, respectively, and A is the matching score. Z-dropoff is similar to BLAST's X-dropoff except that it doesn't penalize gaps in one of the sequences in the alignment. Z-dropoff not only avoids unnecessary extension, but also reduces poor alignments inside a long good alignment. [100]
-r FLOAT

    Trigger re-seeding for a MEM longer than minSeedLen*FLOAT. This is a key heuristic parameter for tuning the performance. Larger value yields fewer seeds, which leads to faster alignment speed but lower accuracy. [1.5]
-c INT

    Discard a MEM if it has more than INT occurence in the genome. This is an insensitive parameter. [500]
-D FLOAT

    Drop chains shorter than FLOAT fraction of the longest overlapping chain [0.5]
-m INT

    Perform at most INT rounds of mate-SW [50]
-W INT

    Drop a chain if the number of bases in seeds is smaller than INT. This option is primarily used for longer contigs/reads. When positive, it also affects seed filtering. [0]
-P

    In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair.
SCORING OPTIONS:
-A INT

    Matching score. [1]
-B INT

    Mismatch penalty. The sequence error rate is approximately: {.75 * exp[-log(4) * B/A]}. [4]
-O INT[,INT]

    Gap open penalty. If two numbers are specified, the first is the penalty of openning a deletion and the second for openning an insertion. [6]
-E INT[,INT]

    Gap extension penalty. If two numbers are specified, the first is the penalty of extending a deletion and second for extending an insertion. A gap of length k costs O + k*E (i.e. [-O](https://www.mankier.com/1/bwa#-O) is for opening a zero-length gap). [1]
-L INT[,INT]

    Clipping penalty. When performing SW extension, BWA-MEM keeps track of the best score reaching the end of query. If this score is larger than the best SW score minus the clipping penalty, clipping will not be applied. Note that in this case, the SAM AS tag reports the best SW score; clipping penalty is not deduced. If two numbers are provided, the first is for 5'-end clipping and second for 3'-end clipping. [5]
-U INT

    Penalty for an unpaired read pair. BWA-MEM scores an unpaired read pair as scoreRead1+scoreRead2-INT and scores a paired as scoreRead1+scoreRead2-insertPenalty. It compares these two scores to determine whether we should force pairing. A larger value leads to more aggressive read pair. [17]
-x STR

    Read type. Changes multiple parameters unless overriden [null]

    pacbio:

        -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)
    ont2d:

        -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)
    intractg:

        -B9 -O16 -L5 (intra-species contigs to ref)

INPUT/OUTPUT OPTIONS:
-p

    Smart pairing. If two adjacent reads have the same name, they are considered to form a read pair. This way, paired-end and single-end reads can be mixed in a single FASTA/Q stream.
-R STR

    Complete read group header line. '\t' can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is '@RG\tID:foo\tSM:bar'. [null]
-H ARG

    If ARG starts with @, it is interpreted as a string and gets inserted into the output SAM header; otherwise, ARG is interpreted as a file with all lines starting with @ in the file inserted into the SAM header. [null]
-o FILE

    Write the output SAM file to FILE. For compatibility with other BWA commands, this option may also be given as -f FILE. [standard ouptut]
-q

    Don't reduce the mapping quality of split alignment of lower alignment score.
-5

    For split alignment, mark the segment with the smallest coordinate as the primary. It automatically applies option [-q](https://www.mankier.com/1/bwa#-q) as well. This option may help some Hi-C pipelines. By default, BWA-MEM marks highest scoring segment as primary.
-K  INT

    Process INT input bases in each batch regardless of the number of threads in use [10000000*nThreads]. By default, the batch size is proportional to the number of threads in use. Because the inferred insert size distribution slightly depends on the batch size, using different number of threads may produce different output. Specifying this option helps reproducibility.
-T INT

    Don't output alignment with score lower than INT. This option affects output and occasionally SAM flag 2. [30]
-j

    Treat ALT contigs as part of the primary assembly (i.e. ignore the db.prefix.alt file).
-h INT[,INT2]

    If a query has not more than INT hits with score higher than 80% of the best hit, output them all in the XA tag. If INT2 is specified, BWA-MEM outputs up to INT2 hits if the list contains a hit to an ALT contig. [5,200]
-a

    Output all found alignments for single-end or unpaired paired-end reads. These alignments will be flagged as secondary alignments.
-C

    Append FASTA/Q comment to SAM output. This option can be used to transfer read meta information (e.g. barcode) to the SAM output. Note that the FASTA/Q comment (the string after a space in the header line) must conform the SAM spec (e.g. BC:Z:CGTAC). Malformated comments lead to incorrect SAM output.
-Y

    Use soft clipping CIGAR operation for supplementary alignments. By default, BWA-MEM uses soft clipping for the primary alignment and hard clipping for supplementary alignments.
-M

    Mark shorter split hits as secondary (for Picard compatibility).
colindaven commented 2 years ago

@irosenboom are there any public PE data which this has been observed on ? Or any data which you can upload after removing the human reads ? Thanks

irosenboom commented 2 years ago

Hi @colindaven , I uploaded three anonymized PE samples after removal of human reads to my academic cloud. Please feel free to use them for testing: https://sync.academiccloud.de/index.php/s/n6eDexgWjq3M75R

irosenboom commented 2 years ago

Where exactly did you find information on the clipping penalty (clipping penalty # was 5, try 20)? The only options set for bwa in run_Wochenende.py are "mem" (line 716), "-t" (line 717) and "-R" (line 719). Is 5 the default clipping penalty? I can try adding a clipping penalty with "-L 20" for my data.

irosenboom commented 2 years ago

Adding a "-L 20" almost halved the number of species for the samples with abnormally high numbers. However, this means that there are still up to 800 species detected.

colindaven commented 2 years ago

Thanks for the data Ilona. Will have a look when I get a chance.

Half is good news - but 800 species is still bad news ! From what I remember, Kraken, metaphlan and centrifuge also found high numbers of alignments with these data - correct ? So it is unlikely we will ever get to perfection.

Because I believe you never saw this with single read data, can you try to just use the R1 reads in single ended alignment mode for your test sample ?

Is was thinking about this, and it cannot be the case that bwa mem just aligns 18bp segments of 100bp single reads, and soft masks the rest, and treats all this as a unique alignment.

It must be that one of the paired reads eg R1 is aligned (with high score, eg low number of mismatches?), and the other eg R2 is then fit with extensive soft clipping.

Yes, I think 5 is the default clipping penalty according to the manual above. You could try "-L 40" or "-L 60" as well to see what happens.

irosenboom commented 2 years ago

Hi, I have just repeated the analysis with the R1 reads in single ended alignment mode and wow, this also solved the problem!

colindaven commented 2 years ago

Nice, still need to solve this though as PE reads are the most common entry point I'd guess.

Can you please repeat your single ended alignments with the R2 - if this works. Maybe rename them to R1 in a separate directory to trick the tool (it always expects R1 reads is SE alignment mode is selected). I wonder if there is a quality problem with the R2 reads, or if it really a problem with the alignment criteria for PE reads only.

It could be a good hit from a poor R2 read, and a very very poor 18bp hit (with the rest soft masked) from a good R1 read.

Thanks!

colindaven commented 2 years ago

The files are too big, I can't transfer them reliably. Can you please just supply one sample with R1 and R2 with max size about 50 MB each ?

To do this please gunzip the files, then


head -n 100000 x_R1.fastq > tmp_R1.fastq
head -n 100000 x_R2.fastq > tmp_R2.fastq

Then check file size with ls -lh *.fastq

Thanks and apologies.

irosenboom commented 2 years ago

Smaller files are now available in the cloud :)

Wochenende is also running with the R2 files in SE mode, I'll let you know when it's done.

colindaven commented 2 years ago

Thanks, I had some weird issues today caused by too many nested conda envs. My bad.

Anyway, thanks for the new data.

These are judged by haybaler output. Keep in mind I am using the user settings in the nextflow.config, not the permissive dev config. eg

  // Recommended settings for most users
  haybaler_readcount_limit = 10
  haybaler_rpmm_limit = 300

Results

output_ilona_SE/haybaler/haybaler_output$ less read_count_haybaler_short.csv
species chr_length      gc_ref  tmp_sample1_R1  tmp_sample3_R1
Rothia_mucilaginosa     2264603.0       59.62   1560.0  3292.0
Prevotella_melaninogenica       1796408.0       40.91   1106.0  3229.0
Veillonella_atypica     2071952.0       39.11   1120.0  2031.0
Prevotella_jejuni       2204592.0       41.25   2646.0  354.0
Streptococcus_parasanguinis     2153652.0       41.71   527.0   1218.0
output_ilona_PE_mm2/haybaler/haybaler_output$ head -n6 read_count_haybaler_short.csv
species chr_length      gc_ref  tmp_sample1_R1  tmp_sample2_R1  tmp_sample3_R1
Prevotella_melaninogenica       1796408.0       40.91   1837.0  2433.0  5022.0
Rothia_mucilaginosa     2264603.0       59.62   2661.0  514.0   5635.0
Prevotella_jejuni       2204592.0       41.25   4149.0  866.0   626.0
Veillonella_atypica     2071952.0       39.11   1628.0  425.0   2983.0
Fusobacterium_periodonticum     2222370.0       28.2    644.0   2954.0  475.0

minimap2 params:

  aligner = "minimap2short"    // Aligner software to use (bwamem, minimap2short, minimap2long, ngmlr)
  mismatches = "5"          // Exclude reads with x of more mismatches. Suggest 3 for short reads, 10000+ for long reads (integer)
  readType = "PE"             // SE single ended or PE paired end (SE, PE)
  mapping_quality = "mq30"    // Mapping quality filter. (mq20, mq30)

cheers

irosenboom commented 2 years ago
colindaven commented 2 years ago

I downloaded alternative PE metagenome data from EBI and can confirm the soft-clipping problem again.

Data : first 200k reads , 800k lines from

SRR13594152_200k_R1.fastq SRR13594152_200k_R2.fastq

CIGAR string show 126 soft clipped bases, 21 matches and 3 soft clipped bases. 126S21M3S

See https://www.drive5.com/usearch/manual/cigar.html

I'll try to optimize using this dataset since the tiny ones you gave me lead to 0 alignments with bwa-mem. I even get no alignments in the bam - they might have been eliminated by the non-supplementary non-secondary code I just recently introduced. I'll try to check this.

colindaven commented 2 years ago

I set the bwa mem -L parameter to -L 60. Few changes in soft clipping observed. I then set the -L parameter to -L 1000,1000. Still soft clipped reads and poor alignments are found.

eg this read pair - alignment built on 41 matches and 19 matches. This has a MQ of 40 so will likely be output ...


SRR13594152.65811       99      1_AP012320_1_Rubrivivax_gelatinosus_IL144_DNA__complete_genome_BAC      2545085 40      43S41M66S       =       2545404 338     CTCGGCCTCGACGGCGCA
SRR13594152.65811       147     1_AP012320_1_Rubrivivax_gelatinosus_IL144_DNA__complete_genome_BAC      2545404 40      124S19M7S       =       2545085 -338    CGCGC

Potentially we need to set the scoring differently for PE reads -T 60 ?, which is just weird. I thought bwa was a) the most tested aligner in the world and b) set up for PE reads. I guess it's mainly appropriate for genomes, not metagenomes, and tuned accordingly.

-T INT minimum score to output [30]

colindaven commented 2 years ago

@irosenboom please avoid using the fastp read trimmer with PE data at present, it is not feeding the data through to the aligner properly. Which explains why using minimap2short with PE and SE data led to identical results for me.

colindaven commented 2 years ago

After not seeing much improvement despite setting -T 60 (default is 30) for PE reads, I would like to disallow bwa mem for PE reads due to the strong variation in results. Is this ok with you Ilona ?

Details here:

https://github.com/MHH-RCUG/nf_wochenende/wiki/Troubleshooting#bwa-mem-with-paired-end-pe-reads

Also - this is a major issue - minimap2short attributes reads to common airway species from the datasets you gave me in PE mode, but bwa mem does not.

here mm2

species                                                                                             SRR11207337_mock_sm_R1  chr_length  gc_ref  tmp_sample1_R1
1_AE004091_2_Pseudomonas_aeruginosa_PAO1__complete_genome_BAC                                       86206.0                 6264404.0   66.57   0.0
1_AE006468_2_Salmonella_enterica_subsp__enterica_serovar_Typhimurium_str__LT2__complete_genome_BAC  52056.0                 4857450.0   52.21   0.0
1_U00096_3_Escherichia_coli_str__K_12_substr__MG1655__complete_genome_BAC                           50895.0                 4641652.0   50.77   0.0
1_CP011051_1_Bacillus_intestinalis_strain_T30__complete_genome_BAC                                  48550.0                 4031727.0   43.89   0.0
1_AE017262_2_Listeria_monocytogenes_str__4b_F2365__complete_genome_BAC                              38025.0                 2905187.0   38.04   0.0
1_AJ938182_1_Staphylococcus_aureus_RF122_complete_genome_BAC                                        34099.0                 2742531.0   32.77   0.0
1_AE016830_1_Enterococcus_faecalis_V583_chromosome__complete_genome_BAC                             31505.0                 3218031.0   37.52   0.0
1_AP008937_1_Lactobacillus_fermentum_IFO_3956_DNA__complete_genome_BAC                              17107.0                 2098685.0   51.46   0.0
1_CP023863_1_Prevotella_jejuni_strain_CD3_33_chromosome_I__complete_sequence_BAC                    0.0                     2204592.0   41.25   4149.0
1_AP011540_1_Rothia_mucilaginosa_DY_18_DNA__complete_genome_BAC                                     0.0                     2264603.0   59.62   2661.0
1_CP014787_1_Lactobacillus_oris_strain_J_1_chromosome__complete_genome_BAC                          2164.0                  3171194.0   51.36   0.0
1_CP002122_1_Prevotella_melaninogenica_ATCC_25845_chromosome_I__complete_sequence_BAC               0.0                     1796408.0   40.91   1837.0
1_CP020566_1_Veillonella_atypica_strain_OK5_chromosome__complete_genome_BAC                         0.0                     2071952.0   39.11   1628.0

here bwa mem - different datasets, no mock comm, but illustrates my point - tmp_sample should have some reads attributed.


species                           SRR13594152_200k_R1  chr_length  gc_ref  tmp_sample1_R1  tmp_sample2_R1  tmp_sample3_R1
Alicycliphilus_denitrificans      2207.0               4637013.0   68.3    0.0             0.0             0.0
Ignavibacterium_album             1302.0               3658997.0   33.93   0.0             0.0             0.0
Methylomonas_koyamae              397.0                4894002.0   56.43   0.0             0.0             0.0
Rubrivivax_gelatinosus            356.0                5043253.0   71.27   0.0             0.0             0.0
Methylosinus_trichosporium        332.0                4508832.0   66.05   0.0             0.0             0.0
Dechlorosoma_suillum              330.0                3806980.0   65.38   0.0             0.0             0.0
Sulfuritalea_hydrogenivorans      284.0                3802648.0   63.28   0.0             0.0             0.0
Mesorhizobium_oceanicum           283.0                5158483.0   65.21   0.0             0.0             0.0