Can BWA use aln to shorter than 70bp and mem to longer using bam_pipline

jiangzy26 commented 1 day ago

Hi, I am running paleomix bam_pipeline, I want to use bwa aln when the reads shorter than 70bp, and when read longer than 70bp, using bwa mem algorithm, I want to set this, could you share your help? Thank you very much.

jiangzy26 commented 1 day ago

I found when I use bwa aln algorithm, the output bam file is so small than the output of bwa mem, then, I checked the average read length of the output 'reads.settings' of paleomix, I found the following, but something weired is my raw sequence read is 150bp, why the average read length of paleomix result is longer than the original 150bp, could you share your help? Thank you very much.

Average length of retained reads: 145.864 Average length of retained reads: 143.345 Average length of retained reads: 151.938 Average length of retained reads: 149.96 Average length of retained reads: 143.507 Average length of retained reads: 149.509 Average length of retained reads: 151.975 Average length of retained reads: 148.551 Average length of retained reads: 144.634 Average length of retained reads: 141.026 Average length of retained reads: 148.991 Average length of retained reads: 172.327 Average length of retained reads: 143.878 Average length of retained reads: 157.156 Average length of retained reads: 145.448 Average length of retained reads: 141.321 Average length of retained reads: 157.011 Average length of retained reads: 143.165 Average length of retained reads: 166.618 Average length of retained reads: 154.032 Average length of retained reads: 177.617 Average length of retained reads: 147.66 Average length of retained reads: 141.177 Average length of retained reads: 150.723 Average length of retained reads: 157.71 Average length of retained reads: 163.685 Average length of retained reads: 159.29 Average length of retained reads: 182.8 Average length of retained reads: 180.887 Average length of retained reads: 175.577

jiangzy26 commented 1 day ago

When I don't provide the adapters, the reads.collapsed.gz is ~4.5G, the reads.pared1.truncated.gz is ~2G When I provide the adapters, the reads.collapsed.gz is ~5G, the reads.pared1.truncated.gz is ~316M, I am confused by the result, whether I should provide the adapters, could you share your help? Thanks

MikkelSchubert commented 2 hours ago

Hi,

Regarding your first question, do you want to select the aln/mem algorithm based on the length of the raw FASTQ reads or after trimming has been performed?

If it is based on the raw data, then you can set the alignment algorithm per sample or per library as described here:

https://paleomix.readthedocs.io/en/stable/bam_pipeline/makefile.html#overriding-global-settings

If, on the other hand, you'd like to select the alignment algorithm based on the average length after trimming, then this is currently not possible in paleomix.

Regarding your second question, this is because of read merging/collapsing resulting in longer reads:

When the --collapse option is used, AdapterRemoval will attempt to merge PE reads that are overlapping at least 11bp (by default), which results in a maximum reads length of 2x your input read length - 11 bp of overlap, which would for example be 289 bp if your input was 150 bp PE reads.

Regarding your third question, I would strongly recommend that you manually specify the adapters, if do not match the default (Illumina) sequences. The file sizes you shared indicates that there are a significant number of insert sequences in your data that are shorter than your read length. The result of this is that trimming and (optionally) merging removes a lot of your input data. However, with the wrong adapter sequences you would get many false negatives, where adapter sequences isn't trimmed from the reads.

See here for an overview of how AdapterRemoval uses the adapter sequences you provide to carry out trimming/merging:

https://adapterremoval.readthedocs.io/en/v3.0.0-alpha2/detailed_overview.html#read-processing

Best, Mikkel

MikkelSchubert / paleomix

Can BWA use aln to shorter than 70bp and mem to longer using bam_pipline #58