StevenWingett / FastQ-Screen

Detecting contamination in NGS data and multi-species analysis
https://stevenwingett.github.io/FastQ-Screen/
GNU General Public License v3.0
62 stars 15 forks source link

With BWA, R1 & R2 files can have very different mapping signatures #36

Closed cbird808 closed 2 years ago

cbird808 commented 3 years ago

When the aligner is set to bwa, in two separate data sets we are seeing some libs with very different "hit profiles" or "mapping signatures" between R1 and R2 files. For example, R1 may have 40% of reads map to our fish genome fasta, and R2 from the same library can have 0%. We have not observed this behavior when the aligner is set to bowtie2.

I've attached two report files, one from bwa and one from bowtie2. Realize that we are still optimizing the slurm array that runs fastq_screen, which explains why some results are missing. I will get the exact options and arguments that were used from the person running the analyses and post them.

The fastq files were previously processed with fastp to remove questionable reads (<140 bp after trimming, low complexity were filtered out) and clumpify.sh to remove PCR and optical duplicates, so the data fed to fastq_screen should be fairly pristine.

bwa_vs_bowtie2.zip

cbird808 commented 3 years ago

I just noticed that the databases are different (another member of our team is running fastq_screen). That could be the reason for the problem. I'll track that down.

bwa_vs_bowtie2_scripts_configs.zip

cbird808 commented 3 years ago

All of the databases are copies of one another, with the exception of the human. Anyway, it seems there is some odd behavior when the aligner is set to bwa.

OdysseyXD commented 3 years ago

Here I met another problem when the aligner is set to bwa . The mapping results of fastq_screen (overall 36% of my target genome, 62% of GRCh38, 12.5% of E.coli ) are quite different with samtools flagstat (overall 76% of my target genome, 95% of GRCh38, 13.86% of E.coli ) after bwa mem -t 12 -a (which is the same as the default parameters of fastq_screen --aligner bwa)

Since it's a chipseq data, so I took the same steps to analyze the Input data and the mapping results were still largely different. The fastq_screen shows no contaminants, while samtools flagstat after bwa mem -a gave me ~70% overall mapping rates to GRCh38 and 99.63% overall mapping rate to my target genome (Sscrofa11.1 )

When the aligner is set to bowtie2. The mapping results of these two fq data were consist with fast_screen --aligner bwa and with samtools flagstat after bowtie2 -k2 --very-fast-local . So I suppose it may be the problem of that samtools flagstat not fit bwa mem?

I'm new to bioinformatics and have been confuse by the different mapping results of bwa and bowtie2 for a very long time. I've searched and tried the method described in https://www.biostars.org/p/117225/ , but bowtie2 -N1 --very-sensitive-local still gave me a very low mapping rate (36%) while bwa mem pe mode gave me 82%.

So far, the result of bowtie2 is trustworthy to me.

And I start to think about "Is those research data that already published trustable? Since there must exits some people, that new to bioinformatics like me, to do the analysis work"

StevenWingett commented 3 years ago

Hi,

Interesting. To comment on the first point, my guess is that there is some kind non-genomic sequence towards the start of your lowly mapped reads read. If you remove, say the first 20bps from the start of the read, are you now able to map with BWA?

Other than that, do the R1 / R2 reads look similar in FASTQC?

Thanks, Steven

cbird808 commented 3 years ago

Thank you for the suggestions. There's nothing noticeably different between the R1 and R2 reads in the libs where this is happening. We reprocessed the data prior to employing fastq_screen, cleaned up the contaminant databases, and are running both bwa and bowtie2 again. If we see the same results, we'll pull some reads, blast them and scrutinize the ends of the reads.

cbird808 commented 3 years ago

question, do we need the fasta files from the background contaminant genomes, or just the indexes of the fasta files? Disk space limitations on our hpc inspired us to only put the index files in the dir of background contaminant genomes. Given that we obtained the odd results above, I'm wondering if leaving the fastas out of the background dir caused a problem.

StevenWingett commented 3 years ago

Hi, Just saw your message. For the filtering you only need the index files i.e. you don't need the FASTA files used to generate the index files.

Many thanks, Steven