Microbial-Ecology-Group / AMRplusplus

AMR++ is a bioinformatic pipeline meant to aid in the analysis of raw sequencing reads to characterize the profile of antimicrobial resistance genes, or resistome.
https://www.meglab.org/
GNU General Public License v3.0
28 stars 12 forks source link

Silent bug in host removal functions #48

Open patriciatran opened 2 days ago

patriciatran commented 2 days ago

Foremost, thank you for making this pipeline!

I noticed unexpected silent bug when using a custom --host for host removal. In summary, even if I provided --host (or --reference) parameters, nextflow/amr++ was not reporting errors, but also not using the correct host for removal. Additionally, it appears that providing a --host_index is mandatory (not optional) when using the host removal functions. As a result, this might mean that the user thinks the pipeline is removing their custom host, but in reality isn't. This can impact interpretation of the data.

I included some reproducible examples below to explain.

I did find a solution (see below), but I just wanted to let you know of this potential hidden bug and provide some suggestions & contributions:

My suggestions are to:

1) Consolidate the documents to remove discrepencies in how-to use the program (e.g. --host vs usage of --reference)

2) Clarify the use of --host_index either in the params.config file or in the documentation (example here)

3) Modify the code to use --host_index and --index to properly read in and use the null index parameter?

I will send a pull request for my suggestion 1. I am unsure how to address 2 (see what I tried below) I attempted to look into 3 but didn't solve it yet.

Please let me know if I can contribute/test things for this codebase.

Best,

Patricia

--

All these tests are performed using the test data provided in AMRplusplus/data Nextflow: N E X T F L O W \~ version 23.10.1 The test genomes I used are the default chr21 in /data/host and the bovine and chicken reference genome from NCBI.

I first tested how using --host or --reference influenced the result, to resolve the discrepency in the doc linked above.

Control:

nextflow run main_AMR++.nf -profile conda  --pipeline rm_host --reads "data/raw/*_R{1,2}.fastq.gz" --host "data/host/chr21.fasta.gz" --output "rm_host_default"

Removing bovine genome host using --host

nextflow run main_AMR++.nf -profile conda --pipeline rm_host --reads "data/raw/*_R{1,2}.fastq.gz" --host "GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz" --output "rm_host_using--host"

Removing bovine genome host using --reference

nextflow run main_AMR++.nf -profile conda --pipeline rm_host --reads "data/raw/*_R{1,2}.fastq.gz" --reference "GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz" --output "rm_host_using--reference"

Results

All worked without printing any errors, but the outcomes are unexpected: all report the same number of mapped & unmapped reads.

## Default ## 
Sample  NumberOfInputReads  Mapped  Unmapped
S2_test 40007   1288    38719
S1_test 60008   1496    58512
S3_test 60021   2607    57414

## Using --host ##
Sample  NumberOfInputReads  Mapped  Unmapped
S2_test 40007   1288    38719
S1_test 60008   1496    58512
S3_test 60021   2607    57414

## Using --reference ##
Sample  NumberOfInputReads  Mapped  Unmapped
S2_test 40007   1288    38719
S1_test 60008   1496    58512
S3_test 60021   2607    57414

Second strategy: changing the host

To test whether this was caused somehow by the genome of choice, I tested using the chicken genome as a host.

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/315/GCF_000002315.5_GRCg6a/GCF_000002315.5_GRCg6a_genomic.fna.gz
nextflow run main_AMR++.nf -profile conda  --pipeline rm_host --reads "data/raw/*_R{1,2}.fastq.gz" --host "GCF_000002315.5_GRCg6a_genomic.fna.gz" --output "rm_host_using--host-chicken"
nextflow run main_AMR++.nf -profile conda  --pipeline rm_host --reads "data/raw/*_R{1,2}.fastq.gz" --reference "GCF_000002315.5_GRCg6a_genomic.fna.gz" --output "rm_host_using--reference-chicken"

Results

Once again the same values are reported. More importantly, the same number of reads mapped + unmapped are reported as when we used the chr21 and the bovine genome, which is unexpected. No errors are reported by Nextflow/AMR++.

## chicken using --reference ##
Sample  NumberOfInputReads  Mapped  Unmapped
S2_test 40007   1288    38719
S1_test 60008   1496    58512
S3_test 60021   2607    57414

## chicken using --host ##
Sample  NumberOfInputReads  Mapped  Unmapped
S2_test 40007   1288    38719
S1_test 60008   1496    58512
S3_test 60021   2607    57414

Third test:

To test whether this was due to the --pipeline rm_host function, I decided to run --pipeline standard_AMR instead.

nextflow run main_AMR++.nf -profile conda --pipeline standard_AMR --reads "data/raw/*_R{1,2}.fastq.gz" --host "data/host/chr21.fasta.gz" --output "standard_all_default"
nextflow run main_AMR++.nf -profile conda --pipeline standard_AMR --reads "data/raw/*_R{1,2}.fastq.gz" --host "GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz" --output "standard_bovine_host"
nextflow run main_AMR++.nf -profile conda --pipeline standard_AMR --reads "data/raw/*_R{1,2}.fastq.gz" --reference "GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz" --output "standard_bovine_reference"
nextflow run main_AMR++.nf -profile conda --pipeline standard_AMR --reads "data/raw/*_R{1,2}.fastq.gz" --host "GCF_000002315.5_GRCg6a_genomic.fna.gz" --output "standard_chicken_host"
nextflow run main_AMR++.nf -profile conda --pipeline standard_AMR --reads "data/raw/*_R{1,2}.fastq.gz" --reference "GCF_000002315.5_GRCg6a_genomic.fna.gz" --output "standard_chicken_reference"

Results

Unfortunately, we get the same numbers:

cat standard_all*/Results/Stats/host.removal.stats
Sample  NumberOfInputReads  Mapped  Unmapped
S2_test 27045   1258    25787
S1_test 34082   1446    32636
S3_test 59577   2600    56977
Sample  NumberOfInputReads  Mapped  Unmapped
S2_test 27045   1258    25787
S1_test 34082   1446    32636
S3_test 59577   2600    56977
Sample  NumberOfInputReads  Mapped  Unmapped
S2_test 27045   1258    25787
S1_test 34082   1446    32636
S3_test 59577   2600    56977
Sample  NumberOfInputReads  Mapped  Unmapped
S2_test 27045   1258    25787
S1_test 34082   1446    32636
S3_test 59577   2600    56977
Sample  NumberOfInputReads  Mapped  Unmapped
S2_test 27045   1258    25787
S1_test 34082   1446    32636
S3_test 59577   2600    56977

Fourth test:

In my next series of tests, I'm straight up deleting chr21.fasta from the /data folder and seeing if the command works. Nextflow is now properly reporting an error, and tells me that's it's expecting some index and it's not working! Upon further inspection, I found in theparams.config code these lines telling us to also use the --host_index genome, however this is not reported in the usage.md and configuration.md files.

To put that in practice, I tested:

--host_index null
--host_index NULL
--host_index 'null'
--host_index ""

However, none were giving me the expected results. Nextflow couldn't read the null parameter properly and AMR++ was telling me I needed an index file (so the loop about if param = null --> bwa index doesn't automatically go into effect)

Alternative solution

I don't know if I can/have time to fully troubleshoot the code at the moment, but if I do I will send a pull request. For now, I am running bwa prior to running the pipeline, and simply providing BOTH custom --host_index and --host parameters.

conda activate bwa
# if not done already: conda install -c bioconda bwa 
bwa index /staging/ptran5/GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz
conda deactivate

Nextflow amr++ command using custom host but with default sample (S1, S2 and S3) data:

nextflow run main_AMR++.nf -profile conda --pipeline standard_AMR --host_index "/staging/ptran5/GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz*" --reads "data/raw/*_R{1,2}.fastq.gz" --host "/staging/ptran5/GCF_002263795.3_ARS-UCD2.0_genomic.fna.gz" --output "standard_bovine_reference_with_noindex"

Successful results:

(base) $ head rm_host_using--host_staging/Results/Stats/host.removal.stats
Sample  NumberOfInputReads  Mapped  Unmapped
S2_test 27044   20757   6287
S1_test 34081   21576   12505
S3_test 59571   43946   15625