Empty contaminated.fa and failed in preprocessing:mapToContamFa process

Pathogen-Genomics-Cymru / lodestone

Mycobacterial pipeline

GNU Affero General Public License v3.0


Empty contaminated.fa and failed in preprocessing:mapToContamFa process #17

Closed alantsangmb closed 8 months ago

alantsangmb commented 1 year ago

Hello, I think tb-pipeline would be very useful to our lab. However, the pipeline terminated at the mapToContamFa step. I found that a contam_dir was created in the work directory and that GCF_000001405.39_GRCh38.p13_genomic.fna.gz (~920 MB) was downloaded at the dowloadConta step.

After this step, a contaminants.fa file was created and the contam_dir was removed. However, contaminants.fa is empty and the pipeline terminated.

I am not sure whether the pipeline failed because of the empty contaminants.fa file. Is there any reason why the file is empty? Is it possible to skip the contaminants mapping step?

I also assume the human reads were already processed using bowtie2 against hg19_1kgmaj. If so, why was GCF_000001405.39_GRCh38.p13_genomic.fna.gz downloaded again?

Thanks in advance for any reply.

alantsangmb commented 1 year ago

I am attaching the *species_in_sample.json file in case it gives any hints. Thank you. 22006477_S3_L001_species_in_sample.txt

annacprice commented 1 year ago

Hi Alan,

Many thanks for your interest in our pipeline and reporting the errors you found.

We have also observed the errors you reported with the empty contaminants.fa file and preprocessing:identifyBacterialContaminants incorrectly identifying human reads as a contaminant to remove.

The pipeline is currently undergoing module testing and end-to-end testing, and there are also other bugs we have identified that we will fix.

We intend to release a new version of the pipeline in the next couple of weeks, which will include all the bug fixes mentioned above.

annacprice commented 1 year ago

https://github.com/Pathogen-Genomics-Cymru/tb-pipeline/issues/20
https://github.com/Pathogen-Genomics-Cymru/tb-pipeline/issues/21

WhalleyT commented 1 year ago

Hi @alantsangmb, I've had a similar issue with the contaminants.fa file being empty, and it only seems to happen when running wget on Slurm. Are you running the pipeline locally or on an HPC scheduler? I can change the way the pipeline handles errors in these steps to make it more verbose, but I thought it would be useful to get a handle on how you're running the script.

Cheers, Tom
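
One way to make a failure like this more visible is to check the output file before the mapping step runs. The following is a minimal Python sketch, not the pipeline's actual code; the function name check_fasta_not_empty is hypothetical, and it only assumes that contaminants.fa is a plain-text FASTA file.

```python
from pathlib import Path


def check_fasta_not_empty(path):
    """Return the number of FASTA records, raising a descriptive error if there are none.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    fasta = Path(path)
    if not fasta.is_file():
        raise FileNotFoundError(f"{fasta} was not created")
    if fasta.stat().st_size == 0:
        raise ValueError(
            f"{fasta} exists but is empty (0 bytes); "
            "the upstream download may have failed"
        )
    # Count header lines; each FASTA record starts with '>'.
    with fasta.open() as handle:
        n_records = sum(1 for line in handle if line.startswith(">"))
    if n_records == 0:
        raise ValueError(f"{fasta} contains no FASTA header lines")
    return n_records
```

Calling check_fasta_not_empty("contaminants.fa") before mapping would stop the run with an explicit message instead of letting a later step fail on an empty reference.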

alantsangmb commented 1 year ago

Hi @WhalleyT, Thank you for your reply. I run it locally. The empty file might be due to incomplete download of the rather big GCF_000001405.39_GRCh38.p13_genomic.fna.gz file.
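
If an incomplete download is the cause, it can be detected by decompressing the archive end to end, because a truncated gzip stream fails before its end-of-stream marker is reached. Below is a minimal Python sketch using only the standard library; the function name gzip_is_complete is hypothetical, and this is not how the pipeline itself checks the file.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file decompresses fully and yields at least one byte.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    total = 0
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks so a large genome archive is never held in memory.
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (OSError, EOFError, zlib.error):
        # Truncated or corrupt archives raise one of these while reading.
        return False
    # A 0-byte file reads as empty without raising, so check the byte count too.
    return total > 0
```

Running this on GCF_000001405.39_GRCh38.p13_genomic.fna.gz right after the download would distinguish a partial transfer from a problem in a later step.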

annacprice commented 1 year ago

Hi @alantsangmb, the new 0.9.5 release of the pipeline fixes several bugs and should resolve the issues you're seeing.

alantsangmb commented 1 year ago

Hi @annacprice, thank you so much! I have downloaded version 0.9.5 and tested it on a few samples. The workflow completed successfully.

One more question: what is the default AMR catalogue that the pipeline uses? I ask because vcfpredict:gnomonicus terminated when I used this version of the catalogue (https://github.com/oxfordmmm/tuberculosis_amr_catalogues/blob/public/catalogues/NC_000962.3/NC_000962.3_WHO-UCN-GTB-PCI-2021.7_v1.0_GARC1_RUS.csv) by setting the parameter amr_cat.

annacprice commented 1 year ago

The default catalogue as set by params.amr_cat in the config is NC_000962.3_WHO-UCN-GTB-PCI-2021.7_v1.0_GARC1_RUS.csv

The tuberculosis_amr_catalogues repo is available in the preprocessing container, at the path /tuberculosis_amr_catalogues

We have also been seeing issues with gnomonicus failing for some samples.
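
For anyone who wants to confirm which catalogue file a run would pick up, here is a small Python sketch that builds the expected path inside the container and reports the CSV's columns and row count. It assumes the container copy at /tuberculosis_amr_catalogues mirrors the repository layout seen in the URL above (catalogues/NC_000962.3/...), which is not confirmed in this thread; the function names are hypothetical.

```python
import csv
from pathlib import Path

# Assumption: the container copy follows the repo layout from the URL above.
CATALOGUE_ROOT = Path("/tuberculosis_amr_catalogues")
DEFAULT_CATALOGUE = "NC_000962.3_WHO-UCN-GTB-PCI-2021.7_v1.0_GARC1_RUS.csv"


def catalogue_path(name=DEFAULT_CATALOGUE, reference="NC_000962.3", root=CATALOGUE_ROOT):
    """Build the assumed location of a catalogue CSV inside the container."""
    return Path(root) / "catalogues" / reference / name


def summarise_catalogue(path):
    """Return (column names, number of data rows) for a catalogue CSV."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            raise ValueError(f"{path} is empty")
        n_rows = sum(1 for _ in reader)
    return header, n_rows
```

Inside the container, summarise_catalogue(catalogue_path()) gives a quick check that the file passed via amr_cat exists and is a non-empty CSV before gnomonicus is run.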