biosustain / dsp_nf-metagenomics

Shotgun metagenomics pipeline to process microbiome samples
GNU General Public License v3.0

Removing host_genome using FASTQ sample from MGnify #2

Open marcoreverenna opened 7 months ago

marcoreverenna commented 7 months ago

The pipeline was run with the following command: nextflow run main.nf -profile az_test -w az://orange -ansi-log false -resume -with-dag dag.png


The error below occurs after removing the host_genome path from the QC process and index_ch from the workflow.

It also occurs after removing --reference-db host_genome from the kneaddata command. The pipeline seemed to be running correctly up to that point; it did not fail right at the start.

Command executed:
kneaddata -i1 ERR1713346_1.fastq.gz -i2 ERR1713346_2.fastq.gz --threads 8 --output . --bypass-trim

mkdir -p kneaddata_logs
mv ERR1713346_1_kneaddata.log kneaddata_logs/

Command error:
ERROR: Unable to write file: /mnt/batch/tasks/workitems/job-101f51bdea810a457fef-QC/job-1/nf-02b3c0b0a2d436eb29a216b10ec57dd0/wd/reformatted_identifierskxgtnfmc_decompressed_7533av6e_ERR1713346_1

apalleja commented 7 months ago

Hi Marco,

I think the problem is caused by the space in the reads header: @ERR1713338.1 J00138:63:HCNWCBBXX:1:1101:3772:1103/1
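One way to check whether a FASTQ file is affected is to count the header lines that contain a space. The snippet below is only a sketch: the function name count_spaced_headers is made up for illustration, and it assumes plain four-line FASTQ records (no wrapped sequence lines) read from standard input.

```shell
# Hypothetical helper: counts FASTQ header lines that contain a space.
# Assumes strict 4-line records, so headers are lines 1, 5, 9, ...
count_spaced_headers() {
    awk 'NR % 4 == 1 && / / { n++ } END { print n + 0 }'
}
```

For a gzipped sample such as the one in the report, the input could be streamed with gzip -dc ERR1713346_1.fastq.gz | count_spaced_headers. A non-zero count would be consistent with the explanation above, although it does not prove that the spaces are the cause of the write error.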

This may prevent the identifiers from being reformatted correctly. A quick fix is to replace the space with an underscore in the header, e.g. sed 's/ /_/g' ERR1713338_1.fastq > ERR1713338fixed_1.fastq
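The sed command rewrites spaces on every line of the file. A slightly narrower variant touches only the header lines; the sketch below assumes strict four-line FASTQ records, and the function name fix_fastq_headers is hypothetical, not part of the pipeline or of kneaddata.

```shell
# Hypothetical helper: replaces spaces with underscores in FASTQ headers only.
# Assumes strict 4-line records, so headers are lines 1, 5, 9, ...
# Sequence, separator, and quality lines are passed through unchanged.
fix_fastq_headers() {
    awk 'NR % 4 == 1 { gsub(/ /, "_") } { print }'
}
```

Because the pipeline inputs are gzipped, the function could be used in a stream, for example gzip -dc ERR1713346_1.fastq.gz | fix_fastq_headers | gzip > ERR1713346_1.fixed.fastq.gz, where the output file name is only an example. The trailing /1 or /2 in the header is left in place, since only spaces are replaced.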

A longer-term solution might be to create a module that substitutes the space, or to replace KneadData with other software that gives us more flexibility and lets us separate the tasks (adapter removal, trimming, host removal, ...). Thinking about ...