DennisSchmitz / Jovian_archive

Metagenomics/viromics pipeline that focuses on automation, user-friendliness and a clear audit trail. Jovian aims to empower classical biologists and wet-lab personnel to do metagenomics/viromics analyses themselves, without bioinformatics expertise.
GNU Affero General Public License v3.0

Jovian crashes when there aren't enough reads matching the specified background organism (default: HuGo) #113

Closed DennisSchmitz closed 4 years ago

DennisSchmitz commented 4 years ago

Dataset: ENNGS "Viral_metagenomics" dataset

The ENNGS data is already cleaned of HuGo reads. Jovian assumes that there are at least some reads of the specified background organism. In this dataset there aren't any, so the required output files are not generated. This results in an unresolvable DAG for Snakemake, which then crashes. Basically, one of the assumptions Jovian makes isn't true.

I have to develop a workaround for it, maybe touching the files?... But that might cause problems in MultiQC. Maybe the background organism workflow should be made optional, like the mgkit LCA method. However, since Jovian is intended for raw and unedited Illumina data, this falls outside the intended use-case, therefore I'm giving it low priority.

DennisSchmitz commented 4 years ago

Update: Also happens on the viral encephalitis dataset.

Same reason.

florianzwagemaker commented 4 years ago

After some extra searching through the files and datasets we've found that this is actually not caused by the lack of "background organism"-reads. Both datasets are able to complete all jobs related to deleting the background-organism reads without issues.

The failure with these particular ENNGS datasets is caused by the size of the datasets combined with the current 'strict' mode of Jovian.

In strict mode, Jovian requires that assembled scaffolds are at least 500 nucleotides long. However, the assembled sequences in these datasets are often shorter than 500 nt, so the filter leaves an empty FASTA output file that is passed on to the downstream processes, which then crash.
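
To illustrate the failure mode described above, here is a minimal sketch of a minimum-length scaffold filter with an explicit guard against an empty result. It is not Jovian's actual code: the function names (`read_fasta`, `filter_scaffolds`, `filter_or_fail`) are hypothetical, and only the 500 nt threshold comes from this thread.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_scaffolds(handle: TextIO, min_len: int = 500) -> Dict[str, str]:
    """Keep only scaffolds of at least min_len nucleotides."""
    return {h: s for h, s in read_fasta(handle) if len(s) >= min_len}


def filter_or_fail(handle: TextIO, min_len: int = 500) -> Dict[str, str]:
    """Hypothetical guard: fail early with a clear message instead of
    handing an empty FASTA file to downstream steps."""
    kept = filter_scaffolds(handle, min_len)
    if not kept:
        raise ValueError(
            f"No scaffolds of at least {min_len} nt; "
            "downstream steps would receive an empty FASTA file."
        )
    return kept


if __name__ == "__main__":
    # Toy input: one scaffold above and one below the 500 nt threshold.
    example = ">long\n" + "A" * 600 + "\n>short\n" + "C" * 100 + "\n"
    print(sorted(filter_scaffolds(io.StringIO(example))))
```

With a dataset where every scaffold is shorter than the threshold, `filter_scaffolds` returns an empty dictionary, which corresponds to the empty FASTA file mentioned above. A check like `filter_or_fail` would surface that condition with an explicit message rather than an unexplained downstream crash.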

Running this dataset in relaxed mode "solves" the issue for this particular dataset.

As noted earlier, this falls outside the intended use case of Jovian.