IKIM-Essen / uncovar

Transparent and robust SARS-CoV-2 variant calling and lineage assignment with comprehensive reporting.
https://ikim-essen.github.io/uncovar
BSD 2-Clause "Simplified" License
21 stars 8 forks source link

fix: move primer file location under workflow; this avoids file not being found in shared filesystems #665

Closed huzuner closed 2 weeks ago

huzuner commented 2 months ago

Description

Before this change, I suspect that the primer file cannot be found by the file system storage in the cluster. This leads to an error during module import of Uncovar pipeline. This PR moves the file under workflow/resources and fixes the file path in the config.yaml.

Related Issue

Checklist

huzuner commented 2 months ago

I see that I mistakenly used the primer for the test. However I can't find the SARS-CoV-2-artic-v4_1.primer.bed under resources. Is this already known or am I missing something?

huzuner commented 1 month ago

The latest commit to the master branch now uses nCoV-2019.primer.bed also in the main config. Then, this PR should be ready to go.

huzuner commented 1 month ago

I would rather put all the rules that require the primer bed in an if that checks whether the primers are defined in the config file. If it is missing, one can assume normal shotgun sequencing, so that no primers are needed anyway.

I initiated something above and checked the workflow for a while now and I'm a bit lost at the point where upstream of get_samtools_sort_input requires the primer file. Apparently, it is depending on stages "initial" and "hardclipped" and I don't have much of a clue how to add an if for an unclipped fasta there. Another file that is depending on the primer is results/{date}/corrected/{sample}/{sample}.correctedReads.clip.fasta and it is used in the function get_reads_by_stage() and there are additional stages as well. I am wondering whether I understand correctly or if there is a simpler way of configuring the whole workflow to include a check for the primer.

huzuner commented 1 month ago

Also, I find it a bit difficult to see how that will fix the file not found error when executing the workflow on the cluster. When using non-shotgun sequencing samples, if the primer file is not being found, it will assume shotgun sequencing, why would we want to allow that? or am I missing something here?

johanneskoester commented 1 month ago

Right, not as easy as I thought. The info whether a sample is amplicon or not is denoted in the sample sheet column is_amplicon_data.

I think it should be fine to just keep the original approach that takes the bed file from the config, and just fix this here:

rule bed_to_bedpe:
    input:
        check_bed_for_URL(config["preprocessing"]["amplicon-primers"]),

Basically, I would move this check into the script of this rule instead. Otherwise, the pipeline is also not platform independent because users might want to pull stuff e.g. from S3. I believe, then one does not even need to change anything else. Just, when running the pipeline as a module, one has to specify the bed file in the correct location and cannot rely on it to be delivered with the pipeline. The latter could be circumvented by allowing to specify a URL for the bed file instead of a file path. However, this is only needed if one really has amplicon data. For normal WGS, it should just run because the path denoted in the config file will never be checked. If it is still checked somewhere else, we can fix that.

huzuner commented 4 weeks ago

Then, I hope my latest changes upon your suggestion fixes the issue.

johanneskoester commented 3 weeks ago

I think this fix is needed: https://github.com/IKIM-Essen/uncovar/pull/668

huzuner commented 2 weeks ago

I just disabled the formatting as the fork has problems checking out the branch.