suggestions for initial release

m-jahn commented 2 months ago

The pipeline looks really good, everything is well-documented! I tested it after a fresh download and I had some problems to install the conda envs. Particularly the nanoplot and the cutesv env failed to install with conda (not mamba!). But this is probably a problem of conda which is just terribly slow and inefficient. With micromamba, the envs seem to build fine.

Here are some minor suggestions/ fixes for the pipeline that you may include or not:

URLs for model files for clair3 are hardcoded in common.smk. I would suggest to move the URLs to the config file. This is the line I'm talking about:

def download_model_for_clair3(wildcards):
    path2model = {
        "r1041_e82_400bps_sup_v4.2.0": "https://cdn.oxfordnanoportal.com/software/analysis/models/clair3/r1041_e82_400bps_sup_v420.tar.gz",
        "r1041_e82_400bps_sup_v4.3.0": "https://cdn.oxfordnanoportal.com/software/analysis/models/clair3/r1041_e82_400bps_sup_v430.tar.gz"
    }

Better make an entry in the yaml that is like:

clair3:
  threads: 10
  model: "https://cdn.oxfordnanoportal.com/software/analysis/models/clair3/r1041_e82_400bps_sup_v420.tar.gz"
  params:
    - "--include_all_ctgs"

Why are there two models indicated? Are both needed? If not remove the 2nd one or call it "fallback". Maybe mention the option to import different models in the README (just one extra line).

Running snakefmt ./ on the workflow makes a couple of formatting changes in the snakemake rules. However the automatic test on github seems to run fine with snakefmt. I don't see how there can be a discrepancy, other than versions that changed? I never had that before, so whatever failed locally should also fail the github CI test. Particularly one warning that snakefmt gives could be addressed:

[WARNING] In file "workflow/rules/quality_control.smk":  Keyword "shell" at line 76 has comments under a value.
        PEP8 recommends block comments appear before what they describe
(see https://www.python.org/dev/peps/pep-0008/#id30)

Looking at the most recent GH actions log for formatting, it says:

2024-05-05 15:16:30 [WARN]   No files were found in the GITHUB_WORKSPACE:[/github/workspace] to lint!

The workflow expects pandas is installed, i.e. it is not only required in the envs, but already at execution of the wf together with snakemake. I had this issue before and it failed the automatic test on github from what I remember. Anyway, not much you can do to workaround this except reading in the csv samplesheet with base python. It seems to pass the tests so just leave it as it is.
The GH actions workflow for linting seems to run on the top level dir. I think directory should also be .test instead of .

m-jahn commented 2 months ago

With mamba as package manager the pipeline finished without errors in 10 minutes. Very good output, nothing to complain really! You could consider removing the all-temp flag in the README code execution part, because that removes most of the ouput.

tfwulff commented 1 month ago

Thanks for your feedback! The following suggestions have been addressed:

Workflow rules have been formatted according to the latest snakefmt version (v0.10.2).
pandas is a dependency of snakemake (see here) and should be installed when creating the environment. Nevertheless, the installation instructions have been updated to explicitly install pandas.
The directory for linting has been adjusted to .test.
The --all-temp flag has been removed in the README.

In the future, the following suggestions will be addressed:

Use of mamba instead of conda
Transfer of model URLs for clair3

MPUSP / snakemake-ont-bacterial-variants

suggestions for initial release #1