NBChub / bgcflow

Snakemake workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes)
https://github.com/NBChub/bgcflow/wiki
MIT License

How to use Custom input directory? #349

Open 540889956 opened 1 month ago

matinnuhamunada commented 1 month ago

Hi, thanks for reaching out and using BGCFlow :)

I just wish to know how to use Custom input directory

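To use a custom input directory, you will need to set up two things: the input folder in the project configuration (project_config.yaml), and the custom samples in samples.csv.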

Please find the example here: config.zip

This is what the project structure will look like:

config/
├── Lactobacillus_delbrueckii
│   ├── input_files # your custom input directory
│   │   └── my_custom_genome.gbk
│   ├── gtdbtk.bac120.summary.tsv # an optional GTDB-tk style taxonomic assignment
│   ├── project_config.yaml # project-level configuration; rules set here override the global parameters
│   └── samples.csv
└── config.yaml # global parameter configuration

And this is what the project configuration (project_config.yaml) looks like:

name: Lactobacillus_delbrueckii_custom_input

pep_version: 2.1.0

description: "An example of using custom input files in BGCFlow projects."
input_folder: input_files # This is the folder where the input files are located, relative to this file.
input_type: gbk # This is the default type of input files. It can be gbk or fna. Note that samples from NCBI will default to fna format.
gtdb-tax: gtdbtk.bac120.summary.tsv # you can also provide a custom GTDB-tk output style taxonomy information
sample_table: samples.csv

#### RULE CONFIGURATION ####
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  seqfu: TRUE
...

Finally, you need to add the custom sample to samples.csv:

genome_id,source,organism,genus,species,strain,closest_placement_reference,input_file
GCA_000056065.1,ncbi,,,,,,
GCA_000182835.1,ncbi,,,,,,
GCA_000191165.1,ncbi,,,,,,
GCA_000014405.1,ncbi,,,,,,
strain_01,custom,,,,,,my_custom_genome.gbk

When you run the workflow, you should get this message:

Step 2.1 Getting sample information from: config/Lactobacillus_delbrueckii/project_config.yaml
 - Processing project [config/Lactobacillus_delbrueckii/project_config.yaml]
 - Custom input directory: True
 - Getting input files from: /datadrive/bgcflow/config/Lactobacillus_delbrueckii/input_files
 - Custom input format: True
 - Default input file type: gbk
   - ! WARNING: GCA_000056065.1 is from ncbi. Enforcing format to `fna`.
   - ! WARNING: GCA_000182835.1 is from ncbi. Enforcing format to `fna`.
   - ! WARNING: GCA_000191165.1 is from ncbi. Enforcing format to `fna`.
   - ! WARNING: GCA_000014405.1 is from ncbi. Enforcing format to `fna`.
 - Found user-provided taxonomic information
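Before starting a run, it can help to confirm that every custom sample actually has its file in the input folder. The following is a minimal sketch, not part of BGCFlow: the function name `missing_custom_inputs` is hypothetical, and it assumes samples.csv is comma-separated with the `source` and `input_file` columns shown above.

```python
import csv
import io
import tempfile
from pathlib import Path


def missing_custom_inputs(samples_csv_text, input_dir):
    """Return input_file names of custom samples that are absent from input_dir."""
    reader = csv.DictReader(io.StringIO(samples_csv_text))
    missing = []
    for row in reader:
        # Only rows with source == "custom" are expected to have a local file.
        if (row.get("source") or "").strip() != "custom":
            continue
        name = (row.get("input_file") or "").strip()
        if not name or not (Path(input_dir) / name).is_file():
            missing.append(name or f"<no input_file for {row.get('genome_id')}>")
    return missing


# Illustrative sample table (assumed comma-separated layout).
SAMPLES = """genome_id,source,organism,genus,species,strain,closest_placement_reference,input_file
GCA_000056065.1,ncbi,,,,,,
strain_01,custom,,,,,,my_custom_genome.gbk
"""

with tempfile.TemporaryDirectory() as tmp:
    print(missing_custom_inputs(SAMPLES, tmp))  # the file is not there yet
    (Path(tmp) / "my_custom_genome.gbk").touch()
    print(missing_custom_inputs(SAMPLES, tmp))  # now nothing is missing
```

In a real project you would pass the text of config/Lactobacillus_delbrueckii/samples.csv and the path to its input_files folder instead of the temporary directory.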

Why can this workflow still run after I delete all the input files?

I assume that you didn't change anything in the example template configuration and only deleted the default input files located in data/raw. If so, the project can still run because all the samples are fetched online from NCBI. You can check whether this is the case in samples.csv.
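One quick way to do that check is to tally the `source` column. This is only a sketch under the same assumption of a comma-separated samples.csv; `count_sources` is a hypothetical helper, not a BGCFlow function.

```python
import csv
import io
from collections import Counter


def count_sources(samples_csv_text):
    """Count how many samples come from each source (e.g. ncbi, custom)."""
    reader = csv.DictReader(io.StringIO(samples_csv_text))
    return Counter((row.get("source") or "").strip() for row in reader)


# Illustrative table in which every sample is fetched from NCBI.
EXAMPLE = """genome_id,source,input_file
GCA_000056065.1,ncbi,
GCA_000182835.1,ncbi,
"""

print(dict(count_sources(EXAMPLE)))
```

If the only key is `ncbi`, none of the samples depend on local files, which would explain why the run still works after the files are deleted.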

Thank you again for the question, we will be sure to add this to the FAQ section and improve the WIKI.

matinnuhamunada commented 1 month ago

Hi WJ,

Glad to hear it works :)

The CLI for bgcflow run is just a simple wrapper around the snakemake CLI, so you can always use snakemake directly with any parameter described in the snakemake documentation.

If you prefer to use the bgcflow_wrapper CLI, you can check which parameters are available using the help command, such as:

$ bgcflow run --help
Usage: bgcflow run [OPTIONS]

  A snakemake CLI wrapper to run BGCFlow. Automatically run panoptes.

Options:
  -d, --bgcflow_dir TEXT  Location of BGCFlow directory. (DEFAULT: Current
                          working directory.)
  --workflow TEXT         Select which snakefile to run. Available
                          subworkflows: {BGC | Database | Report | Metabase |
                          lsagbc | ppanggolin}. (DEFAULT: workflow/Snakefile)
  --monitor-off           Turn off Panoptes monitoring workflow. (DEFAULT:
                          False)
  --wms-monitor TEXT      Panoptes address. (DEFAULT: http://127.0.0.1:5000)
  -c, --cores INTEGER     Use at most N CPU cores/jobs in parallel. (DEFAULT:
                          8)
  -n, --dryrun            Test run.
  --unlock                Remove a lock on the snakemake working directory.
  --until TEXT            Runs the pipeline until it reaches the specified
                          rules or files.
  --profile TEXT          Path to a directory containing snakemake profile.
  -t, --touch             Touch output files (mark them up to date without
                          really changing them).
  -h, --help              Show this message and exit.

Note that the current bgcflow_wrapper package is using an older snakemake version and we are currently working on the update.
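To illustrate how the options above could translate into a direct snakemake call, here is a rough sketch. The mapping is an assumption based on the help text, not the wrapper's actual implementation; `snakemake_command` is a hypothetical helper, and `--use-conda` is included on the assumption that you want snakemake to manage the conda environments.

```python
import shlex


def snakemake_command(snakefile="workflow/Snakefile", cores=8, dryrun=False,
                      unlock=False, touch=False, until=None, profile=None):
    """Build a snakemake argument list from bgcflow-run-style options."""
    cmd = ["snakemake", "--snakefile", snakefile, "--cores", str(cores)]
    # Assumption: conda environments are wanted; drop this flag otherwise.
    cmd.append("--use-conda")
    if dryrun:
        cmd.append("-n")
    if unlock:
        cmd.append("--unlock")
    if touch:
        cmd.append("--touch")
    if until:
        cmd += ["--until", until]
    if profile:
        cmd += ["--profile", profile]
    return cmd


# Print a dry-run command to copy into a shell opened in the BGCFlow directory.
print(shlex.join(snakemake_command(cores=4, dryrun=True)))
```

The sketch leaves out the Panoptes monitoring options; any other snakemake flag can be appended to the returned list in the same way.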

Also, there is another question. After bgcflow build report, this code doesn't work:

# use conda or mamba
mamba env create -f bgcflow_notes.yaml # or r_notebook.yaml

I hope this means the command bgcflow build report works and you want to manually edit the notebook templates?

By snakemake convention, the environment files can be found in workflow/envs/<environment name>.yaml. Therefore, you can create the conda environment using mamba env create -f workflow/envs/bgcflow_notes.yaml.

You can also reuse the conda environments built by snakemake. They are stored in the .snakemake/conda folder, and the snakemake log shows which location belongs to which environment file.
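As a rough way to locate such an environment without reading the log, you could search the saved environment files for a package name. This sketch assumes that snakemake keeps a copy of each environment file as `<hash>.yaml` next to the `<hash>` environment directory inside .snakemake/conda; check your own folder, since the layout may differ between snakemake versions. `find_snakemake_envs` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def find_snakemake_envs(conda_dir, package):
    """Return env directories under conda_dir whose saved YAML mentions package."""
    matches = []
    for yaml_file in sorted(Path(conda_dir).glob("*.yaml")):
        env_dir = yaml_file.with_suffix("")  # <hash>.yaml -> <hash>
        if env_dir.is_dir() and package in yaml_file.read_text():
            matches.append(env_dir)
    return matches


# Self-contained demo on a fake layout; "abc123_" stands in for a real hash.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "abc123_").mkdir()
    (root / "abc123_.yaml").write_text("dependencies:\n  - roary\n")
    print([p.name for p in find_snakemake_envs(root, "roary")])
```

On a real project you would call it with `.snakemake/conda` and then activate the returned path with `conda activate <path>`.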

PS: If you find any misleading or wrong instructions in the WIKI, please let us know so we can correct them.

540889956 commented 2 weeks ago

Hi,

Thanks very much for the help.

I ran into another problem when the workflow installs the env for roary, as shown below:

Output: Channels:

LibMambaUnsatisfiableError: Encountered problems while solving:

I can use roary in a separate conda env, but when I export its yaml to replace the yaml in the workflow, it generates new errors. May I ask how to modify the yaml to solve this?

Thanks for the help!

Best Regards, Jay

matinnuhamunada commented 2 weeks ago

Hi Jay,

Unfortunately I cannot reproduce the error for creating the roary environment and the test seems to work fine.

From the message "package r-ggplot2-3.3.6-r42h6115d3f_0 is excluded by strict repo priority", it seems that your conda channel priority is set to strict.

Can you check your conda channel priority and set it to flexible to see if that solves the problem? Detailed instructions are available in the wiki.
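If you want to check the setting from a script, the sketch below reads the output of `conda config --show channel_priority`. The helper names are hypothetical, and the parser assumes the usual `channel_priority: <value>` output line.

```python
import subprocess


def parse_channel_priority(config_output):
    """Extract the channel_priority value from `conda config --show` output."""
    for line in config_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "channel_priority":
            return value.strip()
    return None


def current_channel_priority():
    """Ask conda for the active channel priority (requires conda on PATH)."""
    result = subprocess.run(
        ["conda", "config", "--show", "channel_priority"],
        capture_output=True, text=True, check=True,
    )
    return parse_channel_priority(result.stdout)


# Parsing a sample output line; call current_channel_priority() on a real system.
print(parse_channel_priority("channel_priority: strict\n"))
```

If the value is `strict`, running `conda config --set channel_priority flexible` in a shell switches it.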

After setting the priority to flexible, while running the snakemake jobs, you should see this warning message nagging you about it, which is fine:

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Creating conda environment workflow/envs/roary.yaml...
Downloading and installing remote packages.
Environment for /datadrive_cemist/test/workflow/rules/../envs/roary.yaml created (location: .snakemake/conda/b39a961a250810ddef5ab2698703b6ab_)
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Singularity containers: ignored

We will probably replace roary with a newer pangenome builder. Hopefully there will be support for singularity containers in the future for better reproducibility.