ag-computational-bio / bakrep

GNU General Public License v3.0

Is it possible to give some demo examples? #1

Open Dx-wmc opened 1 month ago

Dx-wmc commented 1 month ago

Hi, could you provide some examples for demonstration? I find the current introduction a bit confusing.

lfenske-93 commented 4 weeks ago

Hi, what kind of examples would you like to see?

In general, this workflow is not necessarily intended to be reproduced. It was used to process the data set for this paper: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001421

And we are currently expanding it to process further data sets within the All The Bacteria project: https://allthebacteria.readthedocs.io/en/latest/

The already processed data from the workflow can currently be found in our web repository, which makes it easy to browse and download the data: https://bakrep.computational.bio/

If you are interested in how the workflow is run or what the input data must look like, I can go into more detail.

Greetings, Linda

Dx-wmc commented 4 weeks ago

Thank you for your patient reply. I would like to see a brief example of a Nextflow run script, including the input metadata and the corresponding results. This would help me a lot in configuring and using the workflow.

lfenske-93 commented 3 weeks ago

Okay sure, I'll try to give a short example.

The Nextflow script we used can be found here: nextflow/661k.nf

The command to process the required data for the project was as follows:

 nextflow run .nextflow/661k.nf -c ./bakrep/nextflow/nextflow.config -profile cluster --samples /shared/new-run/metadata.tsv \
   --setupdir /mnt/scratch/ --data assemblies/ --results results/ -with-conda

An example of what the metadata.tsv looks like can be found in the repository: metadata_ena_661K_filtered_head51.tsv. The setupdir parameter must point to the directory containing the databases used by the different tools. Default paths are stored in nextflow/config.nf.
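As a rough sketch (this is not part of the repository), a small shell check like the one below could confirm that the paths passed via --samples, --setupdir and --data exist before the workflow is launched. The function name `preflight` and its messages are hypothetical; the paths in the usage comment are simply the ones from the command above.

```shell
# Hypothetical pre-flight check (illustrative only, not shipped with bakrep).
# Usage: preflight <samples.tsv> <setupdir> <datadir>
# Prints one line per problem and returns non-zero if anything is missing.
preflight() {
  pf_status=0
  # -s: the samples file must exist and be non-empty
  if [ ! -s "$1" ]; then
    echo "missing or empty samples file: $1"
    pf_status=1
  fi
  # both the database directory and the assembly directory must exist
  for pf_dir in "$2" "$3"; do
    if [ ! -d "$pf_dir" ]; then
      echo "missing directory: $pf_dir"
      pf_status=1
    fi
  done
  return "$pf_status"
}

# Example call with the paths from the command above:
#   preflight /shared/new-run/metadata.tsv /mnt/scratch/ assemblies/ && echo "ready to run"
```

This only checks that the paths exist. It does not validate the contents of metadata.tsv or of the tool databases.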

The input data for the workflow consisted of the assembly FASTA files available at the following link: http://ftp.ebi.ac.uk/pub/databases/ENA2018-bacteria-661k/Assemblies/

For each processed assembly file, the following result files will be generated:

- Assembly statistics: sample.assemblyscan.json
- CheckM2 quality control: sample.checkm2.json
- Bakta annotation: sample.bakta.json, sample.bakta.ffn, sample.bakta.faa, sample.bakta.gbff.gz, sample.bakta.gff3
- Taxonomic classification: sample.gtdbtk.json
- Multilocus sequence typing: sample.mlst.json
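To check whether a run produced all of these files for a given sample, something like the following sketch could be used. It assumes that every file is named `<sample>.<suffix>` and sits directly inside one results directory; the actual directory layout of the workflow output may differ, and the function name `missing_results` is hypothetical.

```shell
# Hypothetical completeness check (illustrative only, not shipped with bakrep).
# Usage: missing_results <results_dir> <sample_id>
# Prints the path of each expected file that is absent and returns
# non-zero if at least one is missing.
# ASSUMPTION: files are named <sample_id>.<suffix> directly in <results_dir>.
missing_results() {
  mr_status=0
  # suffixes taken from the list of result files above
  for mr_suffix in assemblyscan.json checkm2.json \
                   bakta.json bakta.ffn bakta.faa bakta.gbff.gz bakta.gff3 \
                   gtdbtk.json mlst.json; do
    mr_file="$1/$2.$mr_suffix"
    if [ ! -e "$mr_file" ]; then
      echo "$mr_file"
      mr_status=1
    fi
  done
  return "$mr_status"
}

# Example call (sample ID is a placeholder):
#   missing_results results/ SAMPLE_ID || echo "incomplete"
```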

At the moment I am working on an updated version of the workflow to process the latest data from the All The Bacteria project. If you are interested in the project as a whole, you can find current updates and information here: https://allthebacteria.readthedocs.io/en/latest/faq.html