KwanLab / Autometa

Autometa: Automated Extraction of Genomes from Shotgun Metagenomes
https://autometa.readthedocs.io
Other
40 stars 15 forks source link

🍏 Change Nextflow I/O behavior #218

Closed evanroyrees closed 2 years ago

evanroyrees commented 2 years ago

Nextflow output structure now resembles what was discussed in #160. Note, metagenomes are not enumerated for generation of their output directory name, their meta.id is used which is the metagenome.simpleName (groovy method) from the respective input metagenome. This clobbers filenames with multiple .. For example, the meta.id of my.example.metagenome.fasta would be my. Moving forward (looking at #186), altering the input s.t. a sample sheet is provided would use values in this table to check for unique sample IDs to write each sample to its respective sample ID results directory.

evanroyrees commented 2 years ago

Will be merging by EOTD tomorrow unless an issue is raised

chasemc commented 2 years ago

Few things to address, I'll probably edit this with more notes...

evanroyrees commented 2 years ago

Would you suggest hardcoding a default output directory here? Or making this a required parameter for the end user? I think the typical end-user behavior will be to fill in this parameter as they will want to place their analyses in a specific location.

Hardcoding the outdir would be a simple fix, right? Should we go this route?

for example

outdir = nf-output
evanroyrees commented 2 years ago
  • storedir for process PREPARE_LCA will be wherever nextflow is running from, should be variable

I'm not sure what you mean by storeDir being variable. Do you mean something like caching these precomputed dbs to params.outdir?

chasemc commented 2 years ago

Hardcoding the outdir would be a simple fix, right? Should we go this route?

Yeah I think that would be fine.

I'm not sure what you mean by storeDir being variable. Do you mean something like caching these precomputed dbs to params.outdir?

Right now it's hardcoded and so will always be created based on where nextflow is run from Should be a parameter, could maybe default to a folder under params.outdir

As an aside, how is the LCA stuff kept in check with the nr db? If someone downloads a new nr.gz, PREPARE_LCA should be aware right? May have to keep track of file hashes

chasemc commented 2 years ago

Should probably add a check at the start of the pipeline and fast fail if $outdir/whatever isn't empty?

evanroyrees commented 2 years ago

Should probably add a check at the start of the pipeline and fast fail if $outdir/whatever isn't empty?

I'm not convinced this is necessary at the moment. This could either be a job for nextflow or could cause some problems if the end-user is not using nextflow properly. We'll kick the can down the road for now.

evanroyrees commented 2 years ago

I'm not sure what you mean by storeDir being variable. Do you mean something like caching these precomputed dbs to params.outdir?

Right now it's hardcoded and so will always be created based on where nextflow is run from Should be a parameter, could maybe default to a folder under params.outdir

As an aside, how is the LCA stuff kept in check with the nr db? If someone downloads a new nr.gz, PREPARE_LCA should be aware right? May have to keep track of file hashes

I think this is as intended for wherever nextflow is run b/c these LCA dbs can be used across runs of different datasets. If I am recalling correctly, if the ncbi databases change, the cached LCA databases will be regenerated (I think nextflow is already doing some of this file hash tracking behind the scenes here).

chasemc commented 2 years ago

Should probably add a check at the start of the pipeline and fast fail if $outdir/whatever isn't empty?

I'm not convinced this is necessary at the moment. This could either be a job for nextflow or could cause some problems if the end-user is not using nextflow properly. We'll kick the can down the road for now.

IMO- Because this PR removes the provenance (run ID) from the output I think either the pipeline should fail if files already exist or the provenance should be written as an output. This may be a larger issue but this PR does take a step backwards in data provenance

evanroyrees commented 2 years ago

Happy to add that in for the 2.1.0 release πŸ‘