Closed: evanroyrees closed this 2 years ago
Will be merging by EOTD tomorrow unless an issue is raised
Few things to address, I'll probably edit this with more notes...
- `storeDir` for process `PREPARE_LCA` will be wherever nextflow is running from; should be variable
- because `outdir` defaults to `baseDir` and the output directory name is dynamic, there is no way to gitignore the output directory; we would have to individually gitignore all output file patterns, which seems a bit dangerous: https://github.com/KwanLab/Autometa/blob/0cba818136f921289a961008e29e81c50f9d86b0/nextflow.config#L40
Would you suggest hardcoding a default output directory here, or making this a required parameter for the end user? I think the typical end-user behavior will be to fill in this parameter, as they will want to place their analyses in a specific location.

Hardcoding the `outdir` would be a simple fix, right? Should we go this route? For example: `outdir = "nf-output"`
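In `nextflow.config`, that hardcoded default might look like the following sketch (the `nf-output` value comes from the example above; the rest is illustrative, not taken from the PR):

```groovy
// Sketch: hardcoded default output directory in nextflow.config.
// End users can still override it on the command line with --outdir.
params {
    outdir = "nf-output"
}
```

Since `params` are overridable from the CLI, hardcoding a default would give a gitignorable, predictable directory without forcing every user to pass `--outdir`.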
> storedir for process PREPARE_LCA will be wherever nextflow is running from, should be variable

I'm not sure what you mean by `storeDir` being variable. Do you mean something like caching these precomputed dbs to `params.outdir`?
> Hardcoding the outdir would be a simple fix, right? Should we go this route?

Yeah, I think that would be fine.
> I'm not sure what you mean by storeDir being variable. Do you mean something like caching these precomputed dbs to params.outdir?

Right now it's hardcoded, so it will always be created based on where nextflow is run from. It should be a parameter, and could maybe default to a folder under `params.outdir`.

As an aside, how is the LCA stuff kept in check with the nr db? If someone downloads a new nr.gz, `PREPARE_LCA` should be aware, right? May have to keep track of file hashes.
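A parameterized `storeDir` might look like the following sketch (the `lca_cache` parameter name and the trimmed `PREPARE_LCA` body are illustrative, not taken from the PR):

```groovy
// nextflow.config -- hypothetical parameter defaulting under params.outdir
params {
    outdir    = "nf-output"
    lca_cache = "${params.outdir}/lca"  // users can point this anywhere
}

// The process then reads storeDir from the parameter instead of a
// hardcoded relative path, so the cache location no longer depends on
// where nextflow was launched from.
process PREPARE_LCA {
    storeDir params.lca_cache

    // inputs/outputs/script omitted for brevity
}
```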
Should probably add a check at the start of the pipeline and fast fail if `$outdir/whatever` isn't empty?
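A sketch of such a guard, placed at the top of the main script (the directory checked is a placeholder, mirroring the `$outdir/whatever` wording above):

```groovy
// Hypothetical fast-fail guard: abort before any work is scheduled
// if the output directory already contains files.
def outDir = file(params.outdir)
if (outDir.exists() && outDir.list()) {
    exit 1, "Refusing to run: ${params.outdir} is not empty"
}
```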
> Should probably add a check at the start of the pipeline and fast fail if $outdir/whatever isn't empty?

I'm not convinced this is necessary at the moment. This could either be a job for nextflow, or it could cause some problems if the end user is not using nextflow properly. We'll kick the can down the road for now.
> Right now it's hardcoded and so will always be created based on where nextflow is run from. Should be a parameter, could maybe default to a folder under params.outdir
>
> As an aside, how is the LCA stuff kept in check with the nr db? If someone downloads a new nr.gz, PREPARE_LCA should be aware right? May have to keep track of file hashes
I think this is as intended for wherever nextflow is run, b/c these LCA dbs can be used across runs of different datasets. If I am recalling correctly, if the NCBI databases change, the cached LCA databases will be regenerated (I think nextflow is already doing some of this file-hash tracking behind the scenes here).
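If explicit tracking ever turns out to be needed, one hedged option (not in this PR) is to fold a checksum of `nr.gz` into the cache path, so a new `nr.gz` lands in a fresh directory and `PREPARE_LCA` recomputes instead of reusing a stale cache. The helper name and layout below are hypothetical:

```groovy
import java.security.MessageDigest

// Hypothetical helper: hex MD5 of a file, used below to key the cache dir.
def fileMd5(f) {
    def md = MessageDigest.getInstance('MD5')
    new File(f.toString()).eachByte(8192) { buf, n -> md.update(buf, 0, n) }
    md.digest().collect { String.format('%02x', it) }.join()
}

// A new nr.gz yields a different checksum, hence a different storeDir.
process PREPARE_LCA {
    input:
    path nr_gz

    storeDir "${params.outdir}/lca/${fileMd5(nr_gz)}"

    // outputs/script omitted for brevity
}
```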
> Should probably add a check at the start of the pipeline and fast fail if $outdir/whatever isn't empty?
>
> I'm not convinced this is necessary at the moment. This could either be a job for nextflow or could cause some problems if the end-user is not using nextflow properly. We'll kick the can down the road for now.
IMO, because this PR removes the provenance (run ID) from the output, I think either the pipeline should fail if files already exist, or the provenance should be written as an output. This may be a larger issue, but this PR does take a step backwards in data provenance.
Happy to add that in for the 2.1.0 release!
Nextflow output structure now resembles what was discussed in #160. Note, metagenomes are not enumerated for generation of their output directory name; their `meta.id` is used, which is the `metagenome.simpleName` (Groovy method) from the respective input metagenome. This clobbers filenames with multiple `.` characters. For example, the `meta.id` of `my.example.metagenome.fasta` would be `my`.
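The clobbering described above can be seen directly with Nextflow's file helpers (a small sketch; `baseName` is shown only for contrast):

```groovy
// In a Nextflow script: simpleName strips everything after the first dot,
// so multi-dot metagenome filenames collapse to their first component.
metagenome = file('my.example.metagenome.fasta')
println metagenome.simpleName // my
println metagenome.baseName   // my.example.metagenome
```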
Moving forward (looking at #186), altering the input s.t. a sample sheet is provided would use values in this table to check for unique sample IDs to write each sample to its respective sample-ID results directory.

Remaining items:

- `storeDir` for fetching mock_data genomes
- `hmmsearch` nf processes still need to be fixed
- `nextflow_schema.json` / `nextflow.config` nf-core settings to silence linter warnings