Structured Output Folders

NBISweden / Earth-Biogenome-Project-pilot

Assembly and Annotation workflows for analysing data in the Earth Biogenome Project pilot project.

https://www.earthbiogenome.org/

GNU General Public License v3.0

9 stars 8 forks source link

Structured Output Folders #42

Open mahesh-panchal opened 9 months ago

mahesh-panchal commented 9 months ago

Aim

To make data findable, and rigid folder structure of output is needed that clearly describes content, version, and logical grouping.

Desired Features

Minimal folders to navigate.
A partition between results (data we want to keep) for public archiving, and other results that that don't need to be publicly archived.
Some kind of clear versioning method so we can keep older results, but know they're not the latest, and conversely some way to mark more recent runs which will not be used for further analyses.
A versioning method that makes it clear which former folders a tool run refers to (e.g. busco was run on build 3 of hifi-asm haplotypes).

Decisions

HenrikLantz commented 9 months ago

There are several folders that I doubt we will use. Not a big thing perhaps, but especially when hunting for files, the lower number of folders I need to look in, the better. Some that come to mind are: Assembly - canu, lja, verkko; raw data - Illumina - 10x; rawdata - bionano; data - Illumina - 10x; data - bionano. A connected question is, what do we do when there are new assemblers or types of data? Just add folders to the template?

mahesh-panchal commented 9 months ago

Well, thankfully folder structure is just a small part of the configuration file. When there's a new tool, we just add another path to where we want it stored regardless of how it's processed.

Normally the way I implement things is to have the data we want stored in a "results" folder in the backed up part of the project dir. The intermediate files are kept in the Nextflow work directory. However, the raw data and anything not processed by the workflow is often in a separate directory in the project directory. i.e. Usually it looks like this:

/proj/<project>
  | - analyses
  |  | - dated analysis
  |  \ - dated analysis 
  | - data
  |  | - raw data
  |  \ - meta data ( practically, other data that's not reads )
  | - docs
  | - results ( workflow outputs to save longer term - copied not symlinked normally)
  \ - workflow
/proj/<project>/nobackup/
  \ - nextflow work directory ( all intermediate files produced by all the workflow )

One thing I think might help is to have a "for_public_archival" folder, and a folder for the rest in the root of the results. Practically, it could just be a folder with symlinks into the rest folder so we have a logical folder structure within the rest folder, rather than having two places to look for results.

mahesh-panchal commented 9 months ago

Do we have a logical ordering on the folders we want output so we can use number prefixes?

E.g.

/proj/<project>/results/
  | - 01-Read_quality
  | - 02-Read_processing
  | - 03-Assembly
  | - ...

Or would we prefer something like

/proj/<project>/results/
  | - 01-FastQC
  | - 01-Fastk_DB
  | - 02-FastP
  | - 03-HiCanu
  | - 03-Verkko
  | - 03-HiFiAsm
  | - ...

Or other ideas?

mahesh-panchal commented 9 months ago

There's also a suggestion from the report here: https://github.com/NBISweden/assembly-project-template#directory-structure-specification

/
|-- assembly
|   |-- hifiasm
|   |-- hicanu
|   |-- canu
|   |-- lja
|   |-- ipa
|   |-- verkko
|   `-- flye
|-- rawdata
|   |-- illumina 
|       |-- 10x
|       |-- hic
|       |-- shotgun
|   |-- pacbio
|       |-- hifi
|       |-- isoseq
|       `-- lofi
|   |-- bionano
|   `-- ont
|-- data
|   |-- illumina 
|       |-- 10x
|       |-- hic
|       |-- shotgun
|   |-- pacbio
|       |-- hifi
|       |-- isoseq
|       `-- lofi
|   |-- bionano
|   `-- ont
|-- status
|-- reports
|-- scripts
|   `--pacbio_stats
`-- QC
    `-- pacbio
        `-- lofi
            |-- coverage
            `-- read_stats

Although I'm a little unclear where this goes.

MartinPippel commented 9 months ago

Hi would prefer this structure, to decrease the number of potential folders at the top level

/proj/<project>/results/
  | - 01-Read_quality
  | - 02-Read_processing
  | - 03-Assembly
  | - ...

sometimes multiple read processing steps (different QV-, length-cutoffs, or decontamination) is required, so I suggest to have a 02-Read_processing/FREEZE or similar folder name - to make it clear - which one was chosen for the assembly. A link to the data might be sufficient.

Similar for the assemblies.

Should we place the purge_dups steps at the same top level directory:
```
| - 04-Purging
| - 05-ErrorPolishing 
| - 06-ContaminationScreening
| - 07-Scaffolding 
| - 08-ManualCuration 
```
Do we include a separate directory for the mito assembly?

mahesh-panchal commented 9 months ago

I agree with having frozen tag to mark the folders we want to use. This is going to have to be a manual process though using symlinks to folders with versioning scheme. e.g.

03-Assembly/
  | - Hifiasm-haps-v1.0
  | - Hifiasm-haps-v2.0
  | - Hifiasm-haps-v3.0
  | - IPA-v1.0
  \ - Frozen-Assembly -> Hifiasm-haps-v2.0  # This symlink might be better off in a data directory outside of results.

We're also going to need a build version schema for when we run multiple settings. The input file supports a general string, but we're going to need a schema for this for when it's automated from tool settings and follow that manually too. We may need a argset id that corresponds to input tool settings such this can be used as a reference in the build.

Aside from the assemblies, what else will need a build schema?

Do we include a separate directory for the mito assembly?

Yes, we should have a separate directory for organelles. Perhaps 09-Organelles with subdirectories for mitochondria, chloroplast, plasmid, etc if they exist.

Also here is some background on Nextflow and implementing this structure.

Whenever Nextflow makes a run, it assigns the run a unique name, and computations are performed in uniquely id'ed subfolders inside a work directory. Nextflow implements a "process directive" known as publishDir that controls where outputs of a process should be placed once the process has finished running. This means that one can put certain files in another part of the system, rather than being stored in an unidentifiable working directory.

example configuration:

process {
    withName: 'HIFIASM' {
        publishDir = [
            [
                path: { "${params.outdir}/03-Assembly/HifiAsm-${meta.build_id}" }, // <- uses a closure to access task specific inputs
                mode: params.publish_mode
                // There are other options too like `enabled: <boolean>` that can use a conditional determining when to use this path
            ],
            [   // We can also publish files to multiple folders
                path: { "${params.outdir}/public_archival/03-Assembly/HifiAsm-${meta.build_id}" },
                mode: params.publish_mode
            ]
        ]
    }
}

The workflow input file so far looks like this, but this is up for discussion.

sample:
  id: 'HSapiens_test'
  kmer_size: 21
  ploidy: 2
assembly:
  - id: 'HS_phased_diploid'
    pri_fasta: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.fasta'
    alt_fasta: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome2.fasta'
  - id: 'HS_consensus'
    pri_fasta: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome2.fasta'
hic:
  - read1: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_1.fastq.gz'
    read2: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_2.fastq.gz'
hifi:
  - reads: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/bam/test.paired_end.sorted.bam'
rnaseq:
  - read1: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_1.fastq.gz'
    read2: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_2.fastq.gz'
isoseq:
  - reads: 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/bam/test.paired_end.sorted.bam'

In the input above, other than sample the rest of the fields are optional. What needs to be implemented is adding tool configurations, perhaps like so:

tools:
    hifiasm:
        - id: 'argset01'
          args: '--opts X --opts Y'
        - id: 'argset02'
          args: '--opts A --opts B'

Some more musings from this.

Perhaps having a data directory is sufficient as a "for public archival" folder. The data directory was intended to be solely for workflow inputs.

Should the workflow only be run in stages, i.e. having an analyses folder like

analyses/
    | - 01-YYMMDD-Read_quality           # Seems straightforward - would like to run a set of tools with given settings
    | - 02-YYMMDD-Read_processing_c01    # This might be worth having different dirs for
    | - 02-YYMMDD-Read_processing_c02    # Might want to save settings from c01 and try a different order of contamination filtering in c02
    | - 03-YYMMDD-Assembly               # Adding tool settings here can be simply added to a file.

Build ID though should include reference to a change in tool processing path beforehand though, i.e. does the input data come from read_processing_c01, or read_processing_c02.
Changes to tool settings can be handled by adding a new record to the input file.

If we have options for different read processing protocols, each protocol would effectively need to be it's own subworkflow. This could then be part of the yaml input rather than marked as an separate analysis folder (i.e. this works well with version control - version control is not so useful when you want multiple versions simultaneously visible which is what the separate folders achieve).

suggestion:

stage:
    - read_processing_protocol: 
        - 'subsample_filter-vs-reference'
            tools:
                - toolA: '--max_coverage 40x'
                - toolB: '--reference asm_v2'
        - 'filter-vs-reference_subsample'
                - toolA: '--max_coverage 50x'
                - toolB: '--reference asm_v1.1'

Feedback loops are not well implemented in Nextflow yet to allow custom tool chaining like this.

mahesh-panchal commented 9 months ago

More musings on workflow usage scenarios that will affect the outcome of folder structures.

Use cases:

Run assembly from start to end using reads only.
Run assembly altering tool chain ( i.e. select a different workflow option ), e.g., change how reads are processed.
Run assembly adding runs of tools with alternate parameters ( i.e. updating git tracked parameter file ). e.g., run purge dups with different cutoff settings.
Run assembly from midway with selected assembly.

Suggestion in working protocol:

Case 1. is catered for by default with full analysis folder. Results are put in results.

Case 2. is potentially complex. This scenario means workflow tool chains need to be hard coded as distinct subworkflows and requires an option to select them. Need output to reflect there is a difference in tool chain. Can make as element of Build ID. Workflow needs to have option to select which stages to execute and which path to take within that stage.

Case 3. parameter sets can be encoded by a number with a look up table.

Case 4. Assembly symlinked to data (i.e. input data folder). Requires new analysis folder to keep original analysis path visible. Parameter file selects stage to resume from and parameter sets to continue with. Where does the starting part of the build ID come from, i.e. the string generated by previous analysis runs? If buildID is part of filename then should be ok.

Workflow needs to print a table of what the short-hand build id translates to. E.g.

Build ID suggestion: rp1.01.hifiasm-haps01.pd01.ep01.cs01.sc01.rc01 which means

rp1.01: read processing subworkflow 1, parameter set 01.
hifiasm-haps01: assembler = hifiasm, haps = haplotypes 1 and 2 output / cons=consensus, parameter set 01.
pd01: purging parameter set 01.
ep01: error polishing parameter set 01.
cs01: contamination screen parameter set 01.
rc01: rapid curation parameter set 01.

Use a versioning system where major number corresponds to a subworkflow, minor number corresponds to a parameter set. Make ID easier to read by using a short section code.

LucileSol commented 9 months ago

What I think is needed for the annotation step is :

report of the assembly (genome and plasmid)
report of the sequencing (genome+RNAseq+isoseq)
to know what was delivered to the PI and where it was delivered (genome, plasmid...) (Maybe more in the report issue)
folder/fasta file of the genome assembly
folder/fasta file of mito/plasmid if there are any
folder for RNAseq data
folder for isoseq data

For the RNAseq data we need to know which tissue/stage/... correspond to which files and what kit was used (Maybe more in the report issue).

stephan-nylinder commented 9 months ago

There is a certain sense in trying to stick close to the predefined folder structure of NGI facilities for data deliveries (which may also differ somewhat across facilities). IMHO it might be a bit too cluttered to be useful here. Would be nice to maintain some core structure to make transitions smooth without too much risk of information loss. In general I would like separate folders similar to previously suggested:

But with the proposal of adding an extra folder for intermediate files (or use analyses as such), separating true raw from processed data, and separating raw data into subfolders by data types. An additional folder for documentation and relevant reports would also be good to keep close by. A simple structure easy to navigate. Folder for raw data host subfolders for individual sample data. Trying to keep folder depth low is a priority, though not always achievable.

mahesh-panchal commented 9 months ago

There is a certain sense in trying to stick close to the predefined folder structure of NGI facilities for data deliveries (which may also differ somewhat across facilities).

What's the folder structure you would like replicated from NGI's folder structure? Which of those files they deliver are important to you? I would like to implement something to take their delivery and put key files in a data folder. One model I was considering is making the data folder like:

/data/
  | - deliveries        ( compressed archive of the delivery, i.e. nothing changed, just compressed as tar.bz2 or something ) 
  | - raw-data          ( the read data we need in subfolders named after the platform and library type  e.g. PacBio-Revio-WGS )
  | - finalized         ( generally symlinks back to the subfolders in /results folder )
    | - processed-reads ( symlinks to filtered, decontaminated read folder under /results )
    | - assembly 
        | - pre-duplication-purge ( symlink to specific assembly folder in /results, e.g. /results/assembly/hifiasm-build-version/ )
        | - post-deduplication-purge 
        | - post-scaffolding
        | - post-curation
    \ - Report ( symlink to report folder in results )

The /data/finalized/ would be selected data that go on for further processing, leaving all the exploratory work in /results ( so data/finalized for selected data, results for all parameter explorations but only files of interest, and nobackup/work/ for everything including auxiliary files of no interest). My initial thought would be that /data would be the folder you could use to retrieve the data for public archiving. The finalized folder may grow depending on our analysis needs though, so there's a question if it becomes too cluttered to use also use for public archiving.

But with the proposal of adding an extra folder for intermediate files (or use analyses as such), separating true raw from processed data,

The analyses folders are only launch scripts in my view. Does the finalized folder cover what you mean by intermediates? Are there certain key terms that would help in navigation?

and separating raw data into subfolders by data types.

Could you clarify your thought here please?

An additional folder for documentation and relevant reports would also be good to keep close by.

Which documents and reports do you need? Are there certain documents you would like in the same folder as certain files?

A simple structure easy to navigate. Folder for raw data host subfolders for individual sample data. Trying to keep folder depth low is a priority, though not always achievable.

I need more details here, particularly if there are desirables because of earlier problems encountered.

stephan-nylinder commented 9 months ago

There is a certain sense in trying to stick close to the predefined folder structure of NGI facilities for data deliveries (which may also differ somewhat across facilities).

What's the folder structure you would like replicated from NGI's folder structure? Which of those files they deliver are important to you?

I need to take this question back to the DM group for a it more discussion. I believe you suggestion with a /data/deliveries folder might suffice though some data will be duplicated between /deliveries and /raw-data, which is not a major problem but increases storage load.

I would like to implement something to take their delivery and put key files in a data folder. One model I was considering is making the data folder like:

/data/
  | - deliveries        ( compressed archive of the delivery, i.e. nothing changed, just compressed as tar.bz2 or something ) 
  | - raw-data          ( the read data we need in subfolders named after the platform and library type  e.g. PacBio-Revio-WGS )
  | - finalized         ( generally symlinks back to the subfolders in /results folder )
    | - processed-reads ( symlinks to filtered, decontaminated read folder under /results )
    | - assembly 
        | - pre-duplication-purge ( symlink to specific assembly folder in /results, e.g. /results/assembly/hifiasm-build-version/ )
        | - post-deduplication-purge 
        | - post-scaffolding
        | - post-curation
    \ - Report ( symlink to report folder in results )
The /data/finalized/ would be selected data that go on for further processing, leaving all the exploratory work in /results ( so data/finalized for selected data, results for all parameter explorations but only files of interest, and nobackup/work/ for everything including auxiliary files of no interest). My initial thought would be that /data would be the folder you could use to retrieve the data for public archiving. The finalized folder may grow depending on our analysis needs though, so there's a question if it becomes too cluttered to use also use for public archiving.

A bit concerned about the use and meaning of the term "finalized". I would prefer a non-endpoint-indicating term like "selected", "intermediate", or similar.

But with the proposal of adding an extra folder for intermediate files (or use analyses as such), separating true raw from processed data,

The analyses folders are only launch scripts in my view. Does the finalized folder cover what you mean by intermediates? Are there certain key terms that would help in navigation?

See above. Any solution where I as DS can easily see that the contents are for you Bioinformaticians only. Raw data being easily findable, and repository objects like assemblies also, though not necessarily in the same folder.

and separating raw data into subfolders by data types.

Could you clarify your thought here please?

Raw data will in most be more than one datatype, e.g. PacBio, HiC, rna-seq etc. Lumping everything together in a single folder is a bit blunt. Subdivision into raw-data/HiC, raw-data/PacBio etc makes sense.

An additional folder for documentation and relevant reports would also be good to keep close by.

Which documents and reports do you need? Are there certain documents you would like in the same folder as certain files?

I would like e.g. the facility report being stored with the data for us to have close at hand. It can help quite a bit when submitting data. Any kind of data associated document or documentation goes here.

A simple structure easy to navigate. Folder for raw data host subfolders for individual sample data. Trying to keep folder depth low is a priority, though not always achievable.

I need more details here, particularly if there are desirables because of earlier problems encountered.

I have seen cases where data is kept in folder systems where there are one or more single folder levels. For example, data/PacBio/example/files/file1.fastq where the example folder only contains the files folder. Cutting down on path lenghts will allow easier overview. Makes sense?

mahesh-panchal commented 9 months ago

I believe you suggestion with a /data/deliveries folder might suffice though some data will be duplicated between /deliveries and /raw-data, which is not a major problem but increases storage load.

I thought about this a bit more and I agree. I don't see taring the delivery folders to bring much benefit. This means that raw-data could instead contain symlinks to files in the delivery folders maintained by a script thereby making a nicer folder structure to access data without increasing the storage load too much.

A bit concerned about the use and meaning of the term "finalized". I would prefer a non-endpoint-indicating term like "selected", "intermediate", or similar.

For us, these would be end-points of particular stages of the assembly process. "Selected" would also work, but "intermediate" would be confusing. "Frozen" or similar was also suggested above.

I think also renaming results to results-intermediate and putting it in the data directory would better identify it's purpose.

/data
  | - deliveries (moved from INBOX and made read-only)
  | - raw-data (subfolders named after library-type, symlinked to files in deliveries, managed by a script for data integrity)
  | - results-intermediate (workflow/script outputs from exploratory phase)
  \ - finalized (symlinked folders to results-intermediate marking assembly phase end-points)

Raw data will in most be more than one datatype, e.g. PacBio, HiC, rna-seq etc. Lumping everything together in a single folder is a bit blunt. Subdivision into raw-data/HiC, raw-data/PacBio etc makes sense.

I think we're on the same page here. I proposed that the raw data would be in subfolders (e.g., PacBio-Revio-WGS) so we know what library types we have, but without all the confusing meaningless (to us) internal codes.

I have seen cases where data is kept in folder systems where there are one or more single folder levels. For example, data/PacBio/example/files/file1.fastq where the example folder only contains the files folder. Cutting down on path lenghts will allow easier overview. Makes sense?

Yep. I dislike unnecessary navigation too.

aersoares81 commented 9 months ago

Build ID suggestion: rp1.01.hifiasm-haps01.pd01.ep01.cs01.sc01.rc01 which means

rp1.01: read processing subworkflow 1, parameter set 01.

hifiasm-haps01: assembler = hifiasm, haps = haplotypes 1 and 2 output / cons=consensus, parameter set 01.

pd01: purging parameter set 01.

ep01: error polishing parameter set 01.

cs01: contamination screen parameter set 01.

rc01: rapid curation parameter set 01.

Use a versioning system where major number corresponds to a subworkflow, minor number corresponds to a parameter set. Make ID easier to read by using a short section code.

I've been thinking about this, and I'm afraid that including all possible options in the build ID might make it hard to read and find the differences between them. I was thinking maybe defining the first set of options as a single version and then attaching only the alternative parameter in the build. In this case instead of rp1.01.hifiasm-haps01.pd01.ep01.cs01.sc01.rc01 it would be something like build1 for this first version of parameters set (default set?), and then let's say you use a different set of purging parameters, this new build would be build1.pd02, indicating it's default with that one change.

mahesh-panchal commented 9 months ago

I've been thinking about this, and I'm afraid that including all possible options in the build ID might make it hard to read and find the differences between them. I was thinking maybe defining the first set of options as a single version and then attaching only the alternative parameter in the build. In this case instead of rp1.01.hifiasm-haps01.pd01.ep01.cs01.sc01.rc01 it would be something like build1 for this first version of parameters set (default set?), and then let's say you use a different set of purging parameters, this new build would be build1.pd02, indicating it's default with that one change.

Whichever way we choose, we're going to need a lookup table (and make sure that's front and center on the appropriate page in the report). I like the shorthand of using "presets", but then one can't tell at a glance which stage a file is from. For example, a prescaffolding stage assembly shouldn't have the some of the end parts. On the other hand we will be following a strict informative folder structure for where that file is stored, so having it in the name is potentially redundant.

aersoares81 commented 9 months ago

Maybe we could stablish that once it reaches a certain stage it gets named buil1 to simplify at the end to make it easier for humans to read. If you're going to try a new parameter set for one of the tools it will happen after you've seen the end result of the first run (unless you stop it prematurely?), no? So the preset idea could work as a way to make it easier. I guess I'm mostly afraid that we add more tools and soon we'll have a build/file name with ten little appendages to the name, making it really confusing.

mahesh-panchal commented 9 months ago

The important point really is what stage it comes from, so we only actually need the last part of that build string proposition right? After the initial run, we will likely choose which parts should be frozen ( linked in finalized or whatever name we choose ), and continue from a frozen input. Then the name only needs to record what stage it's at right?

gbdias commented 9 months ago

Imo files to be submitted to public databases should always be symlinked to a "files_to_submit" folder with the necessary/available metadata.
We would basically fill the ENA sample/study/reads template tsv files with all info including file paths and md5sum. This way the submitter only has to go to this one place and the rest of the folder structure doesn't matter to them.

mahesh-panchal commented 9 months ago

Imo files to be submitted to public databases should always be symlinked to a "files_to_submit" folder with the necessary/available metadata.

We would basically fill the ENA sample/study/reads template tsv files with all info including file paths and md5sum. This way the submitter only has to go to this one place and the rest of the folder structure doesn't matter to them.

Initially this is what I was thinking too, but I think for now, we should just get started with a structure, and let it evolve. I'm thinking at this stage that we now record the how we do things on the assembly template gh-pages, and we can update the closing project protocol with actual steps when we get to it ( which is where I see Stephan doing the assembly upload - although the read upload could be an earlier step ).

MartinPippel commented 8 months ago

In which folder should we put the assemblies that we did not do - e.g. from NGI or NCBI?

data/
├── deliveries
│   ├── pt_141 ...
│   ├── pt_153 ...
├── frozen
├── outputs
├── raw-data
│   ├── PacBio-HiFi-ISOSEQ
│   ├── PacBio-HiFi-WGS
├── processed-data
│   ├── NGI_ASM
│   ├── NCBI_ACCESSION_X
└── README.md

mahesh-panchal commented 8 months ago

I would say data/raw-data/{public,NGI}-assemblies With a download script for public-assemblies case