cytomining / profiling-handbook

Image-based Profiling Handbook
https://cytomining.github.io/profiling-handbook/
Creative Commons Zero v1.0 Universal

Define folder structures and implement data versioning #54

Open shntnu opened 4 years ago

shntnu commented 4 years ago

We want to address two issues here:

  1. Define a new folder structure for profiling experiments.
  2. Identify which of the components will be version controlled.

I will update this comment periodically as the strategy evolves. I realize this is not ideal because it upsets the chronology of discussions.

Our current folder structure is specified in the Profiling Handbook. It differs slightly from the folder structure specified in the Cell Painting Gallery. At this level of nesting (under workspace), the only discrepancy is metadata/platemaps (see #70); consensus and collated are currently missing from the Gallery, but that is not a discrepancy per se.

This is the proposed folder structure in the Profiling Handbook:

├── profiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167_augmented.csv
│           ├── SQ00015167_normalized.csv
│           ├── SQ00015167_normalized_feature_select.csv
│           └── SQ00015167_spherized.csv
├── collated (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized_feature_select.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── consensus (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── backend
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167.csv
│           └── SQ00015167.sqlite 
├── load_data_csv
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── load_data.csv
│           └── load_data_with_illum.csv
├── log 
├── metadata
│   └── 2016_04_01_a549_48hr_batch1
│       ├── barcode_platemap.csv
│       └── platemap
│           └── C-7161-01-LM6-006.txt
└── pipelines

* collated and consensus files are saved as parquet to allow fast loading.
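
The naming pattern in the tree above can be sketched as a pair of path builders. This is a minimal sketch, not part of pycytominer: the function names and the `workspace` argument are hypothetical, and only the stage suffixes and folder names are taken from the tree.

```python
from pathlib import PurePosixPath

# Stage suffixes copied from the tree above.
PROFILE_STAGES = ("augmented", "normalized", "normalized_feature_select", "spherized")


def plate_profile_paths(workspace, batch, plate):
    """Return the per-plate CSV paths under profiles/<batch>/<plate>/ (hypothetical helper)."""
    plate_dir = PurePosixPath(workspace) / "profiles" / batch / plate
    return [plate_dir / f"{plate}_{stage}.csv" for stage in PROFILE_STAGES]


def collated_paths(workspace, batch):
    """Return the per-batch parquet paths under collated/<batch>/ (hypothetical helper)."""
    batch_dir = PurePosixPath(workspace) / "collated" / batch
    return [batch_dir / f"{batch}_{stage}.parquet" for stage in PROFILE_STAGES]


if __name__ == "__main__":
    for path in plate_profile_paths("workspace", "2016_04_01_a549_48hr_batch1", "SQ00015167"):
        print(path)
```

The consensus folder in the tree omits the normalized_feature_select stage, so a builder for it would need its own stage list.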

We will version these folders by placing them inside the project repo:

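| folder        | generator                                  |
|---------------|--------------------------------------------|
| profiles      | pycytominer                                |
| collated      | pycytominer                                |
| consensus     | pycytominer                                |
| load_data_csv | pe2loaddata                                |
| log           | GNU parallel (when running various commands) |
| metadata      | manual                                     |
| pipelines     | manual                                     |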

We will not version these folders:

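| folder   | generator                              | reason                        |
|----------|----------------------------------------|-------------------------------|
| backend  | cytominer-database                     |                               |
| analysis | CellProfiler, Distributed-CellProfiler | redundant with SQLite backend |
| images   | Microscope                             | Never changes, and too big!   |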
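
The split between versioned and unversioned folders can be expressed as a small lookup. This is a minimal sketch under the assumption that paths are given relative to the workspace; the function name `is_versioned` is hypothetical, and the folder names are copied from the two tables above.

```python
from pathlib import PurePosixPath

# Folder names copied from the two tables above.
VERSIONED = {
    "profiles", "collated", "consensus",
    "load_data_csv", "log", "metadata", "pipelines",
}
NOT_VERSIONED = {"backend", "analysis", "images"}


def is_versioned(relative_path):
    """Return True or False for a known top-level folder, None for an unknown one.

    `relative_path` is assumed to be relative to the workspace root.
    """
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return None
    top = parts[0]
    if top in VERSIONED:
        return True
    if top in NOT_VERSIONED:
        return False
    return None


if __name__ == "__main__":
    print(is_versioned("metadata/2016_04_01_a549_48hr_batch1/barcode_platemap.csv"))
    print(is_versioned("backend/2016_04_01_a549_48hr_batch1/SQ00015167/SQ00015167.sqlite"))
```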

shntnu commented 4 years ago

I propose we split backend into single_cell and profiles.

├── single_cell
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           └── SQ00015167.sqlite 
├── profiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167.csv
│           ├── SQ00015167_augmented.csv
│           ├── SQ00015167_normalized.csv
│           └── SQ00015167_normalized_variable_selected.csv

This is for two reasons

SQLite files should likely not be versioned, given their size. Instead, we should store a hyperlink to their location on S3 or some other permanent storage (like Figshare).
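
One way to store such a link is a small pointer file committed in place of the SQLite file. The sketch below is an assumption, not an agreed convention: the JSON field names, the function names, and the example S3 URL are all hypothetical. It records the remote location together with the file size and a SHA-256 checksum, so a later download can be verified.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large SQLite files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(sqlite_path, remote_url, pointer_path):
    """Write a JSON pointer file (hypothetical format) describing the remote copy."""
    sqlite_path = Path(sqlite_path)
    record = {
        "file": sqlite_path.name,
        "url": remote_url,  # e.g. an S3 or Figshare link, supplied by the caller
        "size_bytes": sqlite_path.stat().st_size,
        "sha256": sha256_of(sqlite_path),
    }
    Path(pointer_path).write_text(json.dumps(record, indent=2) + "\n")
    return record
```

A call might look like `write_pointer("SQ00015167.sqlite", "s3://example-bucket/SQ00015167.sqlite", "SQ00015167.sqlite.json")`, where the bucket name is a placeholder. Only the pointer file would then be committed.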

shntnu commented 4 years ago

For CSV files: I'd love to figure out some way of storing metadata along with the file, but I haven't found a satisfactory approach.

We could consider something simple, like adding a comment line to the CSV file that links to the commit used to generate it, e.g.:

## source:https://github.com/foo/bar/tree/223e1f5566fab7d20048bab5b5008bd91c005ef9
col1,col2,col3
1,4,2
5,7,3
…
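
A minimal sketch of how such a comment line could be written and read back, using only the Python standard library. The `## source:` prefix comes from the example above; the function names are hypothetical, and the reader assumes the comment, if present, is the first line of the file.

```python
import csv
import io

SOURCE_PREFIX = "## source:"


def write_csv_with_source(handle, source_url, header, rows):
    """Write the provenance comment, then the header and the data rows."""
    handle.write(f"{SOURCE_PREFIX}{source_url}\n")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


def read_csv_with_source(handle):
    """Return (source_url or None, list of row dicts)."""
    first = handle.readline()
    if first.startswith(SOURCE_PREFIX):
        source = first[len(SOURCE_PREFIX):].strip()
        reader = csv.DictReader(handle)
    else:
        source = None
        # No comment line: the first line was the header, so put it back.
        reader = csv.DictReader(io.StringIO(first + handle.read()))
    return source, list(reader)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv_with_source(
        buffer,
        "https://github.com/foo/bar/tree/223e1f5566fab7d20048bab5b5008bd91c005ef9",
        ["col1", "col2", "col3"],
        [[1, 4, 2], [5, 7, 3]],
    )
    buffer.seek(0)
    print(read_csv_with_source(buffer))
```

A drawback of this approach is that every reader has to know about the comment line. pandas can skip it with `read_csv(..., comment="#")`, but many other CSV tools would treat it as the header row.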


shntnu commented 4 years ago

I propose we split backend into backend and profiles.

  • backend will have only Level 2b i.e. the SQLite / Parquet file – this was in fact my original intention (and thus the name :D)
  • profiles will have Level 3 upwards

@gwaygenomics did you see this? Does that work? (they are at the same level)

gwaybio commented 4 years ago

just saw it now - yes it can work. Any thought to renaming backend? If all that lives there is going to be SQLite/Parquet then isn't single_cell_profiles (or just single_cell) better?

shntnu commented 4 years ago

single_cell sounds good to me.

shntnu commented 3 years ago

I added this

├── collated
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       └── 2016_04_01_a549_48hr_batch1_normalized_feature_select.parquet

and dropped all cytotools as a data generator; only pycytominer going forward.

shntnu commented 2 years ago

I dropped batchfiles and audit because we no longer produce these:

├── batchfiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── analysis
│           │   ├── Batch_data.h5
│           │   ├── dcp_config.json
│           │   ├── cp_docker_commands.txt
│           │   └── cpgroups.csv
│           └── illum
│               ├── Batch_data.h5
│               ├── dcp_config.json
│               ├── cp_docker_commands.txt
│               └── cpgroups.csv
├── audit
│   └── 2016_04_01_a549_48hr_batch1
│       ├── C-7161-01-LM6-006_audit.csv
│       └── C-7161-01-LM6-006_audit_detailed.csv

I renamed single_cell to backend because that became the de facto standard via JUMP (although I wish I had gone with single_cell; I lost track of this discussion), and moved SQ00015167.csv to backend (from profiles).