Define folder structures and implement data versioning

shntnu commented 4 years ago

We want to address two issues here:

1. Define a new folder structure for profiling experiments.
2. Identify which of the components will be version controlled.

I will update this comment periodically as the strategy evolves. I realize this is not ideal because it upsets the chronology of discussions.

This is our current folder structure specified in the Profiling Handbook. This differs slightly from the folder structure specified in the Cell Painting Gallery. For this level of nesting (under workspace) the only discrepancy is metadata/platemaps (see #70); consensus and collated are currently missing in the Gallery, but that is not a discrepancy per se.

This is the proposed folder structure in the Profiling Handbook:

├── profiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167_augmented.csv
│           ├── SQ00015167_normalized.csv
│           ├── SQ00015167_normalized_feature_select.csv
│           └── SQ00015167_spherized.csv
├── collated (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized_feature_select.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── consensus (*)
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       └── 2016_04_01_a549_48hr_batch1_spherized.parquet
├── backend
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167.csv
│           └── SQ00015167.sqlite 
├── load_data_csv
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── load_data.csv
│           └── load_data_with_illum.csv
├── log 
├── metadata
│   └── 2016_04_01_a549_48hr_batch1
│       ├── barcode_platemap.csv
│       └── platemap
│           └── C-7161-01-LM6-006.txt
└── pipelines

* collated and consensus files are saved as parquet to allow fast loading.
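
To make the layout above concrete, here is a minimal sketch that builds the expected file paths for a plate or a batch. The function names (`plate_profile_paths`, `batch_level_paths`) and the `workspace` argument are hypothetical and are not part of pycytominer or the handbook; the suffixes are copied from the tree above.

```python
from pathlib import PurePosixPath

# Suffixes as they appear in the tree above.
PROFILE_SUFFIXES = ["augmented", "normalized", "normalized_feature_select", "spherized"]
CONSENSUS_SUFFIXES = ["augmented", "normalized", "spherized"]


def plate_profile_paths(workspace, batch, plate):
    """Per-plate CSV profiles: profiles/<batch>/<plate>/<plate>_<suffix>.csv"""
    base = PurePosixPath(workspace) / "profiles" / batch / plate
    return [base / f"{plate}_{suffix}.csv" for suffix in PROFILE_SUFFIXES]


def batch_level_paths(workspace, folder, batch):
    """Per-batch parquet files for the 'collated' or 'consensus' folder."""
    suffixes = {"collated": PROFILE_SUFFIXES, "consensus": CONSENSUS_SUFFIXES}[folder]
    base = PurePosixPath(workspace) / folder / batch
    return [base / f"{batch}_{suffix}.parquet" for suffix in suffixes]
```

Calling `plate_profile_paths("workspace", "2016_04_01_a549_48hr_batch1", "SQ00015167")` yields the four CSV paths listed under `profiles`. Keeping the suffix lists in one place is a design choice for the sketch only; it makes a naming change a one-line edit.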

We will version these folders by placing them inside the project repo:

folder	generator
profiles	pycytominer
collated	pycytominer
consensus	pycytominer
load_data_csv	pe2loaddata
log	GNU parallel (when running various commands)
metadata	manual
pipelines	manual

We will not version these folders:

folder	generator	reason
backend	cytominer-database
analysis	CellProfiler, Distributed-CellProfiler	redundant with SQLite backend
images	Microscope	Never changes, and too big!
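
The two tables above can be expressed as a small lookup, which may be handy for checking a file before committing it. This is only a sketch: `is_versioned` is a hypothetical helper, and the folder names are copied from the tables as they stand in this comment.

```python
from pathlib import PurePosixPath

# Folder names copied from the two tables above.
VERSIONED = {"profiles", "collated", "consensus", "load_data_csv", "log", "metadata", "pipelines"}
NOT_VERSIONED = {"backend", "analysis", "images"}


def is_versioned(relative_path):
    """Return True or False for known top-level folders, None for unknown ones."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return None
    top = parts[0]
    if top in VERSIONED:
        return True
    if top in NOT_VERSIONED:
        return False
    return None
```

Returning `None` for an unknown folder, rather than guessing, leaves the decision to whoever adds a new top-level folder.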

shntnu commented 4 years ago

I propose we split backend into ~~backend~~ single_cell and profiles.

- single_cell will have only Level 2b, i.e. the SQLite / Parquet file
- profiles will have Level 3 upwards

├── single_cell
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           └── SQ00015167.sqlite 
├── profiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167.csv
│           ├── SQ00015167_augmented.csv
│           ├── SQ00015167_normalized.csv
│           └── SQ00015167_normalized_variable_selected.csv

This is for two reasons:

1. File size: SQLite is ~300 times larger than the rest of the files combined. Keeping this large file in a separate folder structure will make maintaining data easier.
2. Frequency of access: SQLite is not touched as often as the downstream data, at least not so far.

SQLite files should likely not be versioned given the file size. Instead we should store a hyperlink to their location on S3 or some other permanent storage (like Figshare)
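
One way to store such a link is a small pointer file that is versioned in place of the SQLite file. The sketch below is an assumption, not an agreed format: the `write_pointer` helper, the JSON field names, and the idea of recording a checksum are all hypothetical, and the URL passed in would be whatever permanent location is chosen.

```python
import hashlib
import json
from pathlib import Path


def write_pointer(sqlite_path, url, pointer_path):
    """Write a small JSON file recording where the large file lives."""
    sqlite_path = Path(sqlite_path)
    digest = hashlib.sha256()
    # Hash in 1 MiB chunks so a multi-gigabyte file is never fully in memory.
    with sqlite_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    record = {
        "file": sqlite_path.name,
        "url": url,  # permanent location, supplied by the caller
        "size_bytes": sqlite_path.stat().st_size,
        "sha256": digest.hexdigest(),
    }
    Path(pointer_path).write_text(json.dumps(record, indent=2) + "\n")
    return record
```

The pointer file is a few hundred bytes, so it can live in the repo next to the profiles, and the checksum lets anyone confirm that a downloaded copy matches the file the profiles were built from.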

shntnu commented 4 years ago

For CSV files: I'd love to figure out some way of storing metadata along with the file, but I haven't found a satisfactory approach.

- ExpressionSet is useful but very domain-specific and not text-based
- GCT is text-based but again very domain-specific

We could consider something simple, like having a comment in the CSV file that links to the commit used to generate the CSV, e.g.:

## source:https://github.com/foo/bar/tree/223e1f5566fab7d20048bab5b5008bd91c005ef9
col1,col2,col3
1,4,2
5,7,3
…
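
A minimal sketch of writing and reading such a file with the standard library follows. The helper names are hypothetical, and the `## source:` prefix is taken from the example above. The reader skips any line that starts with `#`, so it assumes that no data row (including a continuation line of a quoted multi-line field) begins with that character.

```python
import csv

PREFIX = "## source:"


def write_csv_with_source(handle, source_url, header, rows):
    """Write the provenance comment first, then an ordinary CSV."""
    handle.write(f"{PREFIX}{source_url}\n")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


def read_csv_with_source(handle):
    """Return (source_url or None, list of rows as dicts)."""
    source = None
    data_lines = []
    for line in handle:
        if line.startswith(PREFIX):
            source = line[len(PREFIX):].strip()
        elif not line.startswith("#"):
            data_lines.append(line)
    return source, list(csv.DictReader(data_lines))
```

For readers that do not need the link, pandas `read_csv` accepts `comment="#"`, which drops the comment line, so the file stays loadable with common tools.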


shntnu commented 4 years ago

I propose we split backend into backend and profiles.

backend will have only Level 2b i.e. the SQLite / Parquet file – this was in fact my original intention (and thus the name :D)

profiles will have Level 3 upwards

@gwaygenomics did you see this? Does that work? (they are at the same level)

gwaybio commented 4 years ago

just saw it now - yes it can work. Any thought to renaming backend? If all that lives there is going to be SQLite/Parquet then isn't single_cell_profiles (or just single_cell) better?

shntnu commented 4 years ago

single_cell sounds good to me.

shntnu commented 3 years ago

I added this

├── collated
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       └── 2016_04_01_a549_48hr_batch1_normalized_feature_select.parquet

and dropped all cytotools as a data generator; only pycytominer going forward.

shntnu commented 2 years ago

I dropped batchfiles and audit because we no longer produce these:

├── batchfiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── analysis
│           │   ├── Batch_data.h5
│           │   ├── dcp_config.json
│           │   ├── cp_docker_commands.txt
│           │   └── cpgroups.csv
│           └── illum
│               ├── Batch_data.h5
│               ├── dcp_config.json
│               ├── cp_docker_commands.txt
│               └── cpgroups.csv

├── audit
│   └── 2016_04_01_a549_48hr_batch1
│       ├── C-7161-01-LM6-006_audit.csv
│       └── C-7161-01-LM6-006_audit_detailed.csv

I renamed single_cell to backend because that became the de facto standard via JUMP (although I wish I had gone with single_cell; I lost track of this discussion), and moved SQ00015167.csv from profiles to backend.

cytomining / profiling-handbook

Define folder structures and implement data versioning #54