Open shntnu opened 4 years ago
I propose we split backend into single_cell and profiles.
single_cell will have only Level 2b, i.e. the SQLite / Parquet file
profiles will have Level 3 upwards
├── single_cell
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           └── SQ00015167.sqlite
├── profiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── SQ00015167.csv
│           ├── SQ00015167_augmented.csv
│           ├── SQ00015167_normalized.csv
│           └── SQ00015167_normalized_variable_selected.csv
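For illustration, here is a minimal sketch of how paths under the proposed layout could be derived from a batch and plate name. The helper names (`single_cell_path`, `profile_paths`) and the `PROFILE_SUFFIXES` tuple are hypothetical and not part of any existing tool; the folder and file names are taken from the tree above.

```python
from pathlib import PurePosixPath

# Suffixes of the per-plate CSV files shown in the tree above (Level 3 upwards).
PROFILE_SUFFIXES = ("", "_augmented", "_normalized", "_normalized_variable_selected")


def single_cell_path(batch, plate):
    """Hypothetical helper: location of the Level 2b SQLite file for a plate."""
    return PurePosixPath("single_cell", batch, plate, f"{plate}.sqlite")


def profile_paths(batch, plate):
    """Hypothetical helper: locations of the per-plate profile CSV files."""
    plate_dir = PurePosixPath("profiles", batch, plate)
    return [plate_dir / f"{plate}{suffix}.csv" for suffix in PROFILE_SUFFIXES]


batch = "2016_04_01_a549_48hr_batch1"
plate = "SQ00015167"
sqlite_file = single_cell_path(batch, plate)
csv_files = profile_paths(batch, plate)
```

Both helpers return paths relative to the folder that contains single_cell and profiles, so they work the same whether that folder lives in a repo or on S3.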
This is for two reasons:
SQLite files should likely not be versioned given the file size. Instead, we should store a hyperlink to their location on S3 or some other permanent storage (like Figshare).
For CSV files: I'd love to figure out some way of storing metadata along with the file, but I haven't found a satisfactory approach.
We could consider something simple like having a comment in the CSV file that links to the commit that was used to generate the CSV, e.g.:
## source:https://github.com/foo/bar/tree/223e1f5566fab7d20048bab5b5008bd91c005ef9
col1,col2,col3
1,4,2
5,7,3
…
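A rough sketch of how such a comment line could be written and read back, using only the Python standard library. The function names and the `SOURCE_PREFIX` constant are hypothetical. The sketch assumes that no data line starts with `#`, which would not hold for quoted fields containing embedded newlines.

```python
import csv
import io

# Assumed convention from the example above: a leading "## source:<url>" line.
SOURCE_PREFIX = "## source:"


def write_csv_with_source(handle, source_url, header, rows):
    """Write the source comment line first, then ordinary CSV content."""
    handle.write(f"{SOURCE_PREFIX}{source_url}\n")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


def read_csv_with_source(handle):
    """Return (source_url, rows); source_url is None if no comment is present."""
    source_url = None
    data_lines = []
    for line in handle:
        if line.startswith(SOURCE_PREFIX):
            source_url = line[len(SOURCE_PREFIX):].strip()
        elif line.startswith("#"):
            continue  # skip any other comment lines
        else:
            data_lines.append(line)
    rows = list(csv.DictReader(data_lines))
    return source_url, rows


buffer = io.StringIO()
write_csv_with_source(
    buffer,
    "https://github.com/foo/bar/tree/223e1f5566fab7d20048bab5b5008bd91c005ef9",
    ["col1", "col2", "col3"],
    [[1, 4, 2], [5, 7, 3]],
)
buffer.seek(0)
source, rows = read_csv_with_source(buffer)
```

One trade-off with this approach is that CSV readers unaware of the comment line would treat it as the header row, so every consumer would need to skip or parse it explicitly.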
Other relevant links
I propose we split backend into backend and profiles.
backend will have only Level 2b, i.e. the SQLite / Parquet file – this was in fact my original intention (and thus the name :D)
profiles will have Level 3 upwards
@gwaygenomics did you see this? Does that work? (they are at the same level)
just saw it now - yes it can work. Any thought to renaming backend? If all that lives there is going to be SQLite/Parquet, then isn't single_cell_profiles (or just single_cell) better?
single_cell sounds good to me.
I added this
├── collated
│   └── 2016_04_01_a549_48hr_batch1
│       ├── 2016_04_01_a549_48hr_batch1_augmented.parquet
│       ├── 2016_04_01_a549_48hr_batch1_normalized.parquet
│       └── 2016_04_01_a549_48hr_batch1_normalized_feature_select.parquet
and dropped all cytotools as a data generator; only pycytominer going forward.
I dropped batchfiles and audit because we no longer produce these:
├── batchfiles
│   └── 2016_04_01_a549_48hr_batch1
│       └── SQ00015167
│           ├── analysis
│           │   ├── Batch_data.h5
│           │   ├── dcp_config.json
│           │   ├── cp_docker_commands.txt
│           │   └── cpgroups.csv
│           └── illum
│               ├── Batch_data.h5
│               ├── dcp_config.json
│               ├── cp_docker_commands.txt
│               └── cpgroups.csv
├── audit
│   └── 2016_04_01_a549_48hr_batch1
│       ├── C-7161-01-LM6-006_audit.csv
│       └── C-7161-01-LM6-006_audit_detailed.csv
I renamed single_cell to backend because that became the de facto standard via JUMP (although I wish I had gone with single_cell; I lost track of this discussion), and moved SQ00015167.csv to backend (from profiles).
We want to address two issues here
I will update this comment periodically as the strategy evolves. I realize this is not ideal because it upsets the chronology of discussions.
This is our current folder structure specified in the Profiling Handbook. This differs slightly from the folder structure specified in the Cell Painting Gallery. For this level of nesting (under workspace) the only discrepancy is metadata/platemaps (see #70); consensus and collated are currently missing in the Gallery, but that is not a discrepancy per se.
This is the proposed folder structure in the Profiling Handbook:
* collated and consensus files are saved as parquet to allow fast loading.
We will version these folders by placing them inside the project repo:
We will not version these folders: