Closed murraycutforth closed 8 months ago
I've started work on this. I realised that the identity spreadsheet (listed above) has multiple rows for each (plate, row, column) location, since each mutant can have multiple mutations, so it might actually make sense to store this information as separate tables in a database (otherwise you end up with redundant repetitions of the data). I've set up the code as it stands to write out separate CSV files, as well as a single SQLite database containing two tables. Open to suggestions on how we structure it though!
Adrien sent us an updated version of the spreadsheet with the relevant columns sorted towards the front.
We also discussed what to do with the duplicates in our last meeting, I believe, but I'll verify once I get the chance to review my notes.
The spreadsheet link above and the columns are from Adrien's email on Dec 21st, which I think is the latest update I've seen. I'm pretty sure there are other columns we mentioned at the time which I've forgotten, so it would be good to see your notes on the columns!
Columns for the data table:
Thanks, super helpful! Glad someone was taking notes.
As I understand it, in rare cases more than one gene will have been hit, so perhaps we need to have something like this (to handle up to 3 different gene insertions for a given mutant)? And since we have 6 growth conditions for each mutant, we can't strictly index only on mutant ID. A combo of mutant ID, growth condition, and date (in case we need to do repeats) could be a good way to uniquely identify rows?
Gene_1 | Feature_1 | Confidence_1 | Gene_2 | Feature_2 | Confidence_2 | Gene_3 | Feature_3 | Confidence_3 |
---|---|---|---|---|---|---|---|---|
value_1_1 | value_1_2 | value_1_3 | value_2_1 | value_2_2 | value_2_3 | value_3_1 | value_3_2 | value_3_3 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
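A minimal sketch of the composite key suggested above (mutant ID, growth condition, date), using Python's built-in `sqlite3`. The table name, column names, and all values are placeholders made up for illustration; they are not taken from the real spreadsheet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: the three key columns together identify a row.
conn.execute(
    """
    CREATE TABLE measurements (
        mutant_id        TEXT NOT NULL,
        growth_condition TEXT NOT NULL,
        measured_on      TEXT NOT NULL,  -- ISO date string
        fv_fm            REAL,
        PRIMARY KEY (mutant_id, growth_condition, measured_on)
    )
    """
)

# Placeholder rows: same mutant under two conditions, plus a later repeat.
rows = [
    ("mutant_A", "condition_1", "2024-01-15", 0.61),
    ("mutant_A", "condition_2", "2024-01-15", 0.58),
    ("mutant_A", "condition_1", "2024-02-01", 0.63),
]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?)", rows)


def insert_is_rejected(row):
    """Return True if the row collides with an existing composite key."""
    try:
        conn.execute("INSERT INTO measurements VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False


# Same mutant, condition, and date as the first row: should be rejected.
duplicate_rejected = insert_is_rejected(
    ("mutant_A", "condition_1", "2024-01-15", 0.99)
)
n_rows = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
```

With this key, a repeat on a different date is accepted, while a second row for the same mutant, condition, and date raises an integrity error instead of silently duplicating data.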
I would also propose keeping gene descriptions in a separate table from the experiment data? Some of them are very long, which significantly increases the size of the table. And presumably we won't want to use them in most analyses.
Okay after playing around with this, I've settled on four separate tables, which eliminates repetition and keeps things neat:
- image_features: contains features extracted from the images, such as Fv/Fm, Y2, NPQ, along with experimental information such as temperatures
- identity: contains information about the identity of each mutant, such as well location, plate number, etc.
- mutations: contains information about the mutations in each mutant, such as disrupted gene name, type, confidence level, etc.
- gene_descriptions: contains lengthy descriptions of each gene
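For reference, the four-table layout could look roughly like the sketch below. Only the table names come from the list above; every column name, type, and value is an assumption chosen for illustration, and the actual code may differ.

```python
import sqlite3

# Hypothetical columns; only the four table names come from the thread.
SCHEMA = """
CREATE TABLE identity (
    well_id   TEXT PRIMARY KEY,
    mutant_id TEXT NOT NULL,
    plate     INTEGER NOT NULL
);
CREATE TABLE gene_descriptions (
    gene        TEXT PRIMARY KEY,
    description TEXT
);
CREATE TABLE mutations (
    mutant_id  TEXT NOT NULL,
    gene       TEXT NOT NULL REFERENCES gene_descriptions(gene),
    feature    TEXT,
    confidence TEXT
);
CREATE TABLE image_features (
    well_id     TEXT NOT NULL REFERENCES identity(well_id),
    fv_fm       REAL,
    y2          REAL,
    npq         REAL,
    temperature REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder data: one well whose mutant carries two mutations.
conn.execute("INSERT INTO identity VALUES ('1-A01', 'mutant_A', 1)")
conn.executemany(
    "INSERT INTO gene_descriptions VALUES (?, ?)",
    [("gene_X", "a long description"), ("gene_Y", "another long description")],
)
conn.executemany(
    "INSERT INTO mutations VALUES (?, ?, ?, ?)",
    [("mutant_A", "gene_X", "feature_1", "conf_1"),
     ("mutant_A", "gene_Y", "feature_2", "conf_2")],
)
conn.execute("INSERT INTO image_features VALUES ('1-A01', 0.6, 0.3, 1.2, 25.0)")

# Join only when an analysis needs mutation info next to image features.
joined = conn.execute(
    """
    SELECT f.well_id, i.mutant_id, m.gene, f.fv_fm
    FROM image_features AS f
    JOIN identity  AS i ON i.well_id = f.well_id
    JOIN mutations AS m ON m.mutant_id = i.mutant_id
    ORDER BY m.gene
    """
).fetchall()

table_names = sorted(
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
```

The image features are stored once per well, and the repetition only appears in the join result, where a mutant with two mutations yields two rows.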
There is still a bunch of experimental data not included, which should be added to the image_features table:
TODO: add remaining experimental columns:
- Start time and date
- Was there an issue?
- Temperature under camera (avg, max, min)
- Temperature in algae house
- # days M plate grown
- # days S plate grown
- Other quantifiers of fluorescence or shape heterogeneity
> As I understand it, in rare cases more than one gene will have been hit, so perhaps we need to have something like this (to handle up to 3 different gene insertions for a given mutant)? And since we have 6 growth conditions for each mutant, we can't strictly index only on mutant ID. A combo of mutant ID, growth condition, and date (in case we need to do repeats) could be a good way to uniquely identify rows?

Mutant IDs (the LMJ names) are unique and correspond 1-to-1 with unique well locations (with the exception of wild types, which are replicated) in this experiment. Mutant IDs correspond to the name of a strain that was kept in a "biobank" somewhere (many of the strains are from collaborators in Minnesota). Ultimately and scientifically, we want to understand what each gene does, but in practice the strains we are using aren't ideal/clean knockouts, so we measure the physiology of each strain/well.
I think it's important to have the initial database reflect the fact that the data we are collecting is by well/strain.
Since in future experiments we may collect replicated measurements of the same mutants, I think the unique ID should be the well location (i.e. [plate #]-[plate location]).

> I would also propose keeping gene descriptions in a separate table to experiment data? Some of them are very long, and it significantly increases the size of the table. And presumably we won't want to use it in most analysis.

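A small sketch of how a `[plate #]-[plate location]` ID could be built and checked for uniqueness. The function names and the location format (a letter row followed by a column number) are assumptions for illustration only.

```python
from collections import Counter


def make_well_id(plate: int, location: str) -> str:
    """Build a '[plate #]-[plate location]' ID from its two parts."""
    # Normalise whitespace and case so ' b07 ' and 'B07' give the same ID.
    return f"{int(plate)}-{location.strip().upper()}"


def duplicated_ids(ids):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Placeholder wells: the same location on two different plates is fine.
wells = [(1, "A01"), (1, "A02"), (2, "A01")]
well_ids = [make_well_id(plate, loc) for plate, loc in wells]
```

Running `duplicated_ids(well_ids)` before writing the table gives a cheap check that each well appears only once per experiment.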
Sounds like a good idea
> Okay after playing around with this, I've settled on four separate tables, which eliminates repetition and keeps things neat

Keeping the data-munging steps for each category separate is good for clarity, but keeping 4 separate dataframes adds complexity. With parquet, we can also selectively read columns from the saved file. And the other dataframes, aside from image_features, are rather small, and we don't know yet exactly in what combination we will be using the data, so I'm going to save the different tables as one, with the exception of the gene descriptions.
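A rough sketch of combining the small tables into one dataframe with pandas while leaving gene descriptions separate. All column names and values are placeholders, not the real schema.

```python
import pandas as pd

# Placeholder tables with made-up columns and values.
identity = pd.DataFrame(
    {"well_id": ["1-A01", "1-A02"], "mutant_id": ["mutant_A", "mutant_B"]}
)
mutations = pd.DataFrame(
    {
        "mutant_id": ["mutant_A", "mutant_A", "mutant_B"],
        "gene": ["gene_X", "gene_Y", "gene_Z"],
    }
)
image_features = pd.DataFrame({"well_id": ["1-A01", "1-A02"], "fv_fm": [0.6, 0.7]})

combined = (
    image_features
    # validate= raises if a well_id appears more than once in identity.
    .merge(identity, on="well_id", how="left", validate="many_to_one")
    # A mutant with several mutations produces several rows here.
    .merge(mutations, on="mutant_id", how="left")
)
```

Note that the combined table repeats the image features for any mutant with more than one mutation, so per-well statistics need a de-duplication step first. If the result is saved with `DataFrame.to_parquet`, `pd.read_parquet(path, columns=[...])` can load a subset of columns; both need a parquet engine such as pyarrow or fastparquet installed.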
> Okay after playing around with this, I've settled on four separate tables, which eliminates repetition and keeps things neat

I agree with the goal: we should produce a data table indexed by mutated gene, but we need to do the analysis first that will let us get there.
We want to create a database (in practice, this could be a pandas dataframe stored as a CSV) on which all data analysis is based.
This means we can decouple the image processing from the data analysis.
The database creation step should be straightforward to re-run, and update the table as further data comes in.
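One way to make the creation step safe to re-run is to key each row and replace it on conflict, so running the step again with new or corrected data updates the table rather than duplicating rows. The sketch below uses the built-in `sqlite3`; the table layout, function name, and values are assumptions for illustration.

```python
import sqlite3


def update_database(conn, rows):
    """Create the table if needed, then insert or overwrite rows by well_id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS image_features ("
        "well_id TEXT PRIMARY KEY, fv_fm REAL)"
    )
    # INSERT OR REPLACE overwrites any existing row with the same primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO image_features (well_id, fv_fm) VALUES (?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")

# First run: two placeholder wells.
update_database(conn, [("1-A01", 0.60), ("1-A02", 0.70)])
# Second run: one corrected value and one new well.
update_database(conn, [("1-A02", 0.72), ("1-A03", 0.65)])

final_rows = conn.execute(
    "SELECT well_id, fv_fm FROM image_features ORDER BY well_id"
).fetchall()
```

After both runs the table holds three rows, with the second run's value for `1-A02`.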
We discussed the columns which should go into this database at our meeting on 12/20/2023. I think @codercahol you have a note of the columns we agreed on, would you be able to post that here for reference?
Information on mutant identities is in: https://docs.google.com/spreadsheets/d/1_UcLC4jbI04Rnpt2vUkSCObX8oUY6mzl/edit#gid=206647583
From this spreadsheet, we can get: