codercahol / chlamy-ImPi

An image processing pipeline for time-series of Chlamydomonas reinhardtii fluorescence photos

Database creation #7

Closed murraycutforth closed 8 months ago

murraycutforth commented 8 months ago

We want to create a database (in practice, this could be a pandas dataframe stored as a csv) on which all data analysis is based.

This means we can decouple the image processing from the data analysis.

The database creation step should be straightforward to re-run, so that the table can be updated as further data comes in.
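To illustrate what "straightforward to re-run" could look like, here is a minimal sketch using only the Python standard library (the project itself may well use pandas instead). The key columns, the `update_table` helper, and the example values are all hypothetical placeholders, not the agreed schema. New rows are merged into the existing csv by key, so running the step twice on the same data changes nothing.

```python
import csv
import os
import tempfile

# Hypothetical key columns; the real schema was still under discussion.
KEY = ("plate", "well", "measurement_time")


def update_table(csv_path, new_rows, fieldnames):
    """Merge new_rows into the csv at csv_path, keyed on KEY.

    Rows with an existing key are overwritten, so re-running with the
    same input leaves the table unchanged. Returns the row count.
    """
    table = {}
    if os.path.exists(csv_path):
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                table[tuple(row[k] for k in KEY)] = row
    for row in new_rows:
        row = {name: str(row[name]) for name in fieldnames}
        table[tuple(row[k] for k in KEY)] = row
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for key in sorted(table):
            writer.writerow(table[key])
    return len(table)


# Placeholder data, for illustration only.
fields = ["plate", "well", "measurement_time", "fv_fm"]
path = os.path.join(tempfile.mkdtemp(), "database.csv")
batch_1 = [{"plate": 1, "well": "A1", "measurement_time": "2023-12-01", "fv_fm": 0.61}]
batch_2 = batch_1 + [{"plate": 1, "well": "A2", "measurement_time": "2023-12-01", "fv_fm": 0.58}]

n_first = update_table(path, batch_1, fields)
n_second = update_table(path, batch_2, fields)  # adds one new row
n_rerun = update_table(path, batch_2, fields)   # re-run: no change
```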

We discussed the columns which should go into this database at our meeting on 12/20/2023. I think @codercahol you have a note of the columns we agreed on, would you be able to post that here for reference?

Information on mutant identities is in: https://docs.google.com/spreadsheets/d/1_UcLC4jbI04Rnpt2vUkSCObX8oUY6mzl/edit#gid=206647583

From this spreadsheet, we can get:

murraycutforth commented 8 months ago

I've started work on this. I realised that the identity spreadsheet (listed above) has multiple rows for each (plate, row, column) location since each mutant can have multiple mutations, and so it might actually make sense to store this information as separate tables in a database (otherwise you end up with redundant repetitions of the data). I've set up the code as it stands to write out separate csv files, as well as a single sqlite database containing 2 tables. Open to suggestions on how we structure it though!
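As a rough sketch of the two-table idea (not the actual code in the repo), the snippet below splits spreadsheet-like rows into an `identity` table with one row per mutant and a `mutations` table with one row per mutation, using the standard-library `sqlite3` module. The column names and example rows are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical rows shaped like the identity spreadsheet: one row per
# mutation, so a well with two mutations appears twice.
rows = [
    (1, "A1", "mutant_a", "gene_x", "intron"),
    (1, "A1", "mutant_a", "gene_y", "CDS"),
    (1, "A2", "mutant_b", "gene_z", "3'UTR"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE identity (
    mutant_id TEXT PRIMARY KEY,
    plate INTEGER NOT NULL,
    well TEXT NOT NULL
);
CREATE TABLE mutations (
    mutant_id TEXT NOT NULL REFERENCES identity(mutant_id),
    gene TEXT NOT NULL,
    feature TEXT
);
""")

for plate, well, mutant_id, gene, feature in rows:
    # INSERT OR IGNORE keeps a single identity row per mutant.
    conn.execute(
        "INSERT OR IGNORE INTO identity VALUES (?, ?, ?)",
        (mutant_id, plate, well),
    )
    conn.execute(
        "INSERT INTO mutations VALUES (?, ?, ?)",
        (mutant_id, gene, feature),
    )
conn.commit()

n_identity = conn.execute("SELECT COUNT(*) FROM identity").fetchone()[0]
n_mutations = conn.execute("SELECT COUNT(*) FROM mutations").fetchone()[0]
```

With this layout the plate and well of each mutant are stored once, however many mutations it carries.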

codercahol commented 8 months ago

Adrien sent us an updated version of the spreadsheet with the relevant columns sorted towards the front.

And we discussed what to do with the duplicates in our last meeting, I believe, but will verify once I get the chance to review my notes

murraycutforth commented 8 months ago

The spreadsheet link above and the columns are from Adrien's email on Dec 21st, which I think is the latest update I've seen. I'm just pretty sure that there are other columns we mentioned at the time and which I've forgotten, so yeah would be good to see your notes on the columns!

codercahol commented 8 months ago

Columns for the data table:

murraycutforth commented 8 months ago

Thanks, super helpful! Glad someone was taking notes..

As I understand it, in rare cases more than one gene will have been hit, so perhaps we need to have something like this (to handle up to 3 different gene insertions for a given mutant)? And since we have 6 growth conditions for each mutant, we can't strictly index only on mutant ID. A combo of mutant ID, growth condition, and date (in case we need to do repeats) could be a good way to uniquely identify rows?

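| Gene_1 | Feature_1 | Confidence_1 | Gene_2 | Feature_2 | Confidence_2 | Gene_3 | Feature_3 | Confidence_3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| value_1_1 | value_1_2 | value_1_3 | value_2_1 | value_2_2 | value_2_3 | value_3_1 | value_3_2 | value_3_3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |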
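For concreteness, the wide layout above could be produced from one-row-per-mutation records roughly as follows. This is only a sketch: the `to_wide` helper, the tuple layout, and the example values are hypothetical, and unused slots are filled with `None`.

```python
MAX_GENES = 3
FIELDS = ("Gene", "Feature", "Confidence")


def to_wide(mutations):
    """Pivot (mutant_id, gene, feature, confidence) tuples to one row per mutant.

    Each mutant gets Gene_1..Gene_3, Feature_1..Feature_3 and
    Confidence_1..Confidence_3 columns; missing slots are None.
    """
    grouped = {}
    for mutant_id, gene, feature, confidence in mutations:
        grouped.setdefault(mutant_id, []).append((gene, feature, confidence))

    wide = {}
    for mutant_id, hits in grouped.items():
        if len(hits) > MAX_GENES:
            raise ValueError(f"{mutant_id} has more than {MAX_GENES} insertions")
        row = {}
        for i in range(MAX_GENES):
            values = hits[i] if i < len(hits) else (None, None, None)
            for name, value in zip(FIELDS, values):
                row[f"{name}_{i + 1}"] = value
        wide[mutant_id] = row
    return wide


# Placeholder records, for illustration only.
example = to_wide([
    ("mutant_a", "gene_x", "intron", 1),
    ("mutant_a", "gene_y", "CDS", 3),
    ("mutant_b", "gene_z", "5'UTR", 2),
])
```

The growth condition and date would still need to be part of the row key, as noted above; this sketch only covers the gene columns.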

I would also propose keeping gene descriptions in a separate table to experiment data? Some of them are very long, and it significantly increases the size of the table. And presumably we won't want to use it in most analysis.

murraycutforth commented 8 months ago

Okay after playing around with this, I've settled on four separate tables, which eliminates repetition and keeps things neat:

- image_features: contains features extracted from the images, such as Fv/Fm, Y2, NPQ, along with experimental information such as temperatures
- identity: contains information about the identity of each mutant, such as well location, plate number, etc.
- mutations: contains information about the mutations in each mutant, such as disrupted gene name, type, confidence level, etc.
- gene_descriptions: contains lengthy descriptions of each gene
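A sketch of how these four tables might be combined at analysis time, again with the standard-library `sqlite3` module. The table names come from the list above, but every column name and value here is an assumption for illustration; the real tables have more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE identity (mutant_id TEXT PRIMARY KEY, plate INTEGER, well TEXT);
CREATE TABLE mutations (mutant_id TEXT, gene TEXT, confidence INTEGER);
CREATE TABLE gene_descriptions (gene TEXT PRIMARY KEY, description TEXT);
CREATE TABLE image_features (plate INTEGER, well TEXT, fv_fm REAL);

-- Placeholder data, for illustration only.
INSERT INTO identity VALUES ('mutant_a', 1, 'A1'), ('mutant_b', 1, 'A2');
INSERT INTO mutations VALUES ('mutant_a', 'gene_x', 1), ('mutant_a', 'gene_y', 3),
                             ('mutant_b', 'gene_z', 2);
INSERT INTO gene_descriptions VALUES ('gene_x', 'placeholder description');
INSERT INTO image_features VALUES (1, 'A1', 0.61), (1, 'A2', 0.58);
""")

# Join image features to mutant identity by well location, then to the
# mutations of that mutant. gene_descriptions is left out of the join and
# would only be consulted when a description is actually needed.
query = """
SELECT f.plate, f.well, i.mutant_id, m.gene, f.fv_fm
FROM image_features AS f
JOIN identity AS i ON i.plate = f.plate AND i.well = f.well
JOIN mutations AS m ON m.mutant_id = i.mutant_id
ORDER BY i.mutant_id, m.gene
"""
joined = conn.execute(query).fetchall()
```

Note that the join repeats the image features once per mutation, which is the redundancy the separate tables avoid on disk.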

There is still a bunch of experimental data not included, which should be added to the image_features table:

TODO: add remaining experimental columns:
 - Start time and date
 - Was there an issue?
 - Temperature under camera (avg, max, min)
 - Temperature in algae house
 - # days M plate grown
 - # days S plate grown
 - Other quantifiers of fluorescence or shape heterogeneity

codercahol commented 8 months ago

> As I understand it, in rare cases more than one gene will have been hit, so perhaps we need to have something like this (to handle up to 3 different gene insertions for a given mutant)? And since we have 6 growth conditions for each mutant, we can't strictly index only on mutant ID. A combo of mutant ID, growth condition, and date (in case we need to do repeats) could be a good way to uniquely identify rows?

Mutant IDs (the LMJ names) are unique and correspond 1-to-1 with unique well locations in this experiment (with the exception of wild types, which are replicated). Mutant IDs correspond to the name of a strain that was kept in a "biobank" somewhere (many of the strains are from collaborators in Minnesota). Ultimately and scientifically, we want to understand what each gene does, but in practice the strains we are using aren't ideal/clean knockouts, so we measure the physiology of each strain/well.

I think it's important to have the initial database reflect the fact that the data we are collecting is by well/strain.

Since we may collect replicated measurements of the same mutants in future experiments, I think the unique ID should be the well location (i.e. [plate #]-[plate location]).
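A small sketch of how such a well-location ID could be built and parsed. The `well_id` and `split_well_id` helpers, and the exact formatting of the plate location (e.g. "B07"), are assumptions; only the [plate #]-[plate location] pattern comes from the comment above.

```python
def well_id(plate, location):
    """Build a '[plate #]-[plate location]' identifier."""
    location = str(location).strip().upper()
    if not location:
        raise ValueError("plate location must not be empty")
    return f"{int(plate)}-{location}"


def split_well_id(identifier):
    """Inverse of well_id: return (plate number, plate location)."""
    plate, location = identifier.split("-", 1)
    return int(plate), location


example_id = well_id(3, "b07")
example_parts = split_well_id(example_id)
```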

codercahol commented 8 months ago

> I would also propose keeping gene descriptions in a separate table to experiment data? Some of them are very long, and it significantly increases the size of the table. And presumably we won't want to use it in most analysis.

Sounds like a good idea

codercahol commented 8 months ago

> Okay after playing around with this, I've settled on four separate tables, which eliminates repetition and keeps things neat

Keeping the data-munging steps for each category separate is good for clarity, but keeping 4 separate dataframes adds complexity. With parquet, we can also selectively read columns from the saved file. The dataframes aside from image_features are rather small, and we don't yet know exactly in what combination we will be using the data, so I'm going to save the different tables as one, with the exception of the gene descriptions.

codercahol commented 8 months ago

> Okay after playing around with this, I've settled on four separate tables, which eliminates repetition and keeps things neat

I agree with the goal: we should produce a data table indexed by mutated gene, but we need to do the analysis first that will let us get there