Database creation - Githubissues

murraycutforth commented 8 months ago

We want to create a database (in practice, could be a pandas dataframe which is stored as a csv) which all data analysis is based on.

This means we can decouple the image processing from the data analysis.

The database creation step should be straightforward to re-run, and update the table as further data comes in.

We discussed the columns which should go into this database at our meeting on 12/20/2023. I think @codercahol you have a note of the columns we agreed on, would you be able to post that here for reference?

Information on mutant identities is in: https://docs.google.com/spreadsheets/d/1_UcLC4jbI04Rnpt2vUkSCObX8oUY6mzl/edit#gid=206647583

From this spreadsheet, we can get:

strain name (A)
gene number (D)
Plate index (B and C)
feature (E)
confidence level (Z)
description (AO)

murraycutforth commented 8 months ago

I've started work on this. I realised that the identity spreadsheet (listed above) has multiple rows for each (plate, row, column) location since each mutant can have multiple mutations, and so it might actually make sense to store this information as separate tables in a database (otherwise you end up with redundant repetitions of the data). I've set up the code as it stands to write out separate csv files, as well as a single sqlite database containing 2 tables. Open to suggestions on how we structure it though!
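A minimal sketch of how that split could look. The column names (`plate`, `well`, `mutant_id`, `gene`, `feature`, `confidence`) and the example rows are invented for illustration and are not taken from the real spreadsheet. The raw sheet has one row per mutation, so the per-well columns are de-duplicated into one table, the per-mutation columns go into another, and both are written to a single sqlite database (in memory here).

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the identity spreadsheet: one row per mutation,
# so a well with two insertions appears twice.
raw = pd.DataFrame(
    {
        "plate": [1, 1, 1],
        "well": ["A1", "A1", "A2"],
        "mutant_id": ["example_1", "example_1", "example_2"],
        "gene": ["gene_a", "gene_b", "gene_c"],
        "feature": ["CDS", "intron", "CDS"],
        "confidence": [1, 3, 2],
    }
)

# One row per well location.
identity = raw[["plate", "well", "mutant_id"]].drop_duplicates().reset_index(drop=True)

# One row per mutation, linked back to the strain by mutant_id.
mutations = raw[["mutant_id", "gene", "feature", "confidence"]].reset_index(drop=True)

# if_exists="replace" rebuilds each table, so the step is safe to re-run.
conn = sqlite3.connect(":memory:")
identity.to_sql("identity", conn, if_exists="replace", index=False)
mutations.to_sql("mutations", conn, if_exists="replace", index=False)

n_identity = conn.execute("SELECT COUNT(*) FROM identity").fetchone()[0]
n_mutations = conn.execute("SELECT COUNT(*) FROM mutations").fetchone()[0]
```

The same two dataframes could also be written out with `to_csv` if separate csv files are wanted alongside the sqlite file.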

codercahol commented 8 months ago

Adrien sent us an updated version of the spreadsheet with the relevant columns sorted towards the front.

We also discussed what to do with the duplicates in our last meeting, I believe, but I'll verify once I get the chance to review my notes.

murraycutforth commented 8 months ago

The spreadsheet link above and the columns are from Adrien's email on Dec 21st, which I think is the latest update I've seen. I'm just pretty sure that there are other columns we mentioned at the time and which I've forgotten, so yeah would be good to see your notes on the columns!

codercahol commented 8 months ago

Columns for the data table:

index: mutant ID (LMJ name)
gene name (for double mutants, incl both names, for WT record location on plate)
num. days M plate grown
num. days S plate grown
light condition
mask area
threshold(s)
[other parameters quantifying fluorescence or shape heterogeneity?]
plate ID
M-(number) ID
inferred YII series (timeseries of the 'Fv/Fm' measurements)
inferred NPQ series
temperature under camera (series?, average?, max/min?)
temp in algae house (average over M plate growth? over all 18 days of growth?)
start time and date
num. genes mutated
was there an issue (y/n)

murraycutforth commented 8 months ago

Thanks, super helpful! Glad someone was taking notes..

As I understand it, in rare cases more than one gene will have been hit, so perhaps we need to have something like this? (To handle up to 3 different gene insertions for a given mutant.) And since we have 6 growth conditions for each mutant, we can't strictly index only on mutant ID. A combo of mutant ID, growth condition, and date (in case we need to do repeats) could be a good way to uniquely identify rows?

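| Gene_1 | Feature_1 | Confidence_1 | Gene_2 | Feature_2 | Confidence_2 | Gene_3 | Feature_3 | Confidence_3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| value_1_1 | value_1_2 | value_1_3 | value_2_1 | value_2_2 | value_2_3 | value_3_1 | value_3_2 | value_3_3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |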
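As a rough sketch of how a long mutations table (one row per insertion) could be turned into this wide layout with pandas: the input column names, the `slot` helper column, and the example values are all hypothetical, and the limit of 3 insertions is just the number suggested above.

```python
import pandas as pd

# Made-up long-format mutations table: one row per (mutant, insertion).
mutations = pd.DataFrame(
    {
        "mutant_id": ["m1", "m1", "m2"],
        "gene": ["gene_a", "gene_b", "gene_c"],
        "feature": ["CDS", "intron", "CDS"],
        "confidence": [1, 3, 2],
    }
)

MAX_INSERTIONS = 3  # assumption: at most 3 gene insertions per mutant

# Number the insertions within each mutant: 1, 2, 3, ...
mutations["slot"] = mutations.groupby("mutant_id").cumcount() + 1
if mutations["slot"].max() > MAX_INSERTIONS:
    raise ValueError("a mutant has more insertions than the wide layout allows")

# Pivot so each mutant is one row, with one block of columns per slot.
wide = mutations.pivot(
    index="mutant_id", columns="slot", values=["gene", "feature", "confidence"]
)

# Flatten the (field, slot) column pairs into names like Gene_1, Feature_1.
wide.columns = [f"{field.capitalize()}_{slot}" for field, slot in wide.columns]

# Always emit all three blocks in Gene/Feature/Confidence order, even if empty.
ordered = [
    f"{field}_{slot}"
    for slot in range(1, MAX_INSERTIONS + 1)
    for field in ("Gene", "Feature", "Confidence")
]
wide = wide.reindex(columns=ordered)
```

Mutants with fewer than 3 insertions end up with missing values in the unused blocks, which is the redundancy cost of the wide layout compared with keeping a separate mutations table.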

I would also propose keeping gene descriptions in a separate table to experiment data? Some of them are very long, and it significantly increases the size of the table. And presumably we won't want to use it in most analysis.

murraycutforth commented 8 months ago

Okay after playing around with this, I've settled on four separate tables, which eliminates repetition and keeps things neat:

- image_features: contains features extracted from the images, such as Fv/Fm, Y2, NPQ, along with experimental information such as temperatures
- identity: contains information about the identity of each mutant, such as well location, plate number, etc.
- mutations: contains information about the mutations in each mutant, such as disrupted gene name, type, confidence level, etc.
- gene_descriptions: contains lengthy descriptions of each gene
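For reference, a minimal sqlite sketch of how four tables like these could fit together. The column names, the use of `mutant_id` as the join key, and all values are assumptions made for illustration, not the actual schema in the repo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: only a few illustrative columns per table.
conn.executescript(
    """
    CREATE TABLE identity (
        mutant_id TEXT PRIMARY KEY,
        plate     INTEGER,
        well      TEXT
    );
    CREATE TABLE gene_descriptions (
        gene        TEXT PRIMARY KEY,
        description TEXT
    );
    CREATE TABLE mutations (
        mutant_id  TEXT REFERENCES identity(mutant_id),
        gene       TEXT REFERENCES gene_descriptions(gene),
        feature    TEXT,
        confidence INTEGER
    );
    CREATE TABLE image_features (
        mutant_id       TEXT REFERENCES identity(mutant_id),
        light_condition TEXT,
        fv_fm           REAL
    );
    """
)

# Made-up rows, just enough to exercise the join below.
conn.executemany("INSERT INTO identity VALUES (?, ?, ?)", [("m1", 1, "A1"), ("m2", 1, "A2")])
conn.executemany(
    "INSERT INTO gene_descriptions VALUES (?, ?)",
    [("gene_a", "long text"), ("gene_b", "long text"), ("gene_c", "long text")],
)
conn.executemany(
    "INSERT INTO mutations VALUES (?, ?, ?, ?)",
    [("m1", "gene_a", "CDS", 1), ("m1", "gene_b", "intron", 3), ("m2", "gene_c", "CDS", 2)],
)
conn.executemany(
    "INSERT INTO image_features VALUES (?, ?, ?)",
    [("m1", "light_x", 0.65), ("m2", "light_x", 0.70)],
)

# Join measurements to well identity and mutated genes; descriptions stay out
# of the result unless a query asks for them.
rows = conn.execute(
    """
    SELECT f.mutant_id, i.plate, i.well, f.light_condition, f.fv_fm, m.gene
    FROM image_features AS f
    JOIN identity AS i ON i.mutant_id = f.mutant_id
    LEFT JOIN mutations AS m ON m.mutant_id = f.mutant_id
    ORDER BY f.mutant_id, m.gene
    """
).fetchall()
```

A mutant with two insertions shows up as two rows in the joined result, which is the repetition the separate tables avoid on disk.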

There is still a bunch of experimental data not included, which should be added to the image_features table:

TODO: add remaining experimental columns:
 - Start time and date
 - Was there an issue?
 - Temperature under camera (avg, max, min)
 - Temperature in algae house
 - # days M plate grown
 - # days S plate grown
 - Other quantifiers of fluorescence or shape heterogeneity

codercahol commented 8 months ago

> As I understand it, in rare cases more than one gene will have been hit, so perhaps we need to have something like this? (To handle up to 3 different gene insertions for a given mutant.) And since we have 6 growth conditions for each mutant, we can't strictly index only on mutant ID. A combo of mutant ID, growth condition, and date (in case we need to do repeats) could be a good way to uniquely identify rows?

Mutant IDs (the LMJ names) are unique and correspond 1-to-1 with unique well locations in this experiment (with the exception of wild types, which are replicated). Mutant IDs correspond to the name of a strain that was kept in a "biobank" somewhere (many of the strains are from collaborators in Minnesota). Ultimately and scientifically, we want to understand what each gene does, but in practice the strains we are using aren't ideal/clean knockouts, so we measure the physiology of each strain/well.

I think it's important to have the initial database reflect the fact that the data we are collecting is by well/strain.

Since in future experiments we may collect replicated measurements of the same mutants, I think the unique ID should be the well location (i.e. [plate #]-[plate location]).
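A small sketch of what building such a well-location ID could look like. The function names and the assumed well format (row letters followed by a column number, such as `B7`) are illustrative only; the real plate layout may need a different pattern.

```python
import re
from typing import Tuple

# Assumed well format: one or more row letters followed by a column number.
_WELL_PATTERN = re.compile(r"^[A-Z]+\d+$")


def well_id(plate: int, well: str) -> str:
    """Build a '[plate #]-[plate location]' identifier, e.g. '3-B7'."""
    well = well.strip().upper()
    if not _WELL_PATTERN.match(well):
        raise ValueError(f"unrecognised well location: {well!r}")
    return f"{plate}-{well}"


def parse_well_id(identifier: str) -> Tuple[int, str]:
    """Split an identifier built by well_id back into (plate, well)."""
    plate, well = identifier.split("-", 1)
    return int(plate), well


example_id = well_id(3, " b7 ")
example_parts = parse_well_id(example_id)
```

Normalising whitespace and case before building the ID keeps the same well from appearing under two different keys.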

codercahol commented 8 months ago

> I would also propose keeping gene descriptions in a separate table to experiment data? Some of them are very long, and it significantly increases the size of the table. And presumably we won't want to use it in most analysis.

Sounds like a good idea

codercahol commented 8 months ago

> Okay after playing around with this, I've settled on four separate tables, which eliminates repetition and keeps things neat

Keeping the data-munging steps for each of the categories separate is good for clarity, but keeping 4 separate dataframes adds complexity. With parquet, we can also selectively read columns from the saved file. The dataframes other than image_features are rather small, and we don't yet know exactly in what combination we will be using the data, so I'm going to save the different tables as one, with the exception of the gene descriptions.
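Roughly, the merge and the selective read could be sketched as below. The helper names (`merge_tables`, `save_table`, `load_columns`), the `mutant_id` join key, and the example columns are hypothetical; `to_parquet` and `read_parquet` also need a parquet engine such as pyarrow installed, so those two helpers are defined here but not called.

```python
import pandas as pd


def merge_tables(image_features, identity, mutations_wide):
    """Left-join the small tables onto image_features, one row per measurement."""
    merged = image_features.merge(
        identity, on="mutant_id", how="left", validate="many_to_one"
    )
    return merged.merge(
        mutations_wide, on="mutant_id", how="left", validate="many_to_one"
    )


def save_table(df, path):
    # Needs a parquet engine (e.g. pyarrow) to be installed.
    df.to_parquet(path, index=False)


def load_columns(path, columns):
    # Only the requested columns are read back from the file.
    return pd.read_parquet(path, columns=columns)


# Made-up example frames; real column names will differ.
image_features = pd.DataFrame(
    {
        "mutant_id": ["m1", "m1", "m2"],
        "light_condition": ["light_x", "light_y", "light_x"],
        "fv_fm": [0.65, 0.60, 0.70],
    }
)
identity = pd.DataFrame({"mutant_id": ["m1", "m2"], "plate": [1, 1], "well": ["A1", "A2"]})
mutations_wide = pd.DataFrame({"mutant_id": ["m1", "m2"], "Gene_1": ["gene_a", "gene_c"]})

merged = merge_tables(image_features, identity, mutations_wide)
```

The `validate="many_to_one"` argument makes the merge fail loudly if the identity or mutations table ever has more than one row per mutant, which would otherwise silently duplicate measurements.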

codercahol commented 8 months ago

> Okay after playing around with this, I've settled on four separate tables, which eliminates repetition and keeps things neat

I agree with the goal: we should produce a data table indexed by mutated gene, but we need to do the analysis first that will let us get there

codercahol / chlamy-ImPi

Database creation #7