genophenoenvo / terraref-datasets

Repository for code and small datasets derived from the TERRA REF program
MIT License

Curate initial dataset for ML team #1

Closed dlebauer closed 4 years ago

dlebauer commented 5 years ago

These are the data that we want to start with:

[image]

As a starting point, some examples curated for another project are available here: https://terraref.ncsa.illinois.edu/d3m/. The shape of the data will be different, though, and we also need to work out how to properly curate and annotate these (#2).

Genomics data

We need feedback from Pankaj on what genomics data to include; hopefully we can use data from http://datacommons.cyverse.org/browse/iplant/home/shared/terraref

Phenotypes

Environment

dlebauer commented 5 years ago

We will use these from the BAP population: http://datacommons.cyverse.org/browse/iplant/home/shared/terraref/genomics/derived_data/bap/resequencing/danforth_center/version1/gvcf

There is one file per genotype

This is the combined file: http://datacommons.cyverse.org/browse/iplant/home/shared/terraref/genomics/derived_data/bap/resequencing/danforth_center/version1/hapmap

diatomsRcool commented 5 years ago

Here's an example of what I'm imagining the data will look like:

| Plot | Soil Moisture Day 1 Max | Soil Moisture Day 1 Min | Soil Moisture Day 1 Mean | etc for all params and all days | Emergence Date | all phenotypes | growing degree days at emergence | Anything else you think is interesting |
|---|---|---|---|---|---|---|---|---|
| Plot 1 | 4.2 | 2.3 | 3.2 | etc | 20 | etc | 10 | etc |

I'm thinking we should be using dates that just count from time zero at planting and not actual calendar dates.
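A minimal sketch of counting from time zero at planting, using only the Python standard library. The planting date and observation dates below are made-up placeholders, not values from the TERRA REF data, and `days_after_planting` is a hypothetical helper name.

```python
from datetime import date


def days_after_planting(observed: date, planting: date) -> int:
    """Return whole days elapsed since planting (planting day itself is 0)."""
    return (observed - planting).days


# Hypothetical planting date and observation dates, for illustration only.
planting_date = date(2018, 4, 20)
observation_dates = [date(2018, 4, 20), date(2018, 4, 21), date(2018, 5, 10)]

# Replace each calendar date with its offset from planting.
day_index = [days_after_planting(d, planting_date) for d in observation_dates]
```

Indexing by offset rather than calendar date would let plots or seasons with different planting dates share the same column names.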

dlebauer commented 5 years ago

@diatomsRcool so to confirm, you are looking for a table with one row per plot? I expect these will end up 100s of columns wide if we have min/mean/max for each environmental parameter x day.

Other than soil moisture, each column of environmental data will have the same value repeated for every plot.
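To make the one-row-per-plot layout concrete, here is a rough sketch that summarizes long-format readings into max/min/mean columns per parameter and day. The record layout `(plot, parameter, day, value)`, the column naming scheme, and all sample values are assumptions for illustration; the real source tables may be shaped differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format readings: (plot, parameter, day, value).
readings = [
    ("Plot 1", "soil_moisture", 1, 4.2),
    ("Plot 1", "soil_moisture", 1, 2.3),
    ("Plot 1", "soil_moisture", 1, 3.1),
    ("Plot 1", "soil_moisture", 2, 3.0),
    ("Plot 1", "air_temperature", 1, 25.0),
    ("Plot 2", "soil_moisture", 1, 5.0),
]


def to_wide(records):
    """Collapse readings into one dict per plot with max/min/mean columns."""
    grouped = defaultdict(list)
    for plot, param, day, value in records:
        grouped[(plot, param, day)].append(value)

    wide = defaultdict(dict)
    for (plot, param, day), values in grouped.items():
        prefix = f"{param}_day{day}"
        wide[plot][f"{prefix}_max"] = max(values)
        wide[plot][f"{prefix}_min"] = min(values)
        wide[plot][f"{prefix}_mean"] = mean(values)
    return dict(wide)


wide = to_wide(readings)
```

Each (parameter, day) pair contributes three columns, so the width grows as 3 x parameters x days, which is why the table quickly reaches hundreds of columns.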

diatomsRcool commented 5 years ago

Yes. @remcochang and @rossarun can override me.

MagicMilly commented 4 years ago

@diatomsRcool @remcochang @rossarun

Works in Progress:

You can comment on the spreadsheets and gists directly, but it would be best to post questions and comments here so that we can all see them. Thank you!

dlebauer commented 4 years ago

MagicMilly commented 4 years ago

The first iteration of the ML training data can be found in the Google Drive for now. I am closing this issue and creating new ticket(s) for the second iteration, which will include genomics data.