Create prototype of data table from downloaded NEON plant community composition and phenology data

MagicMilly commented 4 years ago

Referring to this data product: https://data.neonscience.org/data-products/DP1.10058.001

- [x] See their README for a good example of README contents
Step 1
- [x] Download a dataset
- [x] transform to a table similar to the datasets generated for TERRA REF Sorghum, columns including species, location, date, cover, height (this info included in plant presence download)
Step 2
- [x] get feedback on a sample of data and data cleaning code from David and Ryan
- [ ] review additional sites / date ranges / data to prepare
Step 3 apply to additional datasets
- [ ] download from additional sites
- [ ] revise script per feedback and to process from additional sites
- [ ] May need to narrow traits we need by sites (and other data available for those sites)

rbartelme commented 4 years ago

@MagicMilly this may be blocked by my related issue (which is in the DIAG org repo at the moment, but I'm not sure if it's actually a blocker for you yet). I don't know if I shared the notes from meeting with Dr. Stanish RE: the metagenomes with you. Let me know if you need to schedule a meeting next week to discuss some of this in more detail.

MagicMilly commented 4 years ago

Do you think the prototype table for one trait would be blocked, or just the narrowing down bit? If the former, then yes I would like to discuss.

rbartelme commented 4 years ago

@MagicMilly I think that testing for a single trait would be fine. Narrowing the temporal scale will be dependent on whenever I can get the microbial data onto the new cluster.

MagicMilly commented 4 years ago

Moving to next Sprint and may need to break up into smaller tickets for all the tasks / steps. One site file for a specific trait can be cleaned up to look like this, but additional data/metadata are needed to understand the table (e.g. the meaning of the endDate column, which only contains eight unique values).

Slice of the dataset linked above: [screenshot: Screen Shot 2020-09-01 at 10.12.38 AM.png]
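
One way to look into a column like endDate is to list its distinct values and how often each occurs. This is only a minimal sketch using the Python standard library: the inline sample rows and the `value_counts` helper are hypothetical, and only the endDate column name comes from this thread.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a few rows of the downloaded site file.
# Only the endDate column name is taken from the thread.
SAMPLE_CSV = """uid,endDate,percentCover
a1,2019-06-11,12
a2,2019-06-11,3
a3,2019-06-12,40
a4,2019-06-12,
"""


def value_counts(rows, column):
    """Count how often each non-empty value appears in one column."""
    return Counter(row[column] for row in rows if row[column])


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
end_date_counts = value_counts(rows, "endDate")

print(len(end_date_counts), "unique endDate values")
for value, count in sorted(end_date_counts.items()):
    print(value, count)
```

With the real file, replacing `io.StringIO(SAMPLE_CSV)` with an opened file handle should be enough to see which dates the column actually holds.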

MagicMilly commented 4 years ago

Still working on Step 1 after receiving feedback on the initial table. The ticket may also need to be broken up into smaller tasks, so I'm bumping it to the next Sprint.

dlebauer commented 4 years ago

This R package might simplify the workflow: https://cran.r-project.org/web/packages/neonstore/index.html

here is an example: https://github.com/eco4cast/neon4cast-neon-download/blob/master/download.R

MagicMilly commented 4 years ago

Thank you very much - this looks extremely helpful! I'll start with it and ask Kristina for any R-related help.

MagicMilly commented 4 years ago

Met with @KristinaRiemer today and have a much better understanding of the data in the file we want (the 1 square meter data with the percent cover and height observations). My initial prototype table was incorrect, so I'll be creating a corrected one now as a script in PyCharm, working with one local file to start. Also working on converting notebooks to scripts as described in #109.

MagicMilly commented 3 years ago

@KristinaRiemer I created one output table in this repo, using one input csv in the data folder. I've included all location, date, and plant data. I kept the heightPlantOver300cm column because I've seen data in that column from other sites. No columns were renamed since we don't know when we'll take that step while combining all input data. I can add a script today, and we can chat about next steps for the next Sprint.

KristinaRiemer commented 3 years ago

@MagicMilly the output table isn't actually in the repo, right? And it was generated by the .ipynb in the code folder? Though you just started working with scripts, it would be beneficial to eventually do all of your cleaning work in scripts from start to finish, I think. Let me know when the script is ready!

MagicMilly commented 3 years ago

@KristinaRiemer I converted the notebook to script and included the expected output in the data folder along with the input data file. Let me know if it works for you!

KristinaRiemer commented 3 years ago

@MagicMilly I was able to run the script without any errors and the resulting data file was the same!
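
A check like "the resulting data file was the same" can also be automated. The sketch below is one possible approach, not what was actually run here: the file names and the `same_contents` helper are hypothetical, and throwaway files stand in for the real script output and the expected output in the data folder.

```python
import filecmp
import tempfile
from pathlib import Path


def same_contents(generated: Path, expected: Path) -> bool:
    """Return True when the two files are byte-for-byte identical."""
    filecmp.clear_cache()  # avoid reusing a cached result for rewritten files
    return filecmp.cmp(generated, expected, shallow=False)


# Demonstration with temporary files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    expected = Path(tmp) / "expected_output.csv"
    generated = Path(tmp) / "generated_output.csv"

    expected.write_text("uid,percentCover\na1,12\n")
    generated.write_text("uid,percentCover\na1,12\n")
    files_match = same_contents(generated, expected)

    # Change one value so the comparison should now fail.
    generated.write_text("uid,percentCover\na1,130\n")
    files_differ = not same_contents(generated, expected)

print("identical:", files_match)
print("detected change:", files_differ)
```

A byte-for-byte comparison is strict: it would also flag differences in line endings or column order, which may or may not be wanted.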

I was thinking for next steps that it might be useful to collaborators to have some summary stats about this data file? Maybe number of unique sites (from lat/lon), a genus/species list w/ numbers (the scientificName column is a mess but the TaxonID one might be useful), etc.
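
As a rough illustration of those summary stats, here is a minimal standard-library sketch. The sample rows and taxon codes are made up, and the column names `decimalLatitude` and `taxonID` are assumptions about the file (only decimalLongitude and "TaxonID" are mentioned in this thread).

```python
import csv
import io
from collections import Counter

# Made-up sample rows; column names other than decimalLongitude are assumed.
SAMPLE_CSV = """uid,decimalLatitude,decimalLongitude,taxonID
a1,31.91,-110.84,BOER4
a2,31.91,-110.84,BOER4
a3,31.91,-110.84,ERLE
a4,31.82,-110.87,ERLE
a5,31.82,-110.87,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Treat each distinct (lat, lon) pair as one site.
unique_sites = {(r["decimalLatitude"], r["decimalLongitude"]) for r in rows}

# Count records per taxon, skipping rows with no taxon code.
taxon_counts = Counter(r["taxonID"] for r in rows if r["taxonID"])

print("unique sites:", len(unique_sites))
for taxon, count in taxon_counts.most_common():
    print(taxon, count)
```

Coordinates are compared as strings here, so "31.91" and "31.910" would count as different sites; rounding to a fixed precision first might be safer on real data.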

The heightPlantOver300cm column might not be that useful for plant height because that threshold (300 cm) is nearly 10 feet, which is pretty tall for most plants, and none of the plants in this dataset exceed it.
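
For reference, the unit conversion and a quick "does this column hold anything" check could look like the sketch below. The sample rows and the `non_empty_count` helper are hypothetical, and how the column encodes its values is not described in this thread, so the check only counts non-empty cells.

```python
import csv
import io

CM_PER_FOOT = 30.48  # exact, since 1 inch is defined as 2.54 cm

threshold_ft = 300 / CM_PER_FOOT
print(f"300 cm is {threshold_ft:.2f} ft")  # about 9.84

# Made-up rows where the column is present but never filled in.
SAMPLE_CSV = """uid,percentCover,heightPlantOver300cm
a1,12,
a2,3,
a3,40,
"""


def non_empty_count(rows, column):
    """Number of rows with any non-blank value in the given column."""
    return sum(1 for row in rows if row[column].strip())


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
filled = non_empty_count(rows, "heightPlantOver300cm")
print("rows with a heightPlantOver300cm value:", filled)
```

Running the same count per site would show whether the column is worth keeping for the sites where it does contain data.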

MagicMilly commented 3 years ago

Yay! That is a great idea for next steps, thank you. I'll write up a new ticket for you to review.

dlebauer commented 3 years ago

To clarify, the goal is to have a data table that is as close as possible to the ones our collaborators have been using:

- same column names, units, formats
- for the first draft it should exclude any other columns
- e.g. in the original request it was "species, location, date, cover, height"; to be more specific:
  - namedLocation --> sitename (maybe concatenate with plotID and subplotID, e.g. sitename_plotID_subplotID)
  - decimalLatitude --> lat, decimalLongitude --> lon
  - percentCover --> canopy_cover, etc.
  - keep uid so we can always go back and get more information
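
A possible shape for that renaming step, as a minimal standard-library sketch. The `COLUMN_MAP` dictionary, the `to_collaborator_table` helper, and the sample rows are hypothetical. It assumes decimalLatitude maps to lat and decimalLongitude to lon, and it only covers the columns named above; species, date, and height would need entries once their target names are settled.

```python
import csv
import io

# Source column -> target column. Targets other than sitename, lat,
# canopy_cover, and uid are assumptions for illustration.
COLUMN_MAP = {
    "uid": "uid",
    "namedLocation": "sitename",
    "decimalLatitude": "lat",
    "decimalLongitude": "lon",
    "percentCover": "canopy_cover",
}

# Made-up input rows with one extra column that should be dropped.
SAMPLE_CSV = """uid,namedLocation,decimalLatitude,decimalLongitude,percentCover,heightPlantOver300cm
a1,SITE_001,31.91,-110.84,12,
a2,SITE_002,31.82,-110.87,3,
"""


def to_collaborator_table(rows):
    """Keep only mapped columns and rename them to the target names."""
    return [{new: row[old] for old, new in COLUMN_MAP.items()} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
converted = to_collaborator_table(rows)

# Write the result as CSV text (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(COLUMN_MAP.values()), lineterminator="\n")
writer.writeheader()
writer.writerows(converted)
print(buffer.getvalue())
```

Keeping the mapping in one dictionary means the first draft excludes every unmapped column by construction, and new columns can be added later by editing a single place.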

Also, please use the neonstore R package to download the data, something like https://github.com/eco4cast/neon4cast-neon-download/blob/master/download.R#L18

KristinaRiemer commented 3 years ago

@MagicMilly do you have a followup issue for @dlebauer's previous comment?

MagicMilly commented 3 years ago

I don't think I do yet, but I'll create one today and tag you

MagicMilly commented 3 years ago

Closing this issue; follow-up ticket #117 incorporates feedback on this ticket

genophenoenvo / terraref-datasets

Create prototype of data table from downloaded NEON plant community composition and phenology data #95