Merging multiple trait "metadata" files under one record

serbinsh commented 4 years ago

I may have raised this before, but I am having an issue with a new dataset on EcoSIS. We are merging both leaf and canopy spectra files together with a metadata file and two separate trait/gasex files. We are loading the spec files as data (green) and the main metadata file, as well as the leaf trait and gasex files, as metadata (blue). All files have the same unique ID to link all of the data together. In the viewer on the web page, when you flick through the spec you can find all of the data linked together. However, when I pull the data via the API in R or download the data "linked", only the leaf trait data is connected and the gasex observations are ignored. How do we load the data to make sure all associated data/metadata are connected together?

The new record in question is: https://ecosis.org/package/5090905b-176c-4d17-bf60-59a69939eea6

jrmerz commented 4 years ago

On first inspection, this looks like a bug as the dataset appears to be joined correctly. Will investigate.

serbinsh commented 4 years ago

@jrmerz ok thanks for the update

jrmerz commented 4 years ago

Actually, I might take that back. Looking at two different leaf spectra, 2018-04-03_79 and 2018-03-21_65:

2018-04-03_79 can be accessed at https://ecosis.org/api/spectra/search/5090905b-176c-4d17-bf60-59a69939eea6?text=&filters=%5B%7B%22Latin%20Species%22%3A%22%2F%5Esativus%24%2F%22%7D%5D&start=86&stop=87 and appears to have leaf traits attached.

2018-03-21_65 can be accessed at https://ecosis.org/api/spectra/search/5090905b-176c-4d17-bf60-59a69939eea6?text=&filters=%5B%7B%22Latin%20Species%22%3A%22%2F%5Esativus%24%2F%22%7D%5D&start=0&stop=1 and does not have leaf traits. However upon downloading leaf_traits.csv, 2018-03-21_65 does not exist in the traits spreadsheet, so there is nothing to join.

Please let me know if I am missing something here.
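For anyone who wants to repeat this check on other samples, the search URLs quoted above can be rebuilt in R. This is only a minimal sketch: the endpoint path and the `text`, `filters`, `start`, and `stop` parameters are copied from those URLs, while `build_search_url` and its argument names are hypothetical and not part of any EcoSIS client.

```r
# Hypothetical helper: rebuild an EcoSIS spectra search URL like the ones quoted above.
# Only the endpoint and query parameter names come from the thread.
build_search_url <- function(package_id, filters_json, start_index, stop_index,
                             base = "https://ecosis.org/api/spectra/search") {
  # Percent-encode the JSON filter, including reserved characters such as [ { " : /
  encoded <- utils::URLencode(filters_json, reserved = TRUE)
  sprintf("%s/%s?text=&filters=%s&start=%d&stop=%d",
          base, package_id, encoded, start_index, stop_index)
}

# Filter taken from the decoded query string: Latin Species matching /^sativus$/
filters <- '[{"Latin Species":"/^sativus$/"}]'
url <- build_search_url("5090905b-176c-4d17-bf60-59a69939eea6", filters, 86L, 87L)
print(url)
```

The printed URL should match the first link above character for character, which makes it easy to step through other `start`/`stop` windows.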

serbinsh commented 4 years ago

Oh boy ok let us look into this. Could be a data join issue, as you state.

serbinsh commented 4 years ago

@jrmerz After reviewing your comment: this is the expected behavior and matches some of our other datasets. That is, we upload spec and traits and there isn't always 1-to-1 matching; sometimes a trait doesn't have a spec, or a spec doesn't have a trait to match with, post QA/QC or due to other issues. This generally doesn't cause us a problem; we just get empty cells when using the data, which is the correct behavior. The larger issue is that we have 2 different associated "trait" datasets to connect with the spec, and only 1 is coming through with the download or via the API. If you view the data on the website you can see the cases where they both link, so it's unclear why not all of the trait data is linked when downloading.
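The expected result can be illustrated with a small base R sketch. It is not how EcoSIS performs the join internally; the sample IDs, values, and the `Wave_350`, `LMA`, and `Asat` columns in these toy tables are made up for illustration, and only the `uniquefield` key comes from the dataset.

```r
# Hypothetical toy tables; only the `uniquefield` key name comes from the dataset.
spectra <- data.frame(
  uniquefield = c("S1", "S2", "S3"),
  Wave_350 = c(0.051, 0.048, 0.050),
  stringsAsFactors = FALSE
)
leaf_traits <- data.frame(
  uniquefield = "S3",
  LMA = 42.0,
  stringsAsFactors = FALSE
)
leaf_gasex <- data.frame(
  uniquefield = c("S2", "S3"),
  Asat = c(18.2, 21.5),
  stringsAsFactors = FALSE
)

# Left-join each metadata table onto the spectra so that no spectrum is dropped;
# spectra without a match in a table get NA (an empty cell) for its columns.
linked <- Reduce(
  function(x, y) merge(x, y, by = "uniquefield", all.x = TRUE),
  list(spectra, leaf_traits, leaf_gasex)
)
print(linked)
```

In this sketch S1 has neither traits nor gas exchange, S2 has gas exchange only, and S3 has both, so all three rows survive and the unmatched cells are `NA`.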

jrmerz commented 4 years ago

Can you provide me with an example uniquefield value that has the issue so I can inspect?

regnans commented 4 years ago

Here are some examples of different combinations. 2018-03-26_65 has a leaf_spectra and leaf_gas_exchange trait. 2018-03-29_17 has a leaf_spectra, and both leaf_traits and leaf_gas_exchange.

serbinsh commented 4 years ago

@jrmerz any thoughts or updates on this? Let us know if we should re-structure or modify the data to better match expectations. We are working on getting the data published, so we expect to need a DOI and a finalized version at some point in the future. It's just not clear at the moment how we move forward.

serbinsh commented 4 years ago

...we will also have another similar dataset to upload soon so if there is anything we can learn from this to make that one go more smoothly, that would be good to know. thanks!

jrmerz commented 4 years ago

Sorry, this slipped off my plate. What's the title of your dataset? It looks like you removed it.

jrmerz commented 4 years ago

Never mind, found it; the top link was just wrong.

serbinsh commented 4 years ago

That's weird, because I just looked again and the ID comes up as 5090905b-176c-4d17-bf60-59a69939eea6. Is that what worked for you as well? I think this matches the ID above. Oh, maybe it was public and we pulled it back to private, and the older public link is broken?

jrmerz commented 4 years ago

Let's get on the same page: please provide the link to the dataset that you wish me to look at and that is described in the issue above.

serbinsh commented 4 years ago

@jrmerz It's the same ID I just noted:

https://data.ecosis.org/dataset/hyperspectral-leaf-reflectance--biochemistry--and-physiology-of-droughted-and-watered-crops

https://data.ecosis.org/import/?id=5090905b-176c-4d17-bf60-59a69939eea6

ID: 5090905b-176c-4d17-bf60-59a69939eea6

Does that help to clarify?

jrmerz commented 4 years ago

Thanks @serbinsh, it's a bug with EcoSIS. It has to do with the MapReduce query that runs when a dataset is pushed. Give me a day or two to test and verify the fix. I'll let you know when things are good to go.

On your end, once the fix is added to production, you will just need to re-publish the dataset.

serbinsh commented 4 years ago

Awesome, thank you @jrmerz! A few days is no problem. If you need longer, let me know; the manuscript is still under review, so at the moment this isn't super urgent.

jrmerz commented 4 years ago

@serbinsh I have pushed a fix to the dev server. Would you mind giving it a test with your dataset and making sure everything looks correct? Afterward you are free to delete the dataset from the dev server

https://dev-data.ecosis.org/ Test should show up here after push: https://dev-search.ecosis.org/

serbinsh commented 4 years ago

@jrmerz Working on testing this out. One issue: I can't seem to remember my password for the dev site, and the recovery email isn't coming through. No matter, but when I created another user I didn't see a way to add myself to an organization or create one, so I can't upload and test at the moment.

jrmerz commented 4 years ago

@serbinsh did you check your SPAM folder? That is where my password recovery emails always end up from ecosis. Other alternatives are 1) I manually add you to org or 2) I generate a temp password for you and send to you offline. Let me know which you prefer

serbinsh commented 4 years ago

@jrmerz Yeah, I have scoured all of my spam folders; I wasn't sure if my original serbinsh username was under serbinsh@gmail.com, serbin@wisc.edu, or sserbin@bnl.gov. Could you please generate a temp password for serbinsh (which is part of an org) and send it my way, if not too much trouble?

Thanks!

jrmerz commented 4 years ago

OK, if I share it with serbinsh@gmail.com, will that work?

serbinsh commented 4 years ago

OK, so far it looks like it's working:

https://dev-search.ecosis.org/package/09479e5a-22f8-4924-ad20-1562cc900459

For example if you scroll to observation "Spectra 586 of 2462" you will see all the trait data including gasex listed.

Next let me try pulling via API to see if all the data comes into R properly

serbinsh commented 4 years ago

This is promising

> message("Download complete!")
Download complete!
> names(dat_raw)[1:40]
 [1] "ABA"                   "Amino_Acids"           "Asat"                  "CO2s"                  "Ci"                    "Days_Into_Treatment"  
 [7] "Elemental_C"           "Elemental_N"           "Fructose"              "Glucose"               "H2Os"                  "HDP_Fructan"          
[13] "Instrument"            "LDP_Fructan"           "LMA"                   "Location"              "Measurement_Date"      "Paired_Spectra"       
[19] "Plant"                 "Plant_Age"             "Plot"                  "Pre_or_Post_Treatment" "Proline"               "Protein"              
[25] "Qin"                   "RHs"                   "RWC"                   "Rep"                   "Species"               "Starch"               
[31] "Sucrose"               "Tleaf"                 "Tr"                    "Treatment"             "VPDleaf"               "flow"                 
[37] "gs"                    "uniquefield"           "350"                   "351"

Also confirmed that all 2462 obs are in the R object via API
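A quick way to confirm that both metadata files made it into an export is to count, per file, how many rows carry at least one value. The sketch below is hedged: `coverage_by_group` is a hypothetical helper, the toy values are made up, and the assignment of `LMA`/`Starch` to the leaf trait file and `Asat`/`gs` to the gas exchange file is an assumption based on the column names listed above, not something the export itself records.

```r
# Hypothetical helper: for each named group of columns, count the rows that
# have at least one non-missing value in that group.
coverage_by_group <- function(dat, groups) {
  vapply(groups, function(cols) {
    missing_cols <- setdiff(cols, names(dat))
    if (length(missing_cols) > 0) {
      stop("Missing columns: ", paste(missing_cols, collapse = ", "))
    }
    sum(rowSums(!is.na(dat[, cols, drop = FALSE])) > 0)
  }, integer(1))
}

# Toy stand-in for dat_raw; the values are invented for illustration.
toy <- data.frame(
  uniquefield = c("S1", "S2", "S3"),
  LMA = c(NA, NA, 42.0),
  Starch = c(NA, NA, 3.1),
  Asat = c(NA, 18.2, 21.5),
  gs = c(NA, 0.21, 0.30),
  stringsAsFactors = FALSE
)

# Assumed grouping of columns by source file
groups <- list(
  leaf_traits = c("LMA", "Starch"),
  gas_exchange = c("Asat", "gs")
)
coverage <- coverage_by_group(toy, groups)
print(coverage)
```

With the toy table this reports 1 row with leaf traits and 2 rows with gas exchange. Run against `dat_raw`, a count of 0 for a group would be the symptom described at the top of this issue.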

@regnans Looks like it's working for me.

Here is how I tested the API:

#---------------- Close all devices and delete all variables. -------------------------------------#
rm(list=ls(all=TRUE))   # clear workspace
graphics.off()          # close any open graphics
closeAllConnections()   # close any open connections to files

list.of.packages <- c("readr","httr","dplyr","ggplot2")  # packages needed for script
# check for dependencies and install if needed
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

# load libraries needed for script
library(readr)    # readr - read_csv function to pull data from EcoSIS
library(httr)     # httr - GET request used by source_GitHubData below
library(dplyr)
library(ggplot2)  # ggplot2 - histograms of the trait and gas exchange columns

# define function to read a delimited text file (e.g. PLSR model coefficients) from a GitHub URL
#devtools::source_gist("gist.github.com/christophergandrud/4466237")
source_GitHubData <-function(url, sep = ",", header = TRUE) {
  require(httr)
  request <- GET(url)
  stop_for_status(request)
  handle <- textConnection(content(request, as = 'text'))
  on.exit(close(handle))
  read.table(handle, sep = sep, header = header)
}

# not in
`%notin%` <- Negate(`%in%`)
#--------------------------------------------------------------------------------------------------#

#--------------------------------------------------------------------------------------------------#
### Set working directory (scratch space)
outdir <- tempdir()
setwd(outdir) # set working directory
print(getwd())  # check wd
#--------------------------------------------------------------------------------------------------#

#--------------------------------------------------------------------------------------------------#
### Grab data
print("**** Downloading Ecosis data ****")
ecosis_id <- "09479e5a-22f8-4924-ad20-1562cc900459"  # test package on the dev server (linked above)
ecosis_file <- sprintf(
  "https://dev-search.ecosis.org/api/package/%s/export?metadata=true",
  ecosis_id
)
message("Downloading data...")
dat_raw <- read_csv(ecosis_file)
message("Download complete!")
names(dat_raw)[1:40]
head(dat_raw)
#--------------------------------------------------------------------------------------------------#

#--------------------------------------------------------------------------------------------------#
### Prepare data
Start.wave <- 500
End.wave <- 2400
wv <- seq(Start.wave,End.wave,1)
spectra <- data.frame(dat_raw[,names(dat_raw) %in% wv])
names(spectra) <- c(paste0("Wave_",wv))
head(spectra)[,1:5]

sample_info <- dat_raw[,names(dat_raw) %notin% seq(350,2500,1)]
head(sample_info)
names(sample_info)

ggplot(sample_info, aes(x=Asat)) + geom_histogram()
ggplot(sample_info, aes(x=Starch)) + geom_histogram()
ggplot(sample_info, aes(x=RWC)) + geom_histogram()
#--------------------------------------------------------------------------------------------------#
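The data preparation step above assumes every wavelength from 500 to 2400 is present as a column. A slightly more defensive variant is sketched below; `split_export` is a hypothetical helper, not part of the script above, and the toy table is made up for illustration. It keeps only the requested wavelengths that actually exist and treats any purely numeric column name as a wavelength when building the sample information table.

```r
# Hypothetical helper: split an export into spectra and sample info by column name.
split_export <- function(dat, wavelengths) {
  # Keep only requested wavelengths that exist as columns, in the requested order
  wave_cols <- intersect(as.character(wavelengths), names(dat))
  spectra <- as.data.frame(dat[, wave_cols, drop = FALSE])
  names(spectra) <- paste0("Wave_", wave_cols)
  # Any purely numeric column name is treated as a wavelength, whatever its range
  is_wave <- grepl("^[0-9]+(\\.[0-9]+)?$", names(dat))
  sample_info <- as.data.frame(dat[, !is_wave, drop = FALSE])
  list(spectra = spectra, sample_info = sample_info)
}

# Toy stand-in for dat_raw; the values are invented for illustration.
toy <- data.frame(
  uniquefield = c("S1", "S2"),
  Asat = c(20.1, NA),
  `499` = c(0.10, 0.11),
  `500` = c(0.12, 0.13),
  `501` = c(0.14, 0.15),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

parts <- split_export(toy, 500:2400)
names(parts$spectra)      # "Wave_500" "Wave_501"
names(parts$sample_info)  # "uniquefield" "Asat"
```

Because the column names are taken from the intersection, a dataset with a narrower spectral range would not trigger a length mismatch when the `Wave_` names are assigned.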

jrmerz commented 4 years ago

Shawn, this all looks great. I have deployed the fix to production. Please test and you or I can close issue if everything looks good on your end.

serbinsh commented 4 years ago

@jrmerz @regnans OK uploaded the dataset again to the main ecosis site and it seems to be parsing correctly: https://ecosis.org/package/5090905b-176c-4d17-bf60-59a69939eea6

Tested the API and we look to be good now, thanks!

jrmerz commented 4 years ago

Great! Closing issue

CSTARS / ecosis #47