IMCR-Hackathon / Hackathon-Central-2018

Command center for IMCR Hackathon participants to share ideas, coordinate teams, develop projects and access all logistics information

integrating multiple streams of linked data #9

Open srearl opened 6 years ago

srearl commented 6 years ago

A data challenge that I wrestle with is integrating multiple streams of linked or related data. An example would be a research effort that involves collecting environmental samples and then running a series of analyses on those samples. The analyses could be field measurements, a manual process conducted in the lab, measurements made with instrumentation, or many others, in any combination. The outcome of each step or analysis must be related to other outcomes or workflows. I use custom web applications and databases to address this, but that approach is complicated and a lot of work. It would be great if there were a platform or tool for such workflows that was generalizable enough to cover a wide array of situations and use cases.
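
As a rough illustration of the relational pattern described above, here is a minimal sketch in base R with hypothetical table and column names: each analysis outcome carries the ID of the sample it was run on, so the separate streams can be joined back together.

```r
# Minimal sketch (hypothetical tables and columns): one table of samples and
# one table per analysis type, linked by a shared sample_id key.
samples <- data.frame(
  sample_id = c("S001", "S002", "S003"),
  site      = c("plotA", "plotA", "plotB"),
  collected = as.Date(c("2018-06-01", "2018-06-01", "2018-06-02"))
)

lab_results <- data.frame(
  sample_id = c("S001", "S002"),
  analyte   = c("nitrate", "nitrate"),
  value     = c(0.42, 0.37)
)

field_measurements <- data.frame(
  sample_id = c("S001", "S003"),
  temp_c    = c(21.5, 19.8)
)

# Relate each analysis outcome back to the sample it came from; all = TRUE
# keeps samples that are missing one of the analyses.
linked <- merge(samples, lab_results, by = "sample_id", all = TRUE)
linked <- merge(linked, field_measurements, by = "sample_id", all = TRUE)
```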

jhp7e commented 6 years ago

Some of the tidyr functions in R (notably gather, spread, separate, and unite, combined with base R merge) could be helpful with this. The big challenge is that the links that glue things together for merging are often fuzzier than one might like. I've got one dataset where some data are reported to the year, month, and day, but other related data are reported only to the nearest year and month. Coding of stations can also be inconsistent. An interesting problem!
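
For instance, one way to handle that mismatched temporal resolution is to coarsen the finer series to year-month before merging. A sketch with made-up data and column names:

```r
library(tidyr)

# Daily observations (hypothetical data).
daily <- data.frame(
  station = c("ST1", "ST1"),
  date    = as.Date(c("2017-03-04", "2017-03-18")),
  flow    = c(1.2, 1.5)
)

# Related data reported only to the nearest year and month.
monthly <- data.frame(
  station = "ST1",
  yearmon = "2017-03",
  precip  = 88
)

# Coarsen the daily dates to year-month so the keys line up, then merge.
daily$yearmon <- format(daily$date, "%Y-%m")
combined <- merge(daily, monthly, by = c("station", "yearmon"))

# separate() can then split the compound key back into its components,
# e.g. turning "2017-03" into year and month columns.
combined <- separate(combined, yearmon, into = c("year", "month"), sep = "-")
```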

kcawley commented 6 years ago

NEON uses "named locations", date/time, sample ID, and sample class to link together this type of data on the OS (observational systems) side. On the IS (instrumented systems) side we use a "measurement stream", which is a combination of sensor (e.g., all air temperature sensors are given a unique ID and also an ID for the part number that they all share), sensor stream (i.e., temperature, pressure, etc.), and named location. NEON's approach isn't necessarily the best framework, but whatever is used, a consistent ontology and a database to track terms and IDs are essential.
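
Purely as an illustration (this is not NEON's actual schema), the composite-key idea could look something like the following in R, where a controlled table of sensor IDs, stream types, and named locations stands in for the ontology database:

```r
# Illustrative only: an IS-style "measurement stream" built as a composite
# key of sensor ID, stream type, and named location.
sensors <- data.frame(
  sensor_id      = c("SN-1001", "SN-1002"),
  part_number    = c("PT-AIRTEMP-01", "PT-AIRTEMP-01"),
  stream         = c("air_temperature", "air_temperature"),
  named_location = c("TOWER.TOP", "TOWER.MID")
)

# A tracked vocabulary of IDs stands in for the ontology/terms database.
sensors$measurement_stream <- paste(
  sensors$sensor_id, sensors$stream, sensors$named_location, sep = "|"
)

# Observations then reference the stream ID rather than repeating the pieces.
obs <- data.frame(
  measurement_stream = sensors$measurement_stream[1],
  timestamp = as.POSIXct("2018-06-01 12:00:00", tz = "UTC"),
  value     = 23.4
)
```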