IMCR-Hackathon / Hackathon-Central-2019

Command center for 2019 hackathon participants to share ideas, coordinate teams, develop projects and access all logistics information

Project #1: Use metajam to import data and metadata #6

Closed: clnsmth closed this issue 5 years ago

clnsmth commented 5 years ago

The metajam R package reads data and parses metadata into the R environment for most data packages available in the DataONE Network. Using metajam saves us the effort of writing our own parsing algorithms and enables us to use several metadata standards (not only EML).
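Here is a minimal sketch of that workflow as I understand it (the URL is just a placeholder for a real DataONE data object, and the element names in the returned list are from memory):

```r
# install.packages("metajam")  # or remotes::install_github("NCEAS/metajam")
library(metajam)

# Placeholder URL for a single data object in the DataONE network
data_url <- "https://cn.dataone.org/cn/v2/resolve/<some-data-pid>"

# Download the data file and its metadata into a local folder
data_folder <- download_d1_data(data_url, path = tempdir())

# Read the data plus the parsed metadata back into R as a named list
pkg <- read_d1_files(data_folder)
str(pkg, max.level = 1)  # e.g. $data, $summary_metadata, $attribute_metadata
```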

Note that metajam doesn't support data ingestion use cases 1 and 2 proposed by @lkuiucsb; those will require their own implementations.

atn38 commented 5 years ago

This would be a great first step for our project! I'm all for leveraging the work that's gone into EML records as much as possible. I'm trying metajam out and it works pretty well, save for some hiccups. If we scope out some potential issues now, there's a chance the NCEAS folks can resolve them before the hackathon 💯

Edit: what matters most for our purposes is probably parsing column-level information. The EML R package has a function to read the attributeList from an EML document, which metajam uses under the hood. (I'm having trouble installing metajam, so judging just from their code, it doesn't seem like they have implemented import from metadata standards other than EML yet.)
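Something roughly like this, I think (untested on my end given the install trouble, so the function names are from the EML 2.x docs and the access path is only illustrative):

```r
library(EML)

# Read an EML record (the file path is a placeholder)
eml <- read_eml("metadata.xml")

# Pull the column-level (attribute) metadata for a data table; the exact
# nesting varies by document, so this access path is illustrative only
attrs <- get_attributes(eml$dataset$dataTable$attributeList)
attrs$attributes  # data frame with attributeName, definition, unit, etc.
```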

brunj7 commented 5 years ago

@clnsmth and @atn38 you should consider contributing to metajam as a potential output of your hackathon; it is an open project and collaborations are welcome. The motivation for this package comes partly from the LTER synthesis working group.

In phase 1 of development, we focused on making the download as reliable as possible across the different types of URLs (DataONE, repository-specific, ...), while keeping the function as simple and generic as possible.

Summary stats and checks leveraging EML are definitely in scope for tabular data. We also discussed leveraging units (this vignette is an attempt to define what that could look like: https://nceas.github.io/metajam/articles/dataset-batch-processing.html) and have been in talks with the developers of the units package. Units in EML 2.2 are going to help with this, but there will still be all the legacy metadata to take care of.
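As a rough idea of what that could look like (the EML-to-udunits mapping below is a hand-written illustration, nothing implemented in metajam yet):

```r
library(units)

# Hand-written illustration of an EML unit name -> udunits symbol mapping;
# a real implementation would need to cover the full EML unit dictionary
eml_to_udunits <- c(meter = "m", celsius = "degC", gramsPerSquareMeter = "g/m^2")

# Attach a unit to a column based on its EML attribute metadata
depth <- c(0.5, 1.2, 3.7)
depth <- set_units(depth, eml_to_udunits[["meter"]], mode = "standard")
depth  # now carries the unit [m]
```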

@atn38 please file an issue regarding the installation challenges. I need to change some dependencies now that the EML 2.0 package is on CRAN, so I can look into this at the same time.

atn38 commented 5 years ago

@brunj7 thanks for the comments, I'll look into it. Re: installation, it seems to be more of an issue I've had for a week now with GitHub packages in general, while installing from CRAN works fine. If I can isolate the problem to something metajam-specific, I will be sure to report it.

clnsmth commented 5 years ago

Hi @brunj7. Thanks for this invitation to collaborate on metajam, and thanks @atn38 for pointing out that this package doesn't support metadata standards other than EML. @brunj7, is support for other standards on the project roadmap?

brunj7 commented 5 years ago

@clnsmth supporting the different metadata standards present within the DataONE federation is on the roadmap; however, it is not a top priority at this point (we are currently focusing more on handling different file formats, such as GeoTIFF, JSON, ...). Do you have specific metadata and/or datasets in mind?

clnsmth commented 5 years ago

Thanks for this info @brunj7. We don't have a specific metadata format or dataset in mind; we're just considering the extensibility of metajam in the context of its suitability for reading data and metadata into the visualization app we'll be developing. IMO it's perfect for our use case.

Where may I find a list of currently supported data types and a prioritized list of "to be" supported data types? This info could help direct our development efforts, both on the visualization app and on metajam (if we have the bandwidth).

brunj7 commented 5 years ago

@clnsmth our approach is to group supported data file types by R package, since users can swap in the read function used to read the files. The download function is agnostic to the file type, although the metadata parsing could be improved for some object types.

So far, we have tested on:

The next types we want to test/support are focused on geospatial data:

I think the handling of geospatial metadata is something worth improving in the download_d1_data function.

In the medium term, our main goal is to move towards detecting the file type using the DataONE formats (https://cn.dataone.org/cn/v2/formats) and swapping the default function used by read_d1_files accordingly.
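Very roughly, something like this (the format-to-reader table is hypothetical, and I'm assuming an argument on read_d1_files for overriding the reader):

```r
library(metajam)

# Hypothetical lookup from a DataONE formatId to a reader function name
format_readers <- c(
  "text/csv"      = "read_csv",  # readr
  "image/geotiff" = "raster"     # raster package (or terra)
)

# Placeholder folder from a previous download_d1_data() call
data_folder <- "path/to/downloaded/folder"

# Pick the reader for this object's formatId and hand it to read_d1_files;
# the fnc argument name is an assumption on my part
format_id <- "image/geotiff"
pkg <- read_d1_files(data_folder, fnc = format_readers[[format_id]])
```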

clnsmth commented 5 years ago

We did this.