NCEAS / metajam

Bringing data and metadata togetheR
https://nceas.github.io/metajam/
Apache License 2.0

Adding support of Gulf of Alaska Data Portal #132

Closed brunj7 closed 7 months ago

brunj7 commented 3 years ago

the LTER NGA site is using a specific data repository: https://gulf-of-alaska.portal.aoos.org/#

We need to add support for it to metajam. This data repository is part of DataONE, so we can still rely on the DataONE API to access the content. The biggest change is that this repository does not support the EML metadata standard.

We will thus need to:

Data for test: https://doi.org/10.24431/rw1k45w

This should be done on the dev branch.

kristenpeach commented 3 years ago

Progress

I am going to do a little exploring in RStudio on my computer, but I will work on the dev branch once I have a sense of where the changes need to be made.

The first error I run into is when I try to use the download_d1_data() function (which I expected). I am just going to take some notes here to keep track of my thought process.

I am used to just copying the URL of the dataset's download button. But for this one the metadata is in a separate file from the csv file, so I think I need to download multiple files. In any case, it looks like changes need to be made to read_d1_files.R; that is where it makes sense to insert an if statement about the metadata type. Similarly, we need to make a function analogous to tabularize_eml() for this other type of metadata.

I see that someone has added a 'To do' line to the 'download_d1_data.R' file:

```r
download_d1_data <- function(data_url, path, dir_name = NULL) {

# TODO: add meta_doi to explicitly specify doi
```

This seems like it would solve some of the problem because we could insert the doi/url to the metadata doc separate from the data.

I am revisiting the metadata.R file from the arcticdatautils package because I know that there are some creative functions in there for XML/GMD files.

kristenpeach commented 3 years ago

Progress

I'm having a pretty hard time knowing where to start with this, and I'm getting the feeling I am making it harder than it needs to be. The metadata file is a gmd file (format type = http://www.isotc211.org/2005/gmd), so I assume I need to change multiple files in the metajam package to allow files with that format type, and maybe all of the other gmd format types as well?

Screen Shot 2021-04-30 at 2 07 48 PM

Will add any breakthroughs later in the day

mbjones commented 3 years ago

@kristenpeach Those other format types represent different variants of the ISO 19115 family that are in use specifically for NOAA and Pangaea. You should be able to work against vanilla gmd metadata and have it work for all three, but note that the other two have additional changes that make them not validate under the original schema. But they are 99% the same.

One challenge you will likely have with gmd is that it doesn't generally represent entity- and attribute-level metadata in the same way as EML, and it's not really built to support multiple data entities (e.g., tables, raster images) in a single ISO document. We've been discussing this with respect to how we do metadata completeness checking in MetaDIG, and there are no easy answers. A lot of ISO documents describe a Dataset as a whole without providing the details needed to parse the data files. Happy to discuss on Slack if you'd like.

kristenpeach commented 3 years ago

Progress

Explored the use of the geometa package (https://github.com/eblondel/geometa/wiki) to convert non-eml to eml within the metajam package. If this works well we can just insert an if statement early on so that metajam can convert any non-eml to eml and then proceed normally. Here is an excerpt from the geometa package Wiki on its ability to convert metadata:

"4.3 Convert metadata objects from/to other metadata languages (mapping) geometa offers the capacity to convert objects from/to other metadata languages. The object is to provide a generic interoperable mechanism to convert R metadata objects from one metadata standard to another.

At now the focus was given on the mapping between ISO/OGC metadata (modeled in R by geometa) covering core business metadata elements with two widely used metadata formats which are:

NetCDF-CF Conventions - Climate and Forecast conventions - (modeled in R with ncdf4) EML (Ecological Metadata Language) (modeled in R with EML and emld)"

kristenpeach commented 3 years ago

Progress

Julien helped me figure out where to start. I am going to work on the download_d1_data function rather than the download_d1_data_pkg function. The gist of the issue: if the XML file in the package is not EML, the download_d1_data() function (paired with the read_d1_files function, as in the download-single vignette of metajam) produces a list of length 2 containing summary_metadata and data. In comparison, when the same two functions are applied to data associated with an XML file written in EML, they produce a list of length 3 containing attribute_metadata, summary_metadata, and data. So we need to make the metajam functions produce attribute metadata for gmd XML files.

This line of the download_d1_data.R file feels like a good branch point to determine whether an XML doc is EML or gmd. We need to determine the class of the meta_obj before passing it to as_emld:

```r
eml <- tryCatch({emld::as_emld(meta_obj, from = "xml")},  # Identify XML file and use it to make EML object
                error = function(e) {NULL})
```

When I use download_d1_data_pkg() to retrieve the file 'nga_SKQ201810S_seabird_survey_data_L0.csv' from the sample package, it can find the metadata and says that it is EML. That is because the as_emld function coerces any input into an emld object. As we predicted, it just fails to produce attribute-level metadata, but it parses the data and summary metadata fine.


```r
# Directory to save the data set
path_folder2 <- "DataOne_test2"

# URL to download the dataset from DataONE
data_url <- "https://cn.dataone.org/cn/v2/resolve/81b1aecf-329a-48d4-b706-a39c1607e067"

# Create the local directory to download the datasets
dir.create(path_folder2, showWarnings = FALSE)

# Download the dataset and associated metadata
data_folder <- metajam::download_d1_data(data_url, path_folder2)

example_data <- metajam::read_d1_files(data_folder)

example_data$summary_metadata$name

example_data$data$Species
```

So before I use the as_emld function I need to insert a split where I determine the class of the metadata object. I should be able to use the xml2 package to determine the type that is declared at the top of the XML document.
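As a sketch of what that split could look like (this is an assumption, not metajam code: it keys off the root element and declared namespaces rather than a DOCTYPE, and the helper name `detect_metadata_standard` is made up):

```r
library(xml2)

# Hypothetical helper: classify a metadata document as EML or ISO by its
# root element name and declared namespaces.
detect_metadata_standard <- function(meta_xml_string) {
  doc <- read_xml(meta_xml_string)
  ns  <- as.character(xml_ns(doc))
  if (xml_name(doc) == "eml" || any(grepl("ecoinformatics.org/eml", ns))) {
    "eml"
  } else if (any(grepl("isotc211.org", ns))) {
    "iso"
  } else {
    "unknown"
  }
}
```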

I am getting a little hung up on testing geometa, because to use geometa's convert_metadata() function you need to input the appropriate format id of the object and then the format id of what you want it to be. There is a helper function in geometa called getMappingFormats() to help you pick the right format id, but it returns NULL, and there is no R script for it in the geometa package on GitHub, so I can't poke around to find what it is supposed to return. I could certainly make an educated guess, but it looks like this function is still in development and not ready for use (https://rdrr.io/github/eblondel/geometa/man/convert_metadata.html).

mbjones commented 3 years ago

@kristenpeach the code for getMappingFormats() in the geometa package is at https://github.com/eblondel/geometa/blob/master/R/geometa_mapping.R#L130

kristenpeach commented 3 years ago

Thank you! @mbjones

kristenpeach commented 3 years ago

Progress

I tried various configurations of inserting geometa into the existing download_d1_data.R code at the diversion point I discussed in the comment above. Before I fuss with it further, here is what I tried.

If I replace this:

```r
# eml <- tryCatch({emld::as_emld(meta_obj, from = "xml")},  # Identify XML file and use it to make EML object
#                 error = function(e) {NULL})
```

with this:

```r
out_eml <- geometa::convert_metadata(meta_obj, from = "geometa|iso-19115-1", to = "eml",
                                     mappings = geometa::getMappings(), verbose = FALSE)

eml <- emld::as_emld(out_eml)
```

And then run the whole download_d1_data.R with a data_url for a dataset I know has that geometa|iso-19115-1 XML file type, it feels like I should be getting attribute data in the output. Based on the wiki, I am not sure whether I need to generate the data.frame with the metadata mapping rules for this conversion, or whether this is one of the conversions that already has mapping rules built into the function (which would make me expect ~100% coverage). Going to revisit in an hour with a fresh brain.

Screen Shot 2021-05-05 at 4 59 38 PM

kristenpeach commented 3 years ago

Progress

I realized you need to set pretty = FALSE to get the getMappingFormats() function to work: geometa::getMappingFormats(pretty = FALSE) will show the available metadata formats that geometa can convert. Based on that function, there are two flavors of ISO XML supported by geometa: 'geometa|iso-19115-1' (row 2 of the table above) and 'geometa|iso-19115-2' (row 3 of the table above).

For the example data package I have been using (https://search.dataone.org/view/10.24431/rw1k45w), the file named "Metadata: Marine bird survey observation and density data from Northern Gulf of Alaska LTER cruises, 2018" has a listed file type = http://www.isotc211.org/2005/gmd. The first line of the XML doc is `gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"`. Does the "2005" element indicate that it corresponds to the XML format id in row 1 of the table above? Because that format does not appear in the acceptable formats listed by geometa::getMappingFormats() (perhaps because this function is still under development?). Alternatively, if the namespace is gmd, shouldn't it be the format type of row 2 of that table? Is it correct to assume that the "Supported" column of that table indicates the number of elements that can be converted? I think that's right. It looks like geometa is capable of mapping a lot of the attribute-level data we would want for gmd files. To convince yourself of this, you can run geometa::getMappings() or go to https://github.com/eblondel/geometa/blob/master/inst/extdata/coverage/geometa_coverage_inventory.csv.

Description of the convert_metadata function from the geometa package:

```r
#' @description \code{convert_metadata} is a tentative generic metadata converter to
#' convert from one source object, represented in a source metadata object model in R
#' (eg eml) to a target metadata object, represented in another target metadata object
#' model (eg \pkg{geometa} \code{\link{ISOMetadata}}). This function relies on a list of
#' mapping rules defined to operate from the source metadata object to the target metadata
#' object. This list of mapping rules is provided in a tabular format. A version is embedded
#' in \pkg{geometa} and can be returned with \code{\link{getMappings}}
```

It feels like the problem lies in the mappings parameter of the convert_metadata function. The table returned by geometa::getMappings() does not have column names that correspond to the format IDs you can pass as the other parameters. So maybe it's not connecting the `from = "geometa|iso-19115-1"` parameter in the function to the 'geometa' column in the getMappings() table? Or I am totally wrong and it's just not running because of some coding error on my part. Another idea: the line before the lines I am changing in download_d1_data is `meta_obj <- dataone::getObject(d1c@mn, meta_id)`. I know that this line runs correctly for both the EML file I am testing and the XML file I am testing. The getObject function returns a meta_obj in raw format, so I think the convert_metadata function is expecting an XML object with formatting rather than raw. I will experiment with that.

```r
out_eml <- geometa::convert_metadata(meta_obj, from = "geometa|iso-19115-1", to = "eml",
                                     mappings = geometa::getMappings(), verbose = FALSE)
```

The code for the convert_metadata function begins at Line 720: https://github.com/eblondel/geometa/blob/master/R/geometa_mapping.R#L130

kristenpeach commented 3 years ago

Progress

I tried to give the convert_metadata function xml files in different formats to see if it would work. Realized I don't understand what most of the functions in the XML package actually do so spent some time trying to understand them. Had to shift to other projects mid-day to avoid throwing my computer out the window. I will return to it re-energized from the weekend on Monday!

https://www.youtube.com/watch?v=1cM_ZNZ9hhE

http://www.cse.chalmers.se/~chrdimi/downloads/web/getting_web_data_r4_parsing_xml_html.pdf

https://www.rdocumentation.org/packages/xml2/versions/1.3.2

kristenpeach commented 3 years ago

Progress

I tested the convert_metadata() function with an EML metadata object and converted it to ISO. It worked fine. That confirms what I thought about the function failing because my input for the metadata parameter was a meta_obj in raw format. In the example below, polaris17_permafrost is a package of data, summary metadata, and attribute metadata pulled from the Arctic Data Center (EML). So the pivot point in the code (where we ask it to determine whether the XML file is EML or ISO) needs to be before the creation of the meta_obj, not after.

```r
test_meta_obj_eml <- polaris17_permafrost$attribute_metadata

out_eml <- geometa::convert_metadata(test_meta_obj_eml, from = "eml", to = "geometa|iso-19115-1",
                                     mappings = geometa::getMappings(), verbose = FALSE)

test2_meta_obj_eml <- polaris17_permafrost$summary_metadata

out_eml2 <- geometa::convert_metadata(test2_meta_obj_eml, from = "eml", to = "geometa|iso-19115-1",
                                      mappings = geometa::getMappings(), verbose = FALSE)
```

The XML file I pass to convert_metadata cannot be in raw format, which is the default of dataone::getObject. I converted it to an ISO XML object and now convert_metadata() runs, but the output object has mostly empty fields. That should be disappointing, but I am happy I got somewhere. Will continue in this direction for the rest of the day.

```r
meta_df <- rawToChar(dataone::getObject(d1c@mn, meta_id2))
meta_iso_xml <- XML::xmlTreeParse(meta_df)
metadata_nodes2 <- dataone::resolve(cn, meta_id2)

out_eml <- geometa::convert_metadata(meta_iso_xml, from = "geometa|iso-19115-1", to = "eml",
                                     mappings = geometa::getMappings(), verbose = FALSE)

eml <- emld::as_emld(out_eml)
```

kristenpeach commented 3 years ago

I think the trick will be using xmlToDataFrame() to make an object that is just the metadata node of the ISO meta_obj and inputting that to convert_metadata()

kristenpeach commented 3 years ago

More Progress

I was able to get a 'flat' data frame version of ISO XML to use as an input for convert_metadata() which seems to be the format it wants.

```r
# fxml_importXMLFlat() comes from the flatxml package
xml.dataframe <- fxml_importXMLFlat("https://cn.dataone.org/cn/v2/resolve/2012b3a7-f6b0-4e46-b2fa-63bf4ae6ba25")

out_eml <- geometa::convert_metadata(xml.dataframe, from = "geometa|iso-19115-1", to = "eml",
                                     mappings = geometa::getMappings(), verbose = FALSE)

eml <- emld::as_emld(out_eml)
```

The convert_metadata function runs and produces all of the same elements it did when I tried it out in the reverse direction (eml to ISO) but it was basically just a big empty nested list. When I run the geometa_mapping.R doc it produces several versions of this warning message: "in method for ‘coerce’ with signature ‘"ISOMetadata","emld"’: no definition for class “ISOMetadata”". I still feel like I have a better sense of where the problem is than I did this morning though.

kristenpeach commented 3 years ago

@mbjones Have you gotten the geometa::convert_metadata() function to successfully convert ISO to EML? Or is that why you were talking with the maintainer eblondel? Whenever I try it, it fails to identify/map attribute-level metadata. I assume this is because of the differences in how attribute-level information is stored in gmd vs. EML. You warned me that I should only expect a partial translation using convert_metadata(); I just want to make sure this is what you meant.

mbjones commented 3 years ago

Hi @kristenpeach I have not tried it. I suspected that the conversion was incomplete via a quick scan of the documentation and code. My earlier conversations with the maintainer were about our contributing to the conversion, which we weren't able to do at the time. You are the first person I know of who has tried this extensively. You might find others who have used it through either the #eml channel on the NCEAS Slack or the #im channel on the LTER Slack.

kristenpeach commented 3 years ago

Progress

To recap: we determined that lines 86-92 of the download_d1_data.R file are generally where the function starts to fail when the input is a data_url for data in this repo (https://search.dataone.org/view/10.24431/rw1k45w) or any repo that uses non-EML metadata. For the example below I used the data_url for the file named 'nga_TGX201809_seabird_processed_densities_L1.csv'. This code shows the method that produces the most complete eml object(s) so far (from non-EML metadata). None of them are great, but I think working from the meta_iso_xml object may be the most straightforward.

https://github.com/kristenpeach/metajam/blob/master/reprex_iso_xml_to_eml_GMD.R

I keep thinking we should be able to use arcticdatautils::pid_to_eml_entity() (https://github.com/NCEAS/arcticdatautils/blob/main/R/eml.R) because, the way it's written, it should work for any DataONE object. I have only used it for the ADC, so I have always set the member node to 'adc'. So maybe if I set the DataONE member node (https://www.dataone.org/network/#list-of-member-repositories) to EITHER 'LTER Network Member Repository' or the 'Alaska Ocean Observing System', where this example data is originally housed (https://gulf-of-alaska.portal.aoos.org/), I might be able to use this function or something similar to it? Feels like there may at least be some good clues in the eml.R doc of arcticdatautils.

Asked if anyone has used the convert_metadata() function in the eml NCEAS slack channel

kristenpeach commented 3 years ago

Progress

After chatting with Jeanette and Bryce on Slack, I think we may need to reassess our goals for making metajam work with non-EML metadata. We talked about me adding an issue to geometa to make sure I was not using convert_metadata() incorrectly or passing it parameters of the wrong class. But it looks like an issue for our problem (or a very similar one) already exists (https://github.com/eblondel/geometa/issues/169); it was posted in June 2020.

One option would be to make a totally new function, analogous to download_d1_data.R but specific to ISO XML (or even specific to the flavor of ISO used by the repository of the one LTER site that wants to use it). Bryce's efforts with dataspice (https://github.com/ropenscilabs/dataspice#convert-to-eml) are probably a good place to start. We may have to use create_spice() to prompt ISO metajam users to do some manual entry for the attributes. That feels a little clunky, but if the primary immediate goal is to get it working for this one LTER site, maybe it would be OK.

brunj7 commented 3 years ago

@kristenpeach Thank you for all the investigation on this and the reprex! This is all good progress.

I agree that we should probably focus on mapping what we can into the summary metadata table metajam produces and set the rest to NA. Some of the info comes from the D1 API (see below), so that should be OK. So I think getting the title, an abstract, and a contact person (name) from meta_iso_xml would already be great.

| Field | Provenance |
| --- | --- |
| Metadata_ID | D1 |
| Metadata_URL | D1 |
| Metadata_EML_Version | Metadata |
| File_Description | Metadata |
| File_Label | Metadata |
| Dataset_URL | D1 |
| Dataset_Title | Metadata |
| Dataset_StartDate | Metadata |
| Dataset_EndDate | Metadata |
| Dataset_Location | Metadata |
| Dataset_WestBoundingCoordinate | Metadata |
| Dataset_EastBoundingCoordinate | Metadata |
| Dataset_NorthBoundingCoordinate | Metadata |
| Dataset_SouthBoundingCoordinate | Metadata |
| Dataset_Taxonomy | Metadata |
| Dataset_Abstract | Metadata |
| Dataset_Methods | Metadata |
| Dataset_People | Metadata |
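A minimal sketch of what pulling a few of those fields out of an ISO (gmd) document could look like, using xml2 XPath. The XPath expressions below are my guesses at typical gmd paths (they are assumptions, not verified against every repository), and fields that are absent come back as NA:

```r
library(xml2)

# Hypothetical helper: `doc` is the parsed ISO metadata, e.g. read_xml(meta_raw)
iso_summary <- function(doc) {
  get1 <- function(xp) {
    node <- xml_find_first(doc, xp)
    if (inherits(node, "xml_missing")) NA_character_ else xml_text(node, trim = TRUE)
  }
  list(
    Dataset_Title    = get1("//gmd:identificationInfo//gmd:citation//gmd:title/gco:CharacterString"),
    Dataset_Abstract = get1("//gmd:identificationInfo//gmd:abstract/gco:CharacterString"),
    Dataset_People   = get1("//gmd:contact//gmd:individualName/gco:CharacterString")
  )
}
```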

Then downloading the data should be fine, since we are getting the filename from the sys metadata. As an example:

```r
## Set Nodes ------------
data_id <- "a0b7cf1a-bdbf-407e-be30-4c4ebd7d2dfc"
data_nodes <- dataone::resolve(dataone::CNode("PROD"), data_id)
d1c <- dataone::D1Client("PROD", data_nodes$data$nodeIdentifier[[1]])
cn <- dataone::CNode()

data_sys <- suppressMessages(dataone::getSystemMetadata(d1c@cn, data_id))
data_name <- data_sys@fileName
out <- dataone::downloadObject(d1c, data_id, path = "~/Desktop")
```

So the existing code should work.

kristenpeach commented 3 years ago

Thank you @brunj7 !

kristenpeach commented 3 years ago

Progress

Fields in the summary metadata that are not produced by as_emld: File_Description, Dataset_StartDate, Dataset_EndDate, Dataset_Location, Dataset_WestBoundingCoordinate, Dataset_EastBoundingCoordinate, Dataset_SouthBoundingCoordinate, Dataset_NorthBoundingCoordinate, and Dataset_Methods. The great news is that as_emld does a great job of getting the most important summary metadata without any help. And each of these missing features HAS a corresponding field in ISO; they are just not exact matches. The Dataset_Methods section has multiple possible inputs, so I picked the one that made the most sense to me.

If you run this function (https://github.com/kristenpeach/metajam/blob/master/R/download_ISO_data.R) and then run this code you should see a more complete metadata output:

```r
path_folder <- "DataOne_ISO_test"

data_url <- "https://cn.dataone.org/cn/v2/resolve/a0b7cf1a-bdbf-407e-be30-4c4ebd7d2dfc"

dir.create(path_folder, showWarnings = FALSE)

data_folder <- download_ISO_data(data_url, path_folder)

example_data <- metajam::read_d1_files(data_folder)
```

*Note: you will also need metajam's utils functions and its check_version() function, so metajam itself would need to be installed too.

kristenpeach commented 3 years ago

Progress

I worked on a few other projects today, so not a ton of progress, but some. It seems like ISO is really customizable, so it's possible this won't work for other data packages. I went looking for another dataset from the Alaska Ocean Observing System member node with non-EML metadata so I could test out my mini function, and it looks like a lot of the other packages use EML, which is good. I found a package that uses ISO metadata (https://search.dataone.org/view/10.24431%2Frw1k57t) and the function worked 99% as expected on it. It found the Dataset_WestBoundingCoordinate, Dataset_SouthBoundingCoordinate, and Dataset_NorthBoundingCoordinate, but not the Dataset_EastBoundingCoordinate? Will poke around to figure out whether that is a function problem or a metadata problem.

```r
path_folder <- "DataOne_ISO_test2_research_workspace"
data_url <- "https://cn.dataone.org/cn/v2/resolve/16c5847d-a2e4-435f-b174-cb81f9d35568"
dir.create(path_folder, showWarnings = FALSE)
data_folder <- download_ISO_data(data_url, path_folder)
example_data <- metajam::read_d1_files(data_folder)
```

After inspecting the metadata, it does look like the east bounding coordinate is actually missing (rather than the function failing to find it), so the mini function worked exactly as I hoped it would. After looking at the raw metadata, I am less sure that I picked the right entry for the Dataset_Methods field. Or rather, there are really multiple fields that should be concatenated together to populate that cell. To see what I mean, run the code above and then create the parsed XML to inspect: `meta_iso_xml <- XML::xmlTreeParse(meta_raw)`. Tomorrow I will try a few other non-EML datasets and maybe test a method for concatenating methods descriptions to provide a more complete overview in the summary_metadata output.
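For the concatenation idea, a sketch using xml2 (the gmd lineage path is my assumption about where methods statements typically live, not verified against these datasets) that joins every lineage statement it finds into one Dataset_Methods string:

```r
library(xml2)

# Hypothetical helper: `doc` is the parsed ISO metadata, e.g. read_xml(meta_raw)
concat_iso_methods <- function(doc) {
  nodes <- xml_find_all(doc, "//gmd:lineage//gmd:statement/gco:CharacterString")
  if (length(nodes) == 0) return(NA_character_)
  paste(xml_text(nodes, trim = TRUE), collapse = "\n\n")
}
```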

kristenpeach commented 3 years ago

Progress

I was trying to use my download_ISO_data.R function on a few other ISO data packages and kept getting this error at the getObject() point in the function:

```r
meta_obj <- dataone::getObject(d1c@mn, meta_id)
# Error in .local(x, ...) : get() error: Error in curl::curl_fetch_memory(url, handle = handle):
#   server certificate verification failed. CAfile: none CRLfile: none
```

I tried it in my local R as well as in R on Aurora and got the same error. From my Googling, it seems like this may be something I can fix, but it may also be a dataone-package-level thing. I will update on other progress during our meeting tomorrow.

brunj7 commented 3 years ago

@gothub Have you encountered this problem before with the R dataone package?

mbjones commented 3 years ago

Just a guess here, but it's most likely an expired SSL certificate on the member repository. Given that you are retrieving a metadata object, and the DataONE CN keeps a copy of all metadata objects, you could fail over to the CN by trying dataone::getObject(d1c@cn, meta_id) and see if that retrieves the object. The other option is to use resolve to get the list of locations where the object is replicated, and try retrieving it from each until one succeeds.
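That resolve-and-retry idea could be sketched like this (a hypothetical helper, not part of metajam, assuming dataone::resolve(), dataone::getMNode(), and dataone::getObject() behave as documented; it needs a live network to actually run):

```r
# Hypothetical failover: try each member node where the object is replicated,
# then fall back to the CN (which holds a copy of all metadata objects).
get_object_with_failover <- function(cn, id) {
  locs <- tryCatch(dataone::resolve(cn, id)$data$nodeIdentifier,
                   error = function(e) character(0))
  for (node_id in unique(locs)) {
    obj <- tryCatch(dataone::getObject(dataone::getMNode(cn, node_id), id),
                    error = function(e) NULL)
    if (!is.null(obj)) return(obj)
  }
  tryCatch(dataone::getObject(cn, id),
           error = function(e) stop("Could not retrieve ", id, " from any node"))
}
```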

kristenpeach commented 3 years ago

Progress

Thank you for the help @mbjones ! When I run metadata_nodes <- dataone::resolve(cn, meta_id) I can see where it's replicated but I have not tried accessing it from those nodes yet.

Made a plan with Julien for next steps. I changed Metadata_EML_Version to Metadata_ISO_Version and populated that field using the meta_ISO_xml object. I realized it wasn't being found correctly from the meta_obj because as_emld basically overrides any other metadata language: it squishes everything into EML format and then (correctly, from its point of view) lists the metadata format as EML. I used the data_name and data_extension to fill the File_Label and File_Description. Those might not be what I should be using to populate those fields, but I will have to run my EML example again to see what should actually go there. Across all of these fields, though, I am feeling good about how much information we can give the user.

Screen Shot 2021-05-20 at 4 27 04 PM

I'm a little surprised that it did not successfully find the Dataset_Location for this example, so I want to make sure I am using the best/broadest ISO XML location element for this feature. Also, you can see that it accidentally snatches up some extra text for certain fields (the value for Dataset_People begins with "template"). I don't think this is a big deal, but I will keep an eye on it when I try other ISO datasets to make sure it's not scooping up too much extra stuff.

To Do

mbjones commented 3 years ago

@kristenpeach the breakdown of metadata formats by repository on DataONE is a simple facet query:

https://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA&fl=identifier,formatId&facet=true&facet.pivot=datasource,formatId&rows=0&wt=json

In those current results, check out ARCTIC and KNB for some nodes that support multiple types.

kristenpeach commented 3 years ago

Progress

Thank you Matt!

I ended up removing the field for taxonomic coverage. I think that particular keyword field happened to have taxonomic information for the dataset I was looking at, but that was not going to be true for other datasets. It feels better to have it blank than incorrect. I don't think File_Label is actually supposed to be the data file's format type, but I need to run a few more EML examples (it was blank on the first one I tried) to figure out exactly what it IS supposed to be. I made a provenance table like we talked about. This isn't a final version, but it's kind of what I was thinking:

Screen Shot 2021-05-21 at 6 17 14 PM

kristenpeach commented 3 years ago

Progress

Changed the Metadata_EML_Version/Metadata_ISO_Version feature to just Metadata_Version so that if anyone wanted to merge summary metadata tables produced by ISO and EML datasets they could do so (great idea Julien).

Julien also suggested that instead of keeping my work as its own ISO-specific function, we should slice the existing download_d1_data.R function in half and add an if statement: if the metadata is in ISO, run function X; if it is in EML, run function Y. We agreed that those new language-specific functions wouldn't be exported to external users and would remain internal. The place it makes sense to me to do that is after this line:

```r
meta_raw <- rawToChar(meta_obj)
```

The meta_raw object produced by EML metadata includes the string "eml://ecoinformatics.org/eml". The meta_raw object produced by ISO metadata includes the string "http://www.isotc211.org/". I know that there are other types of ISO that may not play nicely with this "ISO-specific" function I wrote, but I will try to find a few of the different ISO formats and run them to see if it totally fails. So something like this will go into the existing download_d1_data.R function, and then I will break off the ISO- and EML-specific tasks into their own functions:

```r
if (grepl("eml://ecoinformatics.org/eml-", meta_raw) == FALSE) {
  warning("Metadata is in ISO format")
  new_dir <- download_ISO_data(meta_raw)  # add ISO function here
} else {
  warning("Metadata is in EML format")
  new_dir <- download_EML_data(meta_raw)  # add EML function here
}
```

kristenpeach commented 3 years ago

Progress

Here is the new "short" version of the download_d1_data.R function we talked about: https://github.com/kristenpeach/metajam/blob/master/R/SMALL_download_d1_data_KPEACH.R

And the two new functions that are called within that function: https://github.com/kristenpeach/metajam/blob/master/R/download_ISO_data_KPEACH.R https://github.com/kristenpeach/metajam/blob/master/R/download_EML_data_KPEACH.R

I've been having certificate issues again (Error in curl::curl_fetch_memory(url, handle = handle) : SSL certificate problem: certificate has expired), which makes it hard to test, but I will keep working on it. There are a lot of places where it can and needs to be improved with tryCatch() and stopifnot().

kristenpeach commented 3 years ago

Progress

Added a few new use cases to the test of download_d1_data.R: https://github.com/kristenpeach/metajam/blob/master/tests/testthat/test-SMALL_download_d1_data.R

Tried to figure out the certificate issue but did not make a lot of headway. I can see that my test dataset is replicated in KNB but when I try to pull from KNB instead it also doesn't work.

kristenpeach commented 3 years ago

Progress

Working on adding more tryCatch() to the functions, but not so much that errors or incorrect inputs sail through the function without stopping. http://adv-r.had.co.nz/Exceptions-Debugging.html

kristenpeach commented 3 years ago

Progress

I made some improvements to the functions and started writing a loop that will query, in turn, all possible member nodes where the metadata object lives. We don't want future users to hit these certificate issues.

Also is this related to my certificate issue?: https://rdrr.io/cran/dataone/f/vignettes/v07-known-issues.Rmd

I was able to use dataone::getObject() perfectly fine today without certificate errors, so I still don't totally get what happened, but I would like to make metajam more resilient to those types of hiccups with this loop.

kristenpeach commented 3 years ago

Progress

Got a working loop for the multiple member nodes. The second code chunk is a minimum reproducible example (though you need the metajam::check_version function loaded in your environment, so not totally a reprex): https://github.com/kristenpeach/metajam/blob/master/trying_loop_for_all_mns.Rmd

I need to try this with a few other examples and then actually insert it into the function once I am confident it's going to work on different combinations of nodes.

kristenpeach commented 3 years ago

This is the error message I was talking about that pops up when I try to load the EML package

```
Warning messages:
1: In utils::install.packages("jsonld", repos = "https://cran.rstudio.com/") :
  installation of package ‘jsonld’ had non-zero exit status
2: In utils::install.packages("emld", repos = "https://cran.rstudio.com/") :
  installation of package ‘jsonld’ had non-zero exit status
3: In utils::install.packages("emld", repos = "https://cran.rstudio.com/") :
  installation of package ‘emld’ had non-zero exit status
4: In utils::install.packages("EML", repos = "https://cran.rstudio.com/") :
  installation of package ‘jsonld’ had non-zero exit status
5: In utils::install.packages("EML", repos = "https://cran.rstudio.com/") :
  installation of package ‘emld’ had non-zero exit status
6: In utils::install.packages("EML", repos = "https://cran.rstudio.com/") :
  installation of package ‘EML’ had non-zero exit status
```

I terminated my R session and started fresh and now I am not getting the problem, but I thought I'd document it in case it comes back. I incorporated my MN loop into the main function to try to get rid of the d1c method R was not liking:

```r
data_nodes <- dataone::resolve(dataone::CNode("PROD"), data_id)
d1c <- dataone::D1Client("PROD", data_nodes$data$nodeIdentifier[[1]])
meta_obj <- dataone::getObject(d1c@mn, meta_id)
```

If we want more folks using metajam, removing the D1Client function is probably a good idea anyway, right? The documentation makes it sound like its purpose is more for development and testing: https://rdrr.io/cran/dataone/man/D1Client.html
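A hedged sketch of what that replacement could look like — assuming the dataone package, with `data_id` and `meta_id` already extracted — swapping `D1Client`/`d1c@mn` for `getMNode()` on the coordinating node:

```r
library(dataone)

cn <- CNode("PROD")
data_nodes <- resolve(cn, data_id)

# Instead of d1c <- D1Client("PROD", ...) and d1c@mn:
mn <- getMNode(cn, data_nodes$data$nodeIdentifier[[1]])
meta_obj <- getObject(mn, meta_id)
```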

I made a table of the DataONE member nodes already supported by metajam (ADC, LTER/EDI, and KNB, according to the documentation) and the nodes that will be supported now that we have increased metajam's capacity to handle ISO metadata. Currently 104,320 data packages can be downloaded using metajam (pooled across ADC, LTER, EDI, and KNB). Now that we have made it compatible with ISO metadata, we've added 92,276 more data packages that can be pulled into R by metajam. I got these numbers from the solr query Matt provided (https://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA&fl=identifier,formatId&facet=true&facet.pivot=datasource,formatId&rows=0&wt=json).

I have not actually tested our new and improved download_d1_data.R with the NOAA subtype of ISO ("http://www.isotc211.org/2005/gmd-noaa"), so if we want to be cautious we could exclude that from our count until I test a few datasets with that type of ISO metadata (which I will).
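For reproducibility, the counts can be pulled straight from that solr endpoint in R. A sketch assuming the jsonlite package; the exact shape of `facet_pivot` is my best guess at the solr JSON response, so treat the field access as an assumption:

```r
library(jsonlite)

solr_url <- paste0(
  "https://cn.dataone.org/cn/v2/query/solr/",
  "?q=formatType:METADATA&fl=identifier,formatId",
  "&facet=true&facet.pivot=datasource,formatId&rows=0&wt=json"
)
res <- fromJSON(solr_url)

# One row per member node (datasource), with its total metadata record count
pivots <- res$facet_counts$facet_pivot[["datasource,formatId"]]
head(pivots[, c("value", "count")])
```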

(Screenshot: table of member nodes and data package counts, 2021-06-09)
kristenpeach commented 3 years ago

Progress

Sorry I did not update yesterday! I had been on such a good streak. I had to do some work for the NEON team to help Jeff make data cleaning decisions with that new extra Virginia dataset.

Member nodes now (at least partially) supported by metajam: "urn:node:ARCTIC" (expanded support for ADC), "urn:node:ARM", "urn:node:GRIIDC", "urn:node:IEDA_EARTHCHEM", "urn:node:IEDA_MGDL", "urn:node:IEDA_USAP", "urn:node:NCEI", "urn:node:NKN", "urn:node:NRDC", "urn:node:R2R" , "urn:node:RW" .

AOOS (https://gulf-of-alaska.portal.aoos.org/) does not appear as a member node in the solr query, but Research Workspace does ("urn:node:RW"), so I think I was misunderstanding that. The example ISO dataset I've been using (https://search.dataone.org/view/10.24431/rw1k45w) has a DOI for the Research Workspace member node, but under the Alternate Data Access header it lists the AOOS portal. I just want to make sure I am understanding this correctly. You can find this data package in (at least) 2 places: (1) https://search.dataone.org/view/10.24431/rw1k45w and (2) https://gulf-of-alaska.portal.aoos.org/. Research Workspace is not a distinct member node in the same way that the Arctic Data Center is, because the ADC has its own web interface and Research Workspace does not appear to have one.

I want to be clear that when we say metajam will now work with these member nodes, we mean it will work if the data_url you enter as a parameter is from the DataONE website. This is different from how it previously worked: for example, you can go to EDI, right-click a 'Download Data' button, and pass that URL as the data_url parameter of metajam's download_d1_data function, and it will work perfectly (because it's EML). I have been operating under the assumption that users who want to download our example ISO dataset (or any other ISO dataset) will right-click the 'Download' button on the DataONE website (not the AOOS website) to retrieve the data_url to pass to download_d1_data.R. Does that distinction make sense?

I realized one of the problems with my function: when I run my member node loop to find a valid node, it is not matching up with the data id. Oof.

kristenpeach commented 3 years ago

Progress

Had to take a break from that function, but I tested the functionality of download_d1_data.R using data urls taken directly from the member nodes' own sites (like we talked about with EDI).

metajam (download_d1_data.R) works perfectly on Arctic Data Center data files. It looks like LTER network member node datasets are really only accessible through DataONE, which is great (https://search.dataone.org/view/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%2Fknb-lter-hfr%2F61%2F20), and we know it works well with EDI. So for the EML member nodes, metajam should work fine regardless of where you retrieve the data url from.

Preparing a summary table of which other member nodes have 'duplicated' their data on both DataONE and a node-specific site (like ADC or EDI), where users may try to grab data urls that could cause problems for them.

kristenpeach commented 3 years ago

Today I quit my R session and restarted multiple times and got that same EML package error... absolutely no idea why sometimes I get it and sometimes I don't.

kristenpeach commented 3 years ago

Progress

Investigated which other member nodes have data download links outside of the DataONE web browser, to return to and test once the ISO update to metajam is closer to finished.

Atmospheric Radiation Measurement Data Center ("urn:node:ARM"): DataOne: https://search.dataone.org/view/doi%3A10.5439%2F1027370 External Source for Data URL: https://adc.arm.gov/discovery/#/results/instrument_code::nimfraod1mich

IEDA: Interdisciplinary Earth Data Alliance ("urn:node:IEDA_EARTHCHEM") DataOne: https://search.dataone.org/view/doi%3A10.1594%2FIEDA%2F111453 External Source for Data URL: https://ecl.earthchem.org/view.php?id=1453

IEDA: Marine-Geo Digital Library ("urn:node:IEDA_MGDL") DataOne: https://search.dataone.org/view/http%3A%2F%2Fdoi.org%2F10.1594%2FIEDA%2F500049 External Source for Data URL: http://get.iedadata.org/doi/500049

IEDA: US Antarctic Program Data Center ("urn:node:IEDA_USAP") DataOne: https://search.dataone.org/view/urn%3Ausap-dc%3Ametadata%3A601410 External Source for Data URL: https://www.usap-dc.org/view/dataset/601410

Nevada Research Data Center ("urn:node:NRDC") DataOne: NRDC_NEVCAN_SCIENCE_METADATA_Sheep1_Met_OneMin_2017_01_4045_9745--v6.xml (why is this url so weird?) External Source for Data URL: http://sensor.nevada.edu/

Knowledge Network for Biocomplexity ("urn:node:KNB") DataOne: https://search.dataone.org/view/doi%3A10.5063%2F1R6NZ1 External Source for Data URL: https://knb.ecoinformatics.org/view/doi:10.5063/1R6NZ1

Rolling Deck to Repository ("urn:node:R2R") DataOne: https://search.dataone.org/view/doi%3A10.7284%2F902414 External Source for Data URL: https://www.rvdata.us/search/cruise/ZHNG10RR

NOAA NCEI Environmental Data Archive ("urn:node:NCEI") DataOne: https://search.dataone.org/view/%7B850F24D4-5541-449F-80F6-F39E3DBA1FDD%7D External Source for Data URL: https://accession.nodc.noaa.gov/download/171796

Biological and Chemical Oceanography Data Management Office (BCO-DMO) DataOne: https://search.dataone.org/view/doi%3A10.1575%2F1912%2Fbco-dmo.652124 External Source for Data URL: https://www.bco-dmo.org/dataset/651461

California Ocean Protection Council Data Repository DataOne: https://search.dataone.org/view/urn%3Auuid%3Aa2034915-0a3c-4bb5-877f-90dccde603fc External Source for Data URL: none that I can find

kristenpeach commented 3 years ago

Worked on LTER bibliography stuff today for a break!

kristenpeach commented 3 years ago

Progress

Fixed the 'else if' statement that was making my loop through all MNs messy. Terminated R and restarted several times today to try to get EML and emld to load in RStudio Server, but it was not working, so I moved to my local R and made a lot of progress.

She works!! https://github.com/kristenpeach/metajam

If you load the tidyverse and have the metajam::check_version function, the metajam::utils functions, the download_ISO_data_KPEACH.R function, and the SMALL_download_d1_data_KPEACH.R function loaded, you can run this code and it will work. (I know my function names are ridiculous; my old advisor just had a VERY serious rule that anything we touched had to have our name on it, and it's a hard habit to break.)

```r
path_folder <- "Data_test"
data_url <- "https://cn.dataone.org/cn/v2/resolve/4139539e-94e7-49cc-9c7a-5f879e438b16"
dir.create(path_folder, showWarnings = FALSE)
data_folder <- SMALL_download_d1_data(data_url = data_url, path = path_folder)
```

I need to make it cleaner, add more tryCatch calls, test the EML version, etc., but I'm happy it's finally working. At some point I think this would really benefit from another set of eyes. The member node stuff feels like a bit of a mess... but it works!

kristenpeach commented 3 years ago

Progress

Renamed functions to be compatible with metajam. It looks like I had already made a new test for the download_d1_data function that uses ISO test cases. I added two more test cases that use an EDI data url associated with EML v2.2 metadata (one with a single data table, the other with multiple).

To Do:

kristenpeach commented 3 years ago

Progress

First attempt at a new vignette highlighting the different provenance options for the data url depending on member node. Also took the first steps toward two new use cases (one EML, one ISO) to showcase differences in output. I know we talked about putting that in the Wiki instead, so I'll put part of it here:

"## Summary This vignette aims to showcase a use case using the 2 main functions of metajam - download_d1_data and read_d1_files to download one dataset from the DataOne data repository.

"## Note on data url provenance when using download_d1_data.R

There are two parameters required to run the download_d1_data.R function in metajam. One is the data url for the dataset you'd like to download. You can retrieve this by navigating to the data package of interest, right-clicking on the download data button, and selecting Copy Link Address.

For several DataOne member nodes (Arctic Data Center, Environmental Data Initiative, and The Knowledge Network for Biocomplexity), metajam users can retrieve the data url from either the 'home' site of the member node or the from the DataOne instance of that same data package. For example, if you wanted to download this dataset:

Kelsey J. Solomon, Rebecca J. Bixby, and Catherine M. Pringle. 2021. Diatom Community Data from Coweeta LTER, 2005-2019. Environmental Data Initiative. https://doi.org/10.6073/pasta/25e97f1eb9a8ed2aba8e12388f8dc3dc.

You have two options for where to obtain the data url.

  1. You could navigate to this page on the Environmental Data Initiative site (https://doi.org/10.6073/pasta/25e97f1eb9a8ed2aba8e12388f8dc3dc ) and right-click on the CWT_Hemlock_Diatom_Data.csv link to retrieve this data url: https://portal.edirepository.org/nis/dataviewer?packageid=edi.858.1&entityid=15ad768241d2eeed9f0ba159c2ab8fd5

  2. You could find this data package on the DataONE site (https://search.dataone.org/view/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%2Fedi%2F858%2F1) and right-click the Download button next to CWT_Hemlock_Diatom_Data.csv to retrieve this data url: https://cn.dataone.org/cn/v2/resolve/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F858%2F1%2F15ad768241d2eeed9f0ba159c2ab8fd5

Both will work with metajam! You will get the same output either way.

We have not tested metajam's compatibility with the home sites of all DataOne member nodes. If you are using metajam to download data from a member node other than ADC, EDI, or KNB we highly recommend retrieving the data url from the DataOne instance of the package (example 2 above)."

Made a vignette with use cases for ISO and EML datasets. I made it clear that there would be different outputs between the two, but the vignette is pretty long at this point, so a user would have to be pretty interested to scroll down and find that information.

kristenpeach commented 3 years ago

I did not mean to remove my assignment?? Idk why it says that

kristenpeach commented 3 years ago

Progress

The download_eml_data.R function that is called within the new download_d1_data function is not working right, so I've been working on debugging that. It feels like I will probably have it working by the end of the day tomorrow, but now that I've put that out there into the universe, something will probably go terribly wrong.

kristenpeach commented 3 years ago

Progress

So the data_url for the EML example dataset I am using appears to be the problem. It is a weirdly short data url: "https://cn.dataone.org/cn/v2/resolve/df35b.296.15"

The function fails at the metadata object creation stage (meta_obj <- dataone::getObject(mn, meta_id)), even with a valid mn.

This does not happen when I use other datasets from the same member node, so maybe it's just a broken url for that particular dataset?

One problem I found and fixed is that the header of the XML doc differs between EML versions. So while this check works fine for some EML versions, it fails to detect other versions and then thinks they are ISO:

```r
if (grepl("eml://ecoinformatics.org/eml-", meta_raw) == FALSE) {
  warning("Metadata is in ISO format")
  new_dir <- download_ISO_data(meta_raw, meta_obj, meta_id, data_id,
                               metadata_nodes, mn, path = path)
} else if (grepl("eml://ecoinformatics.org/eml-", meta_raw) == TRUE) {
  warning("Metadata is in EML format")
  new_dir <- download_EML_data(meta_obj, meta_id, data_id,
                               metadata_nodes, mn, path = path)
}
```

So I just simplified the string grepl looks for to "ecoinformatics.org". Now I'll just have to test a few more ISO cases to make sure that string does not miraculously appear in the raw metadata of an ISO XML.
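For context on why the original pattern missed some versions: EML 2.2 switched the namespace from `eml://ecoinformatics.org/eml-2.x` to `https://eml.ecoinformatics.org/eml-2.2.0`, so matching the bare `ecoinformatics.org` substring covers both. A quick check on sample header strings (made up for illustration, not real metadata):

```r
headers <- c(
  eml_2_1 = '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">',
  eml_2_2 = '<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0">',
  iso_gmd = '<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd">'
)
sapply(headers, function(h) grepl("ecoinformatics.org", h, fixed = TRUE))
#> eml_2_1 eml_2_2 iso_gmd
#>    TRUE    TRUE   FALSE
```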

I fixed a few bugs and now it works fine, except that it prints some extra messages I don't understand, so I'm trying to sort that out. Here is the printed output from running an EML data url through the new package:

```
https://pasta.lternet.edu/package/data/eml/edi/853/1/1e02df107f9a4d5045bff3e4440ee202 is the latest version for identifier https://pasta.lternet.edu/package/data/eml/edi/853/1/1e02df107f9a4d5045bff3e4440ee202
Downloading metadata https://pasta.lternet.edu/package/metadata/eml/edi/853/1 ...
Download metadata complete
Metadata is in EML format
New names:
Downloading data https://pasta.lternet.edu/package/data/eml/edi/853/1/1e02df107f9a4d5045bff3e4440ee202 ...
Download complete
```

The "New names:" bit is the unexpected part of that print out. The data and metadata all download as expected though.

kristenpeach commented 3 years ago

Progress

Tried to download data from a few more datasets with my new functions. Most went well but some did not.

The Gulf of Alaska data portal has some datasets with weird pids. I noted one in an update above but I will note it again here: https://search.dataone.org/view/df35b.298.15

This other dataset (https://search.dataone.org/view/urn%3Auuid%3A3249ada0-afe3-4dd6-875e-0f7928a4c171) had normal-looking pids, but I got an interesting error associated with my member node loop when I tried to download data from it:

"Error in .local(x, ...) : get() error: Hazelcast Instance is not active!"

When I set the data url () and run SMALL_download_d1_data.R line by line, the error happens in the member node loop (as I expected), in lines 88-101.

I thought this was a member node issue but I think it's a memory issue:

https://community.atlassian.com/t5/Bitbucket-questions/what-causes-Hazelcast-instance-to-become-inactive/qaq-p/80060

https://stackoverflow.com/questions/23293072/suddenly-im-getting-hazelcast-instance-is-not-active

I was having so many problems with the eml package on the server that I have been using my local R, and it seems like it may be a memory issue. So I tried again in RStudio on aurora and got the same error about downloading EML (see updates above). Then I cleared up a bunch of memory on my laptop, tried again in my local R, and got the same Hazelcast Instance error. Some pages online say I should just wait a few minutes and try again, but I have tried a few times, even after terminating and restarting my RStudio session and clearing my cache. I have been trying to understand the help pages I linked above to solve the problem, but I am out of my depth here.

When I run my download_d1_data function on datasets that have previously worked fine, it still works fine, so I think the problem may be specific to the node id/member node I was trying to use as input for getObject. I did notice it was a new member node I had not seen before ("urn:node:mnUCSB1"), and that the data_nodes list and the metadata_nodes list do not match, which I am sure is a problem. The more I look at the member node loop, the less sure I am that it is doing what I think it's doing.
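One way to act on the mismatch between the two node lists — a sketch assuming the dataone package, with `data_id` and `meta_id` already known — is to only loop over nodes that appear in both:

```r
cn <- dataone::CNode("PROD")
data_node_ids <- dataone::resolve(cn, data_id)$data$nodeIdentifier
meta_node_ids <- dataone::resolve(cn, meta_id)$data$nodeIdentifier

# Only query nodes that claim to hold BOTH the data and the metadata
shared_nodes <- intersect(data_node_ids, meta_node_ids)
```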

If you load the tidyverse and have the metajam::check_version function, the metajam::utils functions, the metajam::tabularize_eml function, the download_EML_data.R function, and the SMALL_download_d1_data.R function loaded, you can run this code and see the problem:

```r
library(tidyverse)
path_folder <- "Data_test_Gulf_of_alaska"

# URL to download the dataset from DataONE
data_url <- "https://cn.dataone.org/cn/v2/resolve/urn%3Auuid%3Aae595730-172a-43d0-91f8-3173663d7dce"
dir.create(path_folder, showWarnings = FALSE)

# Download the dataset and associated metadata
data_folder <- SMALL_download_d1_data(data_url = data_url, path = path_folder)
```

Compare that to this which runs fine:

```r
library(tidyverse)
path_folder <- "Data_test_eml"

# URL to download the dataset from DataONE
data_url <- "https://cn.dataone.org/cn/v2/resolve/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F853%2F1%2F1e02df107f9a4d5045bff3e4440ee202"
dir.create(path_folder, showWarnings = FALSE)

# Download the dataset and associated metadata
data_folder <- SMALL_download_d1_data(data_url = data_url, path = path_folder)
```

I think the root of the issue is the member node loop. If we really want people to be using the data_url from DataONE, EDI, KNB, or ADC, we should maybe enforce that within the function. Right below the first line of the code chunk below is the place we could do that. If we pull all possible instances of the data but then select only the member nodes we know work well ("urn:node:KNB", "urn:node:ARCTIC", "urn:node:EDI") from that list... then we may be able to skip the loop?

Lines 54-57 (https://github.com/kristenpeach/metajam/blob/master/R/SMALL_download_d1_data.R)

```r
data_nodes <- dataone::resolve(dataone::CNode("PROD"), data_id)
d1c <- dataone::D1Client("PROD", data_nodes$data$nodeIdentifier[[1]])
all_mns <- c(data_nodes$data$nodeIdentifier)
```
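The "skip the loop" idea could be a couple of lines: intersect the resolved nodes with a known-good allow-list and fall back to the first resolved node otherwise. A sketch on a made-up `all_mns` (the real one comes from `dataone::resolve()`):

```r
known_good <- c("urn:node:KNB", "urn:node:ARCTIC", "urn:node:EDI")

# Sample resolved node list, for illustration only
all_mns <- c("urn:node:mnUCSB1", "urn:node:KNB", "urn:node:RW")

preferred <- intersect(known_good, all_mns)
mn_to_use <- if (length(preferred) > 0) preferred[[1]] else all_mns[[1]]
mn_to_use
#> [1] "urn:node:KNB"
```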

mbjones commented 3 years ago

@kristenpeach Hazelcast is a software component that we use in DataONE and on some of our repositories, including the Gulf of Alaska Data Portal, the KNB, and the Arctic Data Center, among others. Hazelcast errors like you see above are indicators of a big problem on the repository and are not likely to be specific to one dataset. Let's get on slack with some of the devs and see what's up there.

taojing2002 commented 3 years ago

I increased the max memory allocation for tomcat from 2G to 4G. Then restarted tomcat.

kristenpeach commented 3 years ago

@mbjones Oh interesting! Thank you for jumping in, it looks like I was not going to solve that on my own. Thank you @taojing2002 !

kristenpeach commented 3 years ago

Progress

I figured out (at least one of the reasons) why the function was working for some data packages with ISO metadata and not others. It looks like ISO metadata is not parsed exactly the same each time, so the 'place' where I found metadata version info for one data package ("doc.children.MD_Metadata.children.metadataStandardName.children.CharacterString.children.text.value") is not the same 'place' it is listed in others.

```r
meta_iso_xml <- XML::xmlTreeParse(meta_raw)

metadata2 <- meta_iso_xml %>%
  unlist() %>%
  tibble::enframe()

ISO_type <- metadata2 %>%
  filter(name == "doc.children.MD_Metadata.children.metadataStandardName.children.CharacterString.children.text.value")

metadata <- metadata %>%
  mutate(value = ifelse(name == "@type", ISO_type$value, value))
```

Even when I ask it to look for something less specific, like:

```r
ISO_type <- metadata2 %>% filter(name %in% "metadataStandardName")
```

it often fails to find it.

In the 'main' function SMALL_download_d1_data.R, I already have the lines of code that decide whether the metadata is ISO or EML. So we could just say that anything passed to the download_ISO_data.R function has an xml.version of 'ISO' and anything passed to the download_EML_data.R function has an xml.version of 'eml'.

But this was a good exercise, because I am realizing that some of the other fields are also not finding the info they are looking for, because of slight differences in the ISO XML 'location'. I think I can improve this slightly by making the filters less specific (%in% instead of ==), but at minimum I can write a warning message saying that some summary metadata may be absent if the metadata is this type of ISO.
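One caveat on `%in%` vs `==`: for a single string they behave the same (both need an exact match on the flattened name). A genuinely less specific filter would match a substring, e.g. with `grepl()`. A sketch on made-up enframed names (`metadata2` here is sample data, assuming dplyr and tibble; the real one comes from the `xmlTreeParse()` pipeline above):

```r
library(dplyr)
library(tibble)

# Sample flattened ISO names, for illustration only
metadata2 <- tibble(
  name = c(
    "doc.children.MD_Metadata.children.metadataStandardName.children.CharacterString.children.text.value",
    "doc.children.MD_Metadata.children.contact.children.CI_ResponsibleParty"
  ),
  value = c("ISO 19115", "Some Person")
)

# Substring match survives small differences in the surrounding path
ISO_type <- metadata2 %>% filter(grepl("metadataStandardName", name))
ISO_type$value
#> [1] "ISO 19115"
```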

When I test the function, it sometimes fails because it cannot retrieve a metadata_obj from the member node selected from the list. I think it would be helpful to write up a full issue on this to go along with an informative error message, so that the user can manually set their mn to one of the alternatives in all_mns and try again. I'm mining some clues for how to proceed from some arcticdatautils functions. I'm wondering if we should write another function that lives outside SMALL_download_d1_data.R and download_ISO_data.R (like utils.R) that checks whether an mn is valid. That way we could insert it into SMALL_download_d1_data.R with a message directing users how to try a different mn.
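The utils.R validity check could be sketched like this — `is_usable_mn` is a hypothetical name, and this assumes the dataone package; `getSystemMetadata()` is a lightweight call that both pings the node and confirms it actually holds the object:

```r
library(dataone)

# Hypothetical helper: TRUE if the node is reachable and holds the object
is_usable_mn <- function(cn, node_id, id) {
  mn <- tryCatch(getMNode(cn, node_id), error = function(e) NULL)
  if (is.null(mn)) return(FALSE)
  sysmeta <- tryCatch(getSystemMetadata(mn, id), error = function(e) NULL)
  !is.null(sysmeta)
}

# Inside SMALL_download_d1_data(), something like:
# usable <- Filter(function(n) is_usable_mn(cn, n, meta_id), all_mns)
# if (length(usable) == 0) stop("No usable member node found for this dataset")
```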

When I revert to the old way of finding the right mn (d1c@mn), it works fine. I'm wondering if we should just go back to that and add a thorough error message on failure, directing the user to an issue on the GitHub with instructions for how to manually set the mn. I'm sure there is a more sophisticated way to try each mn programmatically, though, so I'll look into that more first.