NCEAS / metajam

Bringing data and metadata togetheR
https://nceas.github.io/metajam/
Apache License 2.0

Adding support for the Gulf of Alaska Data Portal #132

Closed: brunj7 closed this issue 7 months ago

brunj7 commented 3 years ago

The LTER NGA site uses a specific data repository: https://gulf-of-alaska.portal.aoos.org/#

We need to add support for it to metajam. This data repository is part of DataONE, so we can still rely on its API to access the content. The biggest change is that this repository does not support the EML metadata standard.

We will thus need to:

Data for test: https://doi.org/10.24431/rw1k45w

This should be done on the dev branch.

kristenpeach commented 3 years ago

Progress

Possible message draft for member node glitches: Data packages are often replicated on multiple member nodes. Examples of member nodes include the Atmospheric Radiation Measurement Data Center ("urn:node:ARM"), IEDA: Interdisciplinary Earth Data Alliance ("urn:node:IEDA_EARTHCHEM"), the Nevada Research Data Center ("urn:node:NRDC"), and the Knowledge Network for Biocomplexity ("urn:node:KNB"). Sometimes a dataset becomes unavailable on one of the several member nodes that host a copy of it. This can make download_d1_data.R fail, because it only tries the first member node listed when downloading the data. If you attempt to use the download_d1_data.R function to retrieve data from a DataONE repository and it halts with one of the following errors:

Insert example of error(s) associated with this problem here

Then maybe add a line like: If you encounter this problem, please open a new issue on the metajam GitHub page and provide a minimal reproducible example of how you attempted to use download_d1_data.R. Please be sure to include the data URL that you used so we can track down the member node that is no longer operational (for that dataset) and remove it.

Below I have tried to figure out a way to give the user an option to manually set the member node (mn) to a different node, but it would basically amount to them manually running each line of the function. I think it may be a safer bet to just include a warning message instead: 'You may need to try to retrieve the data from an alternative member node by setting the member node manually.'
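
For what it's worth, a rough sketch of what "setting the member node manually" could look like from the user's side (the node identifier and pid below are placeholders, and this uses the dataone package directly rather than a metajam option):

```r
library(dataone)

cn <- CNode("PROD")
# placeholder: an alternative member node known to hold a replica of the object
mn <- getMNode(cn, "urn:node:KNB")
# placeholder pid of the data object
data_id <- "urn:uuid:<data-pid>"

# fetch the object's bytes directly from that node instead of the first-listed one
raw_bytes <- getObject(mn, data_id)
```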

Then I need to turn this code chunk into the most minimal version of itself:

```r
data_url <- utils::URLdecode(data_url)
data_versions <- check_version(data_url, formatType = "data")

if (nrow(data_versions) == 1) {
  data_id <- data_versions$identifier
} else if (nrow(data_versions) > 1) {
  # get most recent version
  data_versions$dateUploaded <- lubridate::ymd_hms(data_versions$dateUploaded)
  data_id <- data_versions$identifier[data_versions$dateUploaded == max(data_versions$dateUploaded)]
} else {
  stop("The DataONE ID could not be found for ", data_url)
}

# Set Nodes ------------
data_nodes <- dataone::resolve(dataone::CNode("PROD"), data_id)
all_mns <- c(data_nodes$data$nodeIdentifier)
cn <- dataone::CNode()
meta_id <- dataone::query(
  cn,
  list(q = sprintf('documents:"%s" AND formatType:"METADATA" AND -obsoletedBy:*', data_id),
       fl = "identifier")) %>%
  unlist()

# Generate list of all member nodes that 'host' this data package by using the meta_id and
metadata_nodes <- dataone::resolve(cn, meta_id)
mn <- dataone::getMNode(cn, "urn:node:RW")
```

Spun my wheels a little bit because there were so many different errors; I need a cohesive list of the circumstances in which the current version of the function fails. But I got back on track.

Worked on making the download_ISO_data.R function work with a wider variety of ISO XML. The field names are very long, but if I can have it look for at least some of the most common field names I think that will be better. I'm talking about this section of the code in download_ISO_data.R:

```r
metadata <- metadata %>%
  dplyr::mutate(name = dplyr::case_when(
    grepl("@type", name) ~ "xml.version",
    grepl("title", name) ~ "title",
    grepl("individualName", name) ~ "people",
    grepl("abstract", name) ~ "abstract",
    grepl("identificationInfo.MD_DataIdentification.descriptiveKeywords.MD_Keywords.keyword.CharacterString", name) ~ "keyword",
    grepl("doc.children.MD_Metadata.children.metadataStandardName.children.CharacterString.children.text.value", name) ~ "Metadata_ISO_Version",
    grepl("geographicDescription", name) ~ "geographicCoverage.geographicDescription",
    grepl("identificationInfo.MD_DataIdentification.extent.EX_Extent.geographicElement.EX_GeographicBoundingBox.westBoundLongitude.Decimal", name) ~ "geographicCoverage.westBoundingCoordinate",
    grepl("identificationInfo.MD_DataIdentification.extent.EX_Extent.geographicElement.EX_GeographicBoundingBox.eastBoundLongitude.Decimal", name) ~ "geographicCoverage.eastBoundingCoordinate",
    grepl("identificationInfo.MD_DataIdentification.extent.EX_Extent.geographicElement.EX_GeographicBoundingBox.northBoundLatitude.Decimal", name) ~ "geographicCoverage.northBoundingCoordinate",
    grepl("identificationInfo.MD_DataIdentification.extent.EX_Extent.geographicElement.EX_GeographicBoundingBox.southBoundLatitude.Decimal", name) ~ "geographicCoverage.southBoundingCoordinate",
    grepl("identificationInfo.MD_DataIdentification.extent.EX_Extent.temporalElement.EX_TemporalExtent.extent.TimePeriod.beginPosition", name) ~ "temporalCoverage.beginDate",
    grepl("identificationInfo.MD_DataIdentification.extent.EX_Extent.temporalElement.EX_TemporalExtent.extent.TimePeriod.endPosition", name) ~ "temporalCoverage.endDate",
    grepl("dataQualityInfo.DQ_DataQuality.report.DQ_ConceptualConsistency.evaluationMethodDescription.CharacterString", name) ~ "methods",
    grepl("objectName", name) ~ "objectName",
    grepl("online.url", name) ~ "url",
    grepl("dataQualityInfo.DQ_DataQuality.lineage.LI_Lineage.statement.CharacterString", name) ~ "methods"
  )) %>%
  dplyr::filter(!is.na(name)) %>%
  dplyr::mutate(value = stringr::str_trim(value)) %>%
  dplyr::distinct() %>%
  dplyr::group_by(name) %>%
  dplyr::summarize(value = paste(value, collapse = "; "), .groups = "drop") %>%
  dplyr::mutate(value = gsub("\n", "", value))
```

The data frame version of the ISO XML has these super long and specific field names that correspond to the fields we need (like the XML version). Some are retrieved from the meta_iso_xml object and some from the 'eml' object, which is the ISO metadata coerced into EML by as_emld.
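
For context, a tiny illustration of where those long dotted names come from: flattening nested ISO metadata into a name/value table. The toy list below is made up; in the real function the input would be something like the meta_iso_xml object.

```r
library(tibble)

# toy stand-in for parsed ISO metadata (the real object is far deeper)
meta_list <- list(
  identificationInfo = list(
    MD_DataIdentification = list(
      citation = list(title = list(CharacterString = "Example dataset title"))
    )
  )
)

# unlist() glues the nested names together with ".", producing the long field names,
# and enframe() turns the result into the name/value table matched by case_when() above
metadata <- enframe(unlist(meta_list), name = "name", value = "value")
metadata$name
#> [1] "identificationInfo.MD_DataIdentification.citation.title.CharacterString"
```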

Now I can't replicate the error I was getting associated with this problem, and the function is working fine again?! Going to start with a clean slate tomorrow, try the function on several data packages from several member nodes, and see if I can put together a list of the cases where it fails.

mbjones commented 3 years ago

@kristenpeach it is fairly easy to instruct dataone to try each of the replica copies on DataONE until one is found that does not fail. I'm pretty sure the dataone package is supposed to try all replicas before it fails, as shown in https://github.com/DataONEorg/rdataone/pull/266 and https://github.com/DataONEorg/rdataone/issues/228

If you can produce a reprex for situations where we are not trying all of the replicas, this problem can likely be fixed pretty easily.
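
A reprex skeleton for that could be as simple as the following (the URL is a placeholder; the real failing data URLs still need to be collected):

```r
# minimal reprex sketch: substitute a data URL whose first-listed replica node is down
library(metajam)

data_url <- "https://cn.dataone.org/cn/v2/resolve/<pid-hosted-on-an-unresponsive-node>"
download_d1_data(data_url, path = tempdir())
# expected: the download falls back to another replica instead of erroring out
```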

kristenpeach commented 3 years ago

Thank you @mbjones !!!

kristenpeach commented 3 years ago

Progress

Tried the new functions on a variety of member nodes, using data URLs from data packages with both EML and ISO XML files.

ISO

Member node: Research Workspace

Lisa Eisner and Michael Lomas. Phytoplankton identifications in the northern Bering and Chukchi seas, quantified with FlowCAM image analysis, Arctic Integrated Ecosystem Research Program, August-September 2017. Research Workspace. 10.24431/rw1k5ac, version: 10.24431_rw1k5ac_20210709T212354Z.

Jens Nielsen, Louise Copeman, Michael Lomas, and Lisa Eisner. Fatty acid seston samples collected from CTD samples in N. Bering and Chukchi Seas during Arctic Integrated Ecosystem Research Program, from research vessel Sikuliaq June 2017. Research Workspace. 10.24431/rw1k59z, version: 10.24431_rw1k59z_20210708T234958Z.

EML

Member node: Arctic Data Center (via DataONE)

William Daniels, Yongsong Huang, James Russell, Anne Giblin, Jeffrey Welker, et al. 2021. Soil Water, plant xylem water, and leaf wax hydrogen isotope survey from Toolik Lake Area 2013-2014. Arctic Data Center. doi:10.18739/A2S17ST50.

Caitlin Livsey, Reinhard Kozdon, Dorothea Bauch, Geert-Jan Brummer, Lukas Jonkers, et al. 2021. In situ Magnesium/Calcium (Mg/Ca) and oxygen isotope (d18O) measurements in Neogloboquadrina pachyderma shells collected in 1999 by a MultiNet tow from different depth intervals in the Fram Strait. Arctic Data Center. doi:10.18739/A2WS8HN0X.

Member node: KNB (via DataONE)

Darcy Doran-Myers. 2021. Data: Density estimates for Canada lynx vary among estimation methods. Knowledge Network for Biocomplexity. urn:uuid:e9dc43c2-210f-40dc-86fb-a6ece2f5fd03.

This one does not work; it fails within the download_EML_data.R function at lines 38-40:

```r
entity_data <- entity_objs %>%
  purrr::keep(~any(grepl(data_id,
                         purrr::map_chr(.x$physical$distribution$online$url, utils::URLdecode))))
```

The message the user gets is "Input does not appear to be an attributeList.", but that is only because the entity_data object is empty: the lines above do not produce anything. When I inspect this dataset on the web interface (https://search.dataone.org/view/urn%3Auuid%3Ae9dc43c2-210f-40dc-86fb-a6ece2f5fd03), it looks like there should be an attribute list for this dataset. Happily, this problem does not seem to have anything to do with member nodes.

kristenpeach commented 3 years ago

Progress

It looks like this is a problem for all or many data packages on KNB. By that I mean that when I try the SMALL_download_d1_data.R function on any data URL from a KNB dataset, it fails with the same error about the data table lacking an attribute list. Because KNB uses EML, I can test the original download_d1_data.R function to see if the problem is old or new. When I use the original download_d1_data.R function on the same data URLs, the function runs through, but it fails to produce an attribute-level metadata table: it 'fails' at the same place my function does but just keeps running and produces the summary metadata. This feels like a good problem to work on.

At first glance, when I compare the ADC dataset that worked great (https://search.dataone.org/view/doi%3A10.18739%2FA2S17ST50) and the KNB one that is failing (https://search.dataone.org/view/doi%3A10.5063%2FR78CMB), there are a couple of differences visible on the web interface alone. The ADC attribute info has annotations, which is lovely, but I would not assume they are required for metajam to work. The KNB csv file I was trying to download is stored as an OtherEntity instead of a datatable, but I am not sure that is what is causing problems either. The problem is that I don't understand what the lines that are failing actually do (see note about lines 38-40 above). Spending some time trying to understand purrr better so I can figure it out. I will go through the function line by line with a data URL that IS working to see what those lines are supposed to do, which should help me figure out why it's failing for KNB.

AHA. When I compare the successful data package to the unsuccessful one they diverge here:

entity_data <- entity_objs %>%
  purrr::keep(~any(grepl(data_id,
                         purrr::map_chr(.x$physical$distribution$online$url, utils::URLdecode))))

In the unsuccessful (KNB) package, the entity of interest within the entity_objs list lacks a 'physical' slot, so purrr can't find the URL and does not keep the object. Fun!
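
A quick check that confirms this, assuming entity_objs is the list built in download_EML_data.R:

```r
# TRUE for entities with no physical slot; these are exactly the ones the
# purrr::keep() call above silently drops
purrr::map_lgl(entity_objs, ~is.null(.x$physical))
```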

mbjones commented 3 years ago

stored as an OtherEntity instead of a datatable

@kristenpeach this is a likely issue, and I was about to suggest it when reading your comment. Metajam needs to look in all of the allowed locations in EML for attribute info, and not assume that all providers will use just dataTable. I suspect it's a simple fix: add an additional path to be searched for the CSV entity info before you do that search for the attributes. I'll bet entity_objs does not contain the entities that are described with otherEntity, spatialVector, spatialRaster, etc.

kristenpeach commented 3 years ago

Thank you @mbjones ! I think the function does look for other entities here:

entities <- c("dataTable", "spatialRaster", "spatialVector", "storedProcedure", "view", "otherEntity") entities <- entities[entities %in% names(eml$dataset)]

entity_objs <- purrr::map(entities, ~EML::eml_get(eml, .x)) %>% # restructure so that all entities are at the same level purrr::map_if(~!is.null(.x$entityName), list) %>% unlist(recursive = FALSE)

But because the user is trying to download one data file (not all of the files in the package), the purrr lines I noted above look for the specific file associated with the data_url the user provided, so that only the data file of interest is kept and downloaded (and all others are dropped). It was throwing everything out, though, because the otherEntity file does not have a physical (so .x$physical$distribution$online$url was empty). Kind of tricky to explain. But here is the entity_objs list for the KNB otherEntity:

[Screenshot: entity_objs list for the KNB otherEntity]

And here is the entity_obj of the ADC dataset (which works fine with metajam):

[Screenshot: entity_objs for the ADC dataset]

So if I am understanding the problem correctly (big if), I think I can just tell R to keep only the items in the entity_objs list where the data_id and the "id" match (instead of looking for a match in the URL). Then it should be able to identify the correct file whether it is a dataTable or an otherEntity. Feel free to let me know if I am way off the mark.

mbjones commented 3 years ago

That sounds reasonable, although I haven't had a chance to look at the details. @jeanetteclark and @laijasmine have worked with these structures a lot and might have good suggestions and maybe some code....

laijasmine commented 3 years ago

I'm kind of coming in with little to no context here, so happy to jump on a call if that would be more helpful. If I understand correctly, based on what you said:

I think I can just tell R to only keep the items in the entity_objs list where the data_id and the "id" match (instead of looking for a match in the url).

Yes, you can get the data pid (data_id?) using the id slot in the entities. The one thing to note is that, to avoid issues with the : character, the colons in urn:uuid: are replaced with dashes (-).
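
For illustration, using the KNB pid from the earlier comment, that convention looks like this:

```r
# pid as it appears in the data URL / system metadata
"urn:uuid:e9dc43c2-210f-40dc-86fb-a6ece2f5fd03"
# the same identifier as stored in the entity's id slot (":" swapped for "-")
gsub(":", "-", "urn:uuid:e9dc43c2-210f-40dc-86fb-a6ece2f5fd03")
#> [1] "urn-uuid-e9dc43c2-210f-40dc-86fb-a6ece2f5fd03"
```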

jeanetteclark commented 3 years ago

There are a few different ways to match the data file with the metadata in a dataset, none of which are 100% guaranteed to work (it depends entirely on how well the metadata were constructed). I would try to match the pid in the following ways:

  1. Data distribution URL
  2. @id
  3. entityName (match to system metadata fileName)

I believe this is how metacatUI operates, though I'm not sure if the order is the same or not.
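
A sketch of that fallback order might look like the following (match_entity and sysmeta_filename are made-up names; this is not metajam's current code):

```r
match_entity <- function(entity_objs, data_id, sysmeta_filename = NULL) {
  # 1. data distribution URL (what download_EML_data.R already does)
  hit <- purrr::keep(entity_objs, ~any(grepl(
    data_id,
    purrr::map_chr(.x$physical$distribution$online$url, utils::URLdecode))))
  if (length(hit) > 0) return(hit)

  # 2. the @id slot, where ":" is stored as "-"
  hit <- purrr::keep(entity_objs, ~any(grepl(gsub(":", "-", data_id), .x$id)))
  if (length(hit) > 0) return(hit)

  # 3. entityName matched against the system metadata fileName
  purrr::keep(entity_objs, ~identical(.x$entityName, sysmeta_filename))
}
```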

kristenpeach commented 3 years ago

@mbjones @laijasmine @jeanetteclark Thank you everyone! I think I have plenty of options to try. I appreciate the help! Just a heads up that Julien and I have a system where I basically report my progress on this issue page every day. I'm not sure if GitHub emails everyone tagged in the issue every time I update it, but that could get really annoying for you, so sorry in advance!

kristenpeach commented 3 years ago

Progress

Spent way too much time trying to figure out how to find and replace a character string in a nested list. I wanted to replace "-" with ":" in the 'id' slot of each entry of the list (each entity) so that I could match it with the data_id. Then I realized I could just swap them in the data_id instead...

```r
temp_data_id <- gsub("\\:", "\\-", data_id)
entity_data <- entity_objs %>%
  purrr::keep(~any(grepl(temp_data_id, .x$id)))
```

Seems to work fine though! Thanks everyone! It has worked on the couple of datasets I've tried, but I will keep trying more until I find the next hiccup.

This feels like something I should understand by now, but I don't really get how those KNB datasets don't have a physical. I know that at the ADC we sometimes had to 'set' the physical (sysmeta_to_eml_physical), i.e. basically pull information already "known" in the sysmeta, like file size, into an 'EML physical' object. If someone submits data to the ADC and they do everything right on the web interface, and the data team does not have to fix anything, isn't the physical set automatically? Just trying to wrap my head around this. But when I dig into the schema at the otherEntity level (https://eml.ecoinformatics.org/schema/), it seems like there are a lot of fields that 'could' be there in addition to the basic entityName, entityDescription, and attributeList. So is physical considered an 'optional' element of EML, and KNB just happens not to use it?

laijasmine commented 3 years ago

Yeah, the physical needs to be set manually by someone on the team when we process a dataset; it is not something set automatically. We also need to update the physical if the file is replaced (info like the file name and size might be slightly different). So, since no one is actively reviewing the datasets that come through the KNB, the physical isn't included in the metadata.
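
For reference, a rough sketch of how a physical can be filled in manually with the EML R package (the file name and URL are placeholders, and the actual team workflow may use helpers like sysmeta_to_eml_physical instead):

```r
library(EML)

# build a physical element for a local copy of the data file
phys <- set_physical("my_data_file.csv")

# add a distribution URL so tools like metajam can match the entity to the data pid
phys$distribution$online$url <- "https://cn.dataone.org/cn/v2/resolve/<data-pid>"

# attach it to the entity in the EML document before updating the metadata
eml$dataset$otherEntity[[1]]$physical <- phys
```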

kristenpeach commented 3 years ago

Progress

Cleaned out some unused code from all the new functions. Worked on writing testthat tests for the download_EML_data.R and download_ISO_data.R functions that are called within download_d1_data.R. Not sure how many test cases are appropriate, but I will test data URLs from different member nodes, just because that has caused problems before.
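
For example, one of those tests might look roughly like this (the URL is a placeholder for a data-file URL from the packages tested above, and it assumes download_d1_data returns the download folder path):

```r
library(testthat)
library(metajam)

test_that("download_d1_data handles an ISO-described data file from Research Workspace", {
  # placeholder: substitute a real data-file URL from one of the packages above
  data_url <- "https://cn.dataone.org/cn/v2/resolve/<data-pid>"
  out_dir <- download_d1_data(data_url, path = tempdir())
  expect_true(dir.exists(out_dir))
})
```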

Discussed with Julien and am going to open a pull request.

eblondel commented 2 years ago

Dear all, I discovered this issue while attempting to use the geometa converters between metadata objects :-) In case you need to exchange on that, feel free to contact me or post an issue on the geometa repository. The geometa converters were the result of some R&D activity under a project funded by the R Consortium some years ago to consolidate geometa's standards coverage and explore new bridges between metadata standards. It has been a while since I looked into these converters, but I would be happy to get into them again. Cheers

brunj7 commented 8 months ago

See PR #134