DataONEorg / rdataone

R package for reading and writing data at DataONE data repositories
http://doi.org/10.5063/F1M61H5X

How to download data packages without getPackage? #201

Closed LiamBurke24 closed 6 years ago

LiamBurke24 commented 7 years ago

Hi @all, I have been trying for the past few months to develop a download-by-DOI function for DataONE. This has been considerably more troublesome than I had originally anticipated.

I am very grateful to @gothub for his help along the way, but I am still stuck. Peter pointed me towards the getPackage function as the best way to download a data package from DataONE, and when it works it truly is. Unfortunately, only a handful of Member Nodes have adopted this optional function, which makes it more or less dead on arrival. I am reaching out to see whether there is another way to download the files in a data package (metadata and data) without getPackage. Many thanks! Some previous attempts are shown below:

I was hoping to download this dataset from LTER (which hasn't upgraded to getPackage yet).

id <- "doi:10.6073/pasta/63ad7159306bc031520f09b2faefcf87" 
PEcAn.data.land::id.resolveable(id) #function that puts the doi in solr format for the query below

library(dataone) 
cn <- CNode("PROD")
queryParams <- list(fq=doi1) # doi1 is the Solr-formatted DOI; it is not defined in this snippet
result <- query(cn, solrQuery=queryParams, as="data.frame", parse=FALSE)
result

pid <- result[1,'id'] 

locations <- resolve(cn, pid)
mnId <- locations$data[1, "nodeIdentifier"] # your original code points to the second row but that points to the wrong MN
mn <- getMNode(cn, mnId)

obj <- getObject(mn, pid)
metadataXML <- rawToChar(obj)

The result is the following user-unfriendly XML, returned as a single character string:

[1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<rdf:RDF\n   xmlns:cito=\"http://purl.org/spar/cito/\"\n   xmlns:dc=\"http://purl.org/dc/elements/1.1/\"\n   xmlns:dcterms=\"http://purl.org/dc/terms/\"\n   xmlns:foaf=\"http://xmlns.com/foaf/0.1/\"\n   xmlns:ore=\"http://www.openarchives.org/ore/terms/\"\n   xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n   xmlns:rdfs1=\"http://www.w3.org/2001/01/rdf-schema#\"\n>\n  <rdf:Description rdf:about=\"https://cn.dataone.org/cn/v1/resolve/https:%2F%2Fpasta.lternet.edu%2Fpackage%2Freport%2Feml%2Fknb-lter-hfr%2F103%2F29\">\n 

Is there no other way to download data that returns output like the getPackage function?
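
The query in the snippet above depends on `doi1`, which the post never defines. As a minimal sketch of what such a value might look like, the hypothetical helper below (`solr_id_filter`, not part of dataone or PEcAn) wraps an identifier in a quoted Solr phrase so that the colon in `doi:` is not read as a field separator. Whether this matches what `id.resolveable` produces is an assumption.

```r
# Hypothetical helper (not from dataone or PEcAn): build a Solr filter
# that matches a single identifier exactly.
solr_id_filter <- function(id) {
  # Inside a quoted phrase, only backslashes and double quotes need escaping.
  escaped <- gsub('(["\\\\])', "\\\\\\1", id)
  paste0('id:"', escaped, '"')
}

id <- "doi:10.6073/pasta/63ad7159306bc031520f09b2faefcf87"
doi_filter <- solr_id_filter(id)
print(doi_filter)
```

The resulting string could then be passed as the `fq` entry of the `solrQuery` list in place of `doi1`.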

gothub commented 7 years ago

@LiamBurke24 You will be able to use getDataPackage() from the dataone package version that we plan to have released by the end of next week. If you want to try this release out before then, you can get a development version by entering the commands:

install.packages("devtools")
library(devtools)
install_github("ropensci/datapack")
install_github("DataONEorg/rdataone")
library(dataone)

Here is a console session showing how the data package can be downloaded, using the development release:

> pkg <- getDataPackage(d1c, id="doi:10.6073/pasta/63ad7159306bc031520f09b2faefcf87", lazyLoad=FALSE, quiet=F)
Downloading package members for package with metadata identifier: https://pasta.lternet.edu/package/metadata/eml/knb-lter-hfr/103/29
Downloaded object at URL https://gmn.lternet.edu/mn/v2/object/https:%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%2Fknb-lter-hfr%2F103%2F29
Downloaded object at URL https://gmn.lternet.edu/mn/v2/object/https:%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fknb-lter-hfr%2F103%2F29%2Fc3311fbfd7ff6b8691fc1133b96d36c1
Downloaded object at URL https://gmn.lternet.edu/mn/v2/object/https:%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fknb-lter-hfr%2F103%2F29%2F29b5d66b99f311eb3b03e3243f606d27
Downloaded object at URL https://gmn.lternet.edu/mn/v2/object/https:%2F%2Fpasta.lternet.edu%2Fpackage%2Freport%2Feml%2Fknb-lter-hfr%2F103%2F29
Downloaded object at URL https://gmn.lternet.edu/mn/v2/object/https:%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fknb-lter-hfr%2F103%2F29%2Fc0a74bcd66c91627aaa60336ab777891
Downloaded object at URL https://gmn.lternet.edu/mn/v2/object/https:%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fknb-lter-hfr%2F103%2F29%2F68d4bfc0c08bccc429f890b563d8587e
Getting resource map with id: doi:10.6073/pasta/63ad7159306bc031520f09b2faefcf87
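
The session above uses `d1c` without showing how it was created. A sketch of one way to set it up follows, assuming `d1c` is a `D1Client` for the production environment and that `"urn:node:LTER"` is the relevant Member Node identifier (both are assumptions, not taken from the session). The wrapper name `fetch_package` is hypothetical.

```r
# Sketch only: wraps client creation and the getDataPackage() call shown above.
# `fetch_package`, the default environment, and the default node id are assumptions.
fetch_package <- function(doi, env = "PROD", node = "urn:node:LTER") {
  # D1Client() pairs a coordinating-node environment with a Member Node id.
  d1c <- dataone::D1Client(env, node)
  # Same arguments as in the console session above.
  dataone::getDataPackage(d1c, id = doi, lazyLoad = FALSE, quiet = FALSE)
}

# Usage (needs the dataone package and network access):
# pkg <- fetch_package("doi:10.6073/pasta/63ad7159306bc031520f09b2faefcf87")
```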

Each DataPackage member still holds its content as raw bytes, so first you need to check what type of data each package member contains, using getValue:

> library(datapack)
> getValue(pkg, name="sysmeta@formatId")
$`https://pasta.lternet.edu/package/data/eml/knb-lter-hfr/103/29/29b5d66b99f311eb3b03e3243f606d27`
[1] "text/csv"

$`https://pasta.lternet.edu/package/data/eml/knb-lter-hfr/103/29/68d4bfc0c08bccc429f890b563d8587e`
[1] "text/csv"

$`https://pasta.lternet.edu/package/data/eml/knb-lter-hfr/103/29/c0a74bcd66c91627aaa60336ab777891`
[1] "text/csv"

$`https://pasta.lternet.edu/package/data/eml/knb-lter-hfr/103/29/c3311fbfd7ff6b8691fc1133b96d36c1`
[1] "text/csv"

$`https://pasta.lternet.edu/package/metadata/eml/knb-lter-hfr/103/29`
[1] "eml://ecoinformatics.org/eml-2.1.0"

$`https://pasta.lternet.edu/package/report/eml/knb-lter-hfr/103/29`
[1] "text/xml"
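
Since the output above is a named list of format identifiers, the CSV members can be picked out by comparing values. The sketch below uses a small stand-in list with made-up, shortened identifiers rather than a real `getValue()` result.

```r
# Stand-in for getValue(pkg, name="sysmeta@formatId"); identifiers are made up.
formats <- list(
  "data/one"     = "text/csv",
  "metadata/eml" = "eml://ecoinformatics.org/eml-2.1.0",
  "report/eml"   = "text/xml",
  "data/two"     = "text/csv"
)

# Keep only the identifiers whose format is CSV.
csv_ids <- names(formats)[unlist(formats) == "text/csv"]
print(csv_ids)
```

Each identifier in `csv_ids` could then be passed to `getMember()` in turn.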

Next, you can extract one of the package members:

> pkgMember <- getMember(pkg, 'https://pasta.lternet.edu/package/data/eml/knb-lter-hfr/103/29/68d4bfc0c08bccc429f890b563d8587e')
> data <- getData(pkgMember)
> writeLines(rawToChar(data), "myData.csv")
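
If the goal is a data frame rather than a file on disk, the raw bytes can also be parsed directly. This is a sketch with made-up CSV content standing in for the result of `getData(pkgMember)`.

```r
# Stand-in for getData(pkgMember): raw bytes of a small, made-up CSV.
raw_bytes <- charToRaw("site,value\nA,1\nB,2\n")

# Convert the bytes to text and parse them without writing a file.
df <- read.csv(text = rawToChar(raw_bytes), stringsAsFactors = FALSE)
print(df)
```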

Please let me know if this helps with your download doi function.

LiamBurke24 commented 7 years ago

This looks VERY helpful. Thank you so much @gothub!!!

LiamBurke24 commented 7 years ago

That all went perfectly except for the last writeLines() command. Not sure where it is putting the data. Thanks!

gothub commented 7 years ago

Yes, so that went to the current working directory (getwd() shows where that is). A better way is to write to the R session's temporary directory, whose location tempdir() returns:

tf <- tempdir()
writeLines(rawToChar(data), file.path(tf, "myData.csv"))
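
As an alternative sketch, `writeBin()` writes the raw bytes unchanged, whereas `writeLines(rawToChar(...))` appends a line terminator and only works for text content. The CSV content below is made up and stands in for the result of `getData()`.

```r
# Stand-in for getData(pkgMember): raw bytes of a small, made-up CSV.
raw_bytes <- charToRaw("site,value\nA,1\nB,2\n")

# Build the output path inside the session temp directory and keep it,
# so the file's location is known afterwards.
out_path <- file.path(tempdir(), "myData.csv")
writeBin(raw_bytes, out_path)

print(out_path)
print(file.exists(out_path))
```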

LiamBurke24 commented 7 years ago

Perfect! Thank you! This is a really really beautiful script. Can't wait for it to come out!

LiamBurke24 commented 7 years ago

Is there a way that I can sign up to be notified of the new release? So excited.

gothub commented 7 years ago

@LiamBurke24 I don't know of a notification mechanism for releases, but you can always check cran.r-project.org/package=dataone and look out for the 2.1.0 release.

LiamBurke24 commented 7 years ago

Great! thank you!

LiamBurke24 commented 7 years ago

Hey @gothub, just following up on that new release! Do you have a new estimate about when that will be coming out? Thanks!

gothub commented 7 years ago

@LiamBurke24 The new release was submitted to CRAN but doesn't appear to be available yet, see https://cran.r-project.org/web/packages/dataone/index.html. I'm out of the office now but will look into this when I can.

LiamBurke24 commented 7 years ago

@gothub OK! thank you so much!

LiamBurke24 commented 7 years ago

hey @gothub, just checking in on that release. Haven't seen any changes on CRAN...

gothub commented 7 years ago

@LiamBurke24 I was out last week, just returning to this today. We are in the middle of the CRAN package review process, which may take a couple more days until this package version has been accepted. I'll post to this issue when the new version has been accepted.

LiamBurke24 commented 7 years ago

I really appreciate that. Thanks so much @gothub!

gothub commented 7 years ago

@LiamBurke24 Please note that the new version of dataone (2.1.0) is now available.