Closed maier-m closed 6 years ago
Hey @maier-m, I feel similarly to @mbjones in terms fo fit since the dataone package was meant to be a 1:1 mapping to the DataONE API. But you're making a fair point that it's confusing to get an Object from DataONE onto your computer.
To address your usability concern, would it be helpful to know the rawToChar(getObject(...))
only works in some cases and that writeBin(getObject(...), save_path)
is a general-purpose, object-format-independent way to save a DataONE Object to disk?
If we implemented this ability as part of the https://github.com/ropensci/datapack package, do you think this would be a good API?
my_obj <- DataObject(node, pid)
writeObject(my_obj, my_path)
or something similar?
Yes @amoeba. I considered this, but I think writeBin
fails if data files get too large.
cn <- CNode('PROD')
mn <- getMNode(cn,'urn:node:ARCTIC')
x <- dataone::getObject(mn, "urn:uuid:39968282-a538-4903-9f31-8c42f0f5ea9e")
path <- ""
writeBin(x, path)
We can use httr
for public files. So altogether there are good approaches, but the idea here was an approach that would work for all files. Although, I do agree, your proposed implementation is much cleaner, maybe there is a workaround for larger files.
I think writeBin fails if data files get too larg
Hrm, that's either a bug in Metacat/DataONE or in R. What's the failure code?
We can use httr for public files
You can also use httr for private files. In fact, that's how the dataone
package works.
library(httr)
r <- GET("http:/arcticdata.io/metacat/d1/mn/v2/object/urn:uuid:39968282-a538-4903-9f31-8c42f0f5ea9e",
add_headers(Authorization = paste("Bearer", your_token))
I'm testing the writeBin
method on your large file just to see.
Stepping back, what do you think about @mbjones' suggestion about the datapack
DataObject
class and implementing this there?
Great point about httr
. We could specify tokens directly. Its nice to use the dataone
package which handles that for you automatically but not necessary I suppose.
For sure.
I see your error:
writeBin(dataone::getObject(mn, "urn:uuid:39968282-a538-4903-9f31-8c42f0f5ea9e"), out)
Error in writeBin(dataone::getObject(mn, "urn:uuid:39968282-a538-4903-9f31-8c42f0f5ea9e"), :
long vectors not supported yet: ../../../../R-3.4.4/src/main/connections.c:4153
Looks like there's a 2GB limit to writeBin.
What do you think about the proposed DataObject
way?
For rdataone
, the D1Client
class has methods that are built on the DataONE API methods. This class has the getDataObject()
method that downloads an object from DataONE and constructs a
DataObject, so you could have:
d1c <- D1Client("PROD", "urn:node:KNB")
obj <- getDataObject(d1c, "mypid")
writeData(obj, "mydata.csv")
The DataObject that is downloads has the sysmetadata, so could use the mediaType
and
fileName
if necessary in determining how to write out the file.
This could be shortened to
writeData(d1c, "mypid", "mydata.csv")
This method would download the object and write it out to disk, using the sysmetadata if needed.
writeBin
limits?datapack
?If a new function is added to datatone
or datapack
, it would determine the best way to write out the file. I'm not sure what the alternative to writeBin
would be, but the new function could use another method to write out large files.
I may be mistaken, but it seems like running
httr(url,
add_headers(Authorization = paste("Bearer", token),
httr::write_disk(path))
is the best option for reliably downloading an object from dataone (instead of getting the binary then writing the binary which R struggles with on large files).
Does if make sense to have a function downloadData(node, pid, path)
? I think this is helpful but only if it works for all data objects.
@maier-m downloadObject(node, pid, path)
looks fine to me
Regardless of how this is implemented, we still haven't answered the question of which package this function belongs in. I think that is fundamental to what it does and what its parameters are. For example, in datapack
, I would expect downloadObject(pid, path)
would be more appropriate, because datapack doesn't know about nodes. Within the dataone
package, we already have D1Client::getDataObject(d1client, pid)
, which I think could be overloaded to D1Client::getDataObject(d1client, pid, filepath)
. This has the advantage of keeping the system metadata bundled, and can use the DataObject
mechanism for storing the file locally in the filesystem rather than in memory. This would be accomplished by downloading the file to disk locally, and then setting the filename
parameter to the DataObject contructor when calling new()
.
I agree with your points Matt, but maybe am still confused by the architecture. I believe getDataObject
calls getObject
to get the data and so I would still have to add a filepath variable to getObject
as I did before or create some much more complex workaround.
I think this function belongs in dataone as the dataone
package description is
The R package dataone provides read/write access to data and metadata
as opposed to
The datapack R package provides an abstraction for collating heterogeneous collections of data objects and metadata into a bundle that can be transported and loaded into a single composite file.
for datapack
.
@maier-m I agree with your logic that this belongs in dataone
, but still would rather not see it as part of getObject's signature because of the API changes. It could however be invisibly implemented there and that would be fine. Overriding D1Client::getDataObject
to include an optional path seems better to me.
Regarding saving files using their PIDs as filenames -- we do that in some apps, but you'll need to mangle the PID to use it as a filename because some PIDs contain characters tha are not legal in names (such as '/' and ':' on some OSes). We have some PID-to-filename functions in some of our other code for reference.
@mbjones, Can you explain what you mean by invisibly implemented
?
I understand the desire to use D1Client::getDataObject
but am confused by:
D1Client::getDataObject
calls getObject
so I still think I would still need the path command in getObject
unless I significantly rewrote D1Client::getDataObject
The method param lists getObject", signature("MNode"), function(x, pid, check=as.logical(FALSE), path = NULL
passes tests on macOS and windows 7 (which have the old method signature).
I'd like to go ahead and approve this, if @mbjones and @amoeba concur. If that be the case, then i'd like to make the following changes with or after the merge:
test_MNode.R
Also, although it would be clearer to the user to fetch the sysmeta and use sysmeta$fileName
(if it is defined) to determine the resulting output filename, this is not consistent with the DataONE MN API, so should not be used.
Thanks @gothub. So the change was just to add a path
arg that saves the fetched Object to the path? Sounds good to me. Is this also going to work for CNodes and /resolve?
@amoeba yeah, good point - the CNode getObject
method needs to be updated as well. The resolve
method just returns a data.frame, so this shouldn't be affected.
Sounds good to me!
To facilitate efficient and reliable data management, it would be very helpful if getObject would allow users to download the object to a given folder path. This is critical for files like excel files in which
rawToChar
commands fail in R effectively leaving users without an efficient way of accessing such files directly from a DataONE node.