DataONEorg / rdataone

R package for reading and writing data at DataONE data repositories
http://doi.org/10.5063/F1M61H5X

Allow download of object if specify path in getObject #208

Closed maier-m closed 6 years ago

maier-m commented 6 years ago

To facilitate efficient and reliable data management, it would be very helpful if getObject would allow users to download the object to a given folder path. This is critical for files like excel files in which rawToChar commands fail in R effectively leaving users without an efficient way of accessing such files directly from a DataONE node.

amoeba commented 6 years ago

Hey @maier-m, I feel similarly to @mbjones in terms of fit, since the dataone package was meant to be a 1:1 mapping to the DataONE API. But you're making a fair point that it's confusing to get an Object from DataONE onto your computer.

To address your usability concern, would it be helpful to know that rawToChar(getObject(...)) only works in some cases, and that writeBin(getObject(...), save_path) is a general-purpose, object-format-independent way to save a DataONE Object to disk?
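For example (a hypothetical sketch; `mn` and `pid` stand for a member node and identifier you already have):

```r
library(dataone)

# Fetch the object's bytes as a raw vector, then write them to disk as-is.
# This works regardless of format (Excel, CSV, images, ...), unlike
# rawToChar(), which only makes sense for text content.
raw_bytes <- getObject(mn, pid)
writeBin(raw_bytes, "my_file.xlsx")
```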

If we implemented this ability as part of the https://github.com/ropensci/datapack package, do you think this would be a good API?

my_obj <- DataObject(node, pid)
writeObject(my_obj, my_path)

or something similar?

maier-m commented 6 years ago

Yes @amoeba. I considered this, but I think writeBin fails if data files get too large.

cn <- CNode('PROD')
mn <- getMNode(cn,'urn:node:ARCTIC')
x <- dataone::getObject(mn, "urn:uuid:39968282-a538-4903-9f31-8c42f0f5ea9e")
path <- ""
writeBin(x, path)

We can use httr for public files, so altogether there are good approaches, but the idea here was an approach that would work for all files. That said, I do agree your proposed implementation is much cleaner; maybe there is a workaround for larger files.

amoeba commented 6 years ago

I think writeBin fails if data files get too large

Hrm, that's either a bug in Metacat/DataONE or in R. What's the failure code?

We can use httr for public files

You can also use httr for private files. In fact, that's how the dataone package works.

library(httr)
r <- GET("https://arcticdata.io/metacat/d1/mn/v2/object/urn:uuid:39968282-a538-4903-9f31-8c42f0f5ea9e",
         add_headers(Authorization = paste("Bearer", your_token)))

I'm testing the writeBin method on your large file just to see.

Stepping back, what do you think about @mbjones' suggestion about the datapack DataObject class and implementing this there?

maier-m commented 6 years ago

Great point about httr. We could specify tokens directly. It's nice to use the dataone package, which handles that for you automatically, but it's not strictly necessary, I suppose.

amoeba commented 6 years ago

For sure.

I see your error:

writeBin(dataone::getObject(mn, "urn:uuid:39968282-a538-4903-9f31-8c42f0f5ea9e"), out)
Error in writeBin(dataone::getObject(mn, "urn:uuid:39968282-a538-4903-9f31-8c42f0f5ea9e"),  :
  long vectors not supported yet: ../../../../R-3.4.4/src/main/connections.c:4153

Looks like there's a 2GB limit to writeBin.
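One possible workaround (an untested sketch, not part of dataone): open a file connection and write the raw vector in slices, so no single writeBin() call hits the long-vector limit. Note the whole object still has to fit in memory, which true streaming would avoid.

```r
# Hypothetical helper: write a (possibly very large) raw vector to disk in
# chunks, so each individual writeBin() call stays well under the 2 GB limit.
write_raw_chunked <- function(x, path, chunk_size = 1e8) {
  con <- file(path, "wb")
  on.exit(close(con))
  start <- 1
  n <- length(x)
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    writeBin(x[start:end], con)
    start <- end + 1
  }
  invisible(path)
}
```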

What do you think about the proposed DataObject way?

gothub commented 6 years ago

For rdataone, the D1Client class has methods that are built on the DataONE API methods. This class has the getDataObject() method that downloads an object from DataONE and constructs a DataObject, so you could have:

d1c <- D1Client("PROD", "urn:node:KNB")
obj <- getDataObject(d1c, "mypid")
writeData(obj, "mydata.csv")

The DataObject that is downloaded has the system metadata attached, so the mediaType and fileName could be used, if necessary, in determining how to write out the file.

This could be shortened to

writeData(d1c, "mypid", "mydata.csv")

This method would download the object and write it out to disk, using the system metadata if needed.
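A rough sketch of what that could look like (hypothetical, not part of the released API; assumes datapack's getData() accessor and the sysmeta slot on DataObject):

```r
# Hypothetical writeData() built on D1Client's getDataObject().
writeData <- function(d1c, pid, path = NULL) {
  obj <- getDataObject(d1c, pid)
  # Fall back to the fileName recorded in the system metadata, if any
  if (is.null(path)) path <- obj@sysmeta@fileName
  writeBin(getData(obj), path)
  invisible(path)
}
```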

maier-m commented 6 years ago
gothub commented 6 years ago

If a new function is added to dataone or datapack, it would determine the best way to write out the file. I'm not sure what the alternative to writeBin would be, but the new function could use another method to write out large files.

maier-m commented 6 years ago

I may be mistaken, but it seems like running

httr::GET(url,
          httr::add_headers(Authorization = paste("Bearer", token)),
          httr::write_disk(path))

is the best option for reliably downloading an object from DataONE (instead of fetching the raw bytes and then writing them out, which R struggles with for large files).

Does it make sense to have a function downloadData(node, pid, path)? I think this is helpful, but only if it works for all data objects.

gothub commented 6 years ago

@maier-m downloadObject(node, pid, path) looks fine to me
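A minimal sketch of that function, built on httr::write_disk() so the response streams straight to disk (the baseURL slot and the v2 endpoint path are assumptions here, not confirmed API details):

```r
# Hypothetical downloadObject(): stream a DataONE object to disk without
# ever holding the whole file in memory.
downloadObject <- function(node, pid, path, token = NULL) {
  url <- paste0(node@baseURL, "/v2/object/", URLencode(pid, reserved = TRUE))
  auth <- if (!is.null(token)) {
    httr::add_headers(Authorization = paste("Bearer", token))
  } else {
    NULL
  }
  resp <- httr::GET(url, auth, httr::write_disk(path, overwrite = TRUE))
  httr::stop_for_status(resp)
  invisible(path)
}
```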

mbjones commented 6 years ago

Regardless of how this is implemented, we still haven't answered the question of which package this function belongs in. I think that is fundamental to what it does and what its parameters are. For example, in datapack, I would expect downloadObject(pid, path) would be more appropriate, because datapack doesn't know about nodes. Within the dataone package, we already have D1Client::getDataObject(d1client, pid), which I think could be overloaded to D1Client::getDataObject(d1client, pid, filepath). This has the advantage of keeping the system metadata bundled, and can use the DataObject mechanism for storing the file locally in the filesystem rather than in memory. This would be accomplished by downloading the file to disk locally, and then setting the filename parameter to the DataObject constructor when calling new().

maier-m commented 6 years ago

I agree with your points, Matt, but maybe I am still confused by the architecture. I believe getDataObject calls getObject to get the data, so I would still have to add a filepath variable to getObject as I did before, or create a much more complex workaround.

I think this function belongs in dataone as the dataone package description is

The R package dataone provides read/write access to data and metadata

as opposed to

The datapack R package provides an abstraction for collating heterogeneous collections of data objects and metadata into a bundle that can be transported and loaded into a single composite file.

for datapack.

mbjones commented 6 years ago

@maier-m I agree with your logic that this belongs in dataone, but still would rather not see it as part of getObject's signature because of the API changes. It could however be invisibly implemented there and that would be fine. Overriding D1Client::getDataObject to include an optional path seems better to me.

Regarding saving files using their PIDs as filenames -- we do that in some apps, but you'll need to mangle the PID to use it as a filename, because some PIDs contain characters that are not legal in names (such as '/' and ':' on some OSes). We have some PID-to-filename functions in some of our other code for reference.
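A minimal example of such mangling (an illustrative sketch, not the functions referenced above):

```r
# Replace characters that are illegal in filenames on common OSes
# (e.g. '/', ':', '\') with underscores so a PID can be used directly.
pid_to_filename <- function(pid) {
  gsub("[/\\\\:*?\"<>|]", "_", pid)
}

pid_to_filename("urn:uuid:39968282-a538-4903-9f31-8c42f0f5ea9e")
# "urn_uuid_39968282-a538-4903-9f31-8c42f0f5ea9e"
```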

maier-m commented 6 years ago

@mbjones, can you explain what you mean by invisibly implemented?

I understand the desire to use D1Client::getDataObject but am confused by:

  1. The fact that D1Client::getDataObject calls getObject, so I think I would still need the path argument in getObject unless I significantly rewrote D1Client::getDataObject.
  2. Is the goal to download a full DataObject? My goal was to just download the data file alone, without any metadata.
gothub commented 6 years ago

The updated method definition, getObject", signature("MNode"), function(x, pid, check=as.logical(FALSE), path = NULL), passes tests on macOS and Windows 7 (which exercise the old method signature).

I'd like to go ahead and approve this, if @mbjones and @amoeba concur. If so, I'd like to make the following changes with or after the merge:

Also, although it would be clearer to the user to fetch the sysmeta and use sysmeta$fileName (if it is defined) to determine the resulting output filename, this is not consistent with the DataONE MN API, so should not be used.

amoeba commented 6 years ago

Thanks @gothub. So the change was just to add a path arg that saves the fetched Object to the path? Sounds good to me. Is this also going to work for CNodes and /resolve?

gothub commented 6 years ago

@amoeba yeah, good point - the CNode getObject method needs to be updated as well. The resolve method just returns a data.frame, so this shouldn't be affected.

amoeba commented 6 years ago

Sounds good to me!