DataONEorg / rdataone

R package for reading and writing data at DataONE data repositories
http://doi.org/10.5063/F1M61H5X

Improve `getDataPackage` performance #283

Open · gothub opened this issue 2 years ago

gothub commented 2 years ago

Obtaining a local copy of a package with getDataPackage can be very slow for packages with many members.

To improve performance:

A new, separate function may be needed if getDataPackage can't be updated to provide the required functionality.

@jeanetteclark @amoeba comments?

amoeba commented 2 years ago

Thanks for writing this up @gothub.

It feels like getDataPackage, even with lazyLoad=TRUE, just does too much work right out of the gate. Even for a small (~10 object) package, getDataPackage is slower than I'd prefer, taking about 15-20 seconds on my machine to complete.

It looks like most of the time is spent in getDataObject. Even with lazyLoad=TRUE, getDataPackage calls getDataObject on each member, which subsequently calls resolve, listNodes, getMNode, and getSystemMetadata. DataONE's flexible DataPackage model and federated nature make some of this work hard to avoid, and the function calls I listed are part of dealing with that complexity.
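
For anyone who wants to check where that time goes themselves, a quick profiling run along these lines works; the package identifier below is just a placeholder:

library(dataone)

d1c <- D1Client("PROD", "urn:node:KNB")
# Profile a lazy getDataPackage call and summarize which functions dominate.
Rprof("getDataPackage.prof")
pkg <- getDataPackage(d1c, identifier = "urn:uuid:EXAMPLE-PACKAGE-PID", lazyLoad = TRUE, quiet = TRUE)
Rprof(NULL)
summaryRprof("getDataPackage.prof")$by.total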

Could we find a way to be even lazier here, not calling getDataObject immediately for each member and only making that call when we have to? Some of the common use cases for getDataPackage might be:

What's the use case for eagerly fetching all of the information in the package? The above example use cases don't really need all the info right away.

jeanetteclark commented 2 years ago

Thanks for writing this up @gothub, and for adding your comments @amoeba. The use cases Bryce outlined are consistent with what we routinely do. The first one in the list is what I was working on today. I agree it doesn't seem like all of that information needs to be grabbed right away.

mbjones commented 2 years ago

Those seem like good use cases. Also, at least at one point we were doing some reasonable amount of caching, so calls to things like listNodes() didn't actually make REST calls over the net -- they just used a cached result from a previous invocation. When we rewrote the package to be independent of the Java classes, some of that caching may not have been reimplemented. Generally speaking, node lists only need to be retrieved once per session, and MN and CN instances could easily be cached as well. That would make it easier for clients, so they don't have to pay attention to whether calling getMNode 100 times for 100 files is efficient or not. getMNode returns a handle for the node, and it's rare for that node metadata to change over multiple months. So, in addition to lazy loading, effective caching can make a lot of this much faster too.

gothub commented 2 years ago

@mbjones @amoeba These are all great points.

The initial design was to create an accurate local representation of a DataPackage that would support known and unanticipated use cases. Loading the resmap, the sysmeta for all members, and either the data or the dataUrl for each member supports that. If we decide to make changes, such as not loading sysmeta for members, we should confirm that potential use cases aren't excluded by those changes.

Regarding caching info from D1 calls like listNodes, where would you recommend this info be cached?

mbjones commented 2 years ago

Cached data should follow R's conventions. Existing packages for caching like memoise will probably solve most of our use cases. A great overview is: https://blog.r-hub.io/2021/07/30/cache/
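
As a minimal sketch of what that could look like for the node list (the cached wrapper below is hypothetical; rdataone doesn't do this today):

library(dataone)
library(memoise)

# Hypothetical wrapper: memoise listNodes so repeated calls within a session
# reuse the first result instead of hitting the CN REST endpoint again.
listNodesCached <- memoise(listNodes)

cn <- CNode("PROD")
nodes <- listNodesCached(cn)   # first call goes over the network
nodes <- listNodesCached(cn)   # later calls return the cached result

memoise can also take a cachem cache with a max_age, which would let cached node info expire within a long session if staleness is a concern.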

gothub commented 2 years ago

Can it safely be assumed that all objects in a data package are located on the MN that is referenced in the D1Client object, for example, KNB in this snippet:

d1cKNB <- D1Client("PROD", "urn:node:KNB")
pkg <- getDataPackage(d1cKNB, identifier=pkgId, lazyLoad=TRUE, quiet=TRUE, checksumAlgorithm=sha256)

mbjones commented 2 years ago

I don't think you can assume that all package members are in the same repository, even though that will likely be the case 99.9% of the time. That said, we should also be able to work around that with an algorithm like this:

  1. Assume all package members are on the repository, and try to download from that repository when needed
  2. Upon failure to get an object, call cn.resolve to get the location information and possibly the sysmeta
  3. Proceed to try to download from the authoritative location, then alternative locations

This should work with a single REST call for the vast majority of objects, and only call resolve when needed for exceptional circumstances. Do you think this could help, @gothub?
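
A rough sketch of that flow (downloadMemberBytes is an invented helper, and the @mn/@cn slot access and the shape of resolve()'s return value are my assumptions, not the eventual patch):

library(dataone)

# Hypothetical helper: try the client's member node first, and only fall back
# to CN resolve() when the direct download fails.
downloadMemberBytes <- function(d1c, pid) {
  tryCatch(
    # 1. assume the member lives on the repository the client points at
    getObject(d1c@mn, pid),
    error = function(e) {
      # 2. on failure, ask the CN where the object actually lives
      locations <- resolve(d1c@cn, pid)$data$nodeIdentifier
      for (nodeId in locations) {
        mn <- getMNode(d1c@cn, nodeId)
        bytes <- try(getObject(mn, pid), silent = TRUE)
        # 3. return from the first location that succeeds
        if (!inherits(bytes, "try-error")) return(bytes)
      }
      stop(e)
    }
  )
}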

gothub commented 2 years ago

Yes, that sounds good, I'll implement that approach.

gothub commented 2 years ago

The getDataObject() method was updated as described above.

The following test was run on a 400 member test package both before and after the update:

library(dataone)
d1c <- D1Client("STAGING", "urn:node:mnTestARCTIC")
ptm <- proc.time()
pkg <- getDataPackage(d1c, identifier="urn:uuid:b77807fc-7fc5-418d-a50e-75e9d268cdce", lazyLoad=TRUE, quiet=FALSE)
proc.time() - ptm

For the test performed before the update, the output of proc.time() - ptm was:

 user  system elapsed
 431.279 103.031 738.021     <<< this is equivalent to 12.3 minutes

This is the time after the update:

    user  system elapsed
 146.191   4.579 195.478      <<< this is equivalent to 3.25 minutes

jeanetteclark commented 2 years ago

A dramatic improvement, thank you Peter! I think there is still some room for performance improvement, but it will take more thought, so we can keep this issue open for now; perhaps I'll open a new issue with a more concrete idea later down the line.

gothub commented 2 years ago

Good point; we can continue to use this issue to track the performance of downloading/uploading packages, etc.