DataONEorg / rdataone

R package for reading and writing data at DataONE data repositories
http://doi.org/10.5063/F1M61H5X

Method to review package without downloading it #253

Open gothub opened 4 years ago

gothub commented 4 years ago

@jeanetteclark please review

It should be easy to review the contents of a package without downloading it. This review should include information about the pids of the package, the package owner (submitter, rightsholder) and package permissions.

This function may just query the Solr store for a package to get this info, or may work in conjunction with datapack; for an example, see this issue.
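A minimal sketch of the Solr-based option, for illustration only. `build_package_review_query` is a hypothetical helper, not an rdataone function, and the field names (`resourceMap`, `submitter`, `rightsHolder`, and the three permission fields) are assumed to match the DataONE Solr index. It only builds the query parameters, so it runs without a network connection.

```r
# Hypothetical helper (not part of rdataone): build Solr parameters that list
# every member of a package along with its owner and permission fields.
build_package_review_query <- function(resmap_id) {
  stopifnot(is.character(resmap_id), length(resmap_id) == 1, nzchar(resmap_id))
  # Keep the sketch simple: refuse PIDs containing a double quote rather than
  # escaping them inside the quoted Solr phrase.
  stopifnot(!grepl('"', resmap_id, fixed = TRUE))
  list(
    # Members of a package are the documents whose resourceMap field holds its PID
    q = sprintf('resourceMap:"%s"', resmap_id),
    # Field names assumed from the DataONE index schema
    fl = "id,formatType,submitter,rightsHolder,readPermission,writePermission,changePermission",
    rows = "1000"
  )
}

params <- build_package_review_query("resource_map_urn:uuid:1234")

# With a live connection, the parameters could be handed to the index, e.g.
# (assuming the usual rdataone query() interface):
# members <- dataone::query(dataone::CNode("PROD"), solrQuery = params, as = "data.frame")
```

Note that this approach inherits the limitation of any index-based method: a package that failed indexing returns no rows.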

gothub commented 4 years ago

@jeanetteclark could you provide use cases and details of how you would use this info? This would help in determining the best implementation.

amoeba commented 3 years ago

We ran into a problem on PISCO that kinda relates to this issue. A resource map was created where one of the PIDs had a typo in it, effectively referencing a PID that doesn't exist and never will, so indexing failed. The quick way to fix this right now is to use arcticdatautils::update_resource_map, but it'd be really nice if dataone could do this. The workflow might look like this:

pkg <- getDataPackage(client, "package_id")
# Proposed usage: swap the mistyped member PID for the correct one
pkg <- replaceMember(pkg, "badMember", "goodMember")
uploadDataPackage(client, pkg)

This doesn't look like it would work right now, mostly because getDataPackage relies on the Solr index and packages broken like this won't be indexed.

jeanetteclark commented 3 years ago

Yeah, that is another good use case. I often need to look at the relationships that are inside the resource map directly (not what is in the index), and this is one of the reasons why.

mbjones commented 3 years ago

Agreed, reliance on the index is problematic. This seems like an easy fix in that calling getDataPackage with the package identifier should be able to grab it directly and parse it locally.

I think in general it would be good for us to download and parse the ORE file even when calling it with a metadata identifier. But is there a mechanism for determining the package identifier given only a metadata id if the ORE parsing has failed (i.e., can getDataPackage(client, "metadata_id") also find the package identifier and then download it for parsing if the ORE was not indexed)?

gothub commented 3 years ago

@mbjones currently you can specify a metadata id or a resource map id to download a package. If you specify the metadata id, getDataPackage() uses the Solr index to determine the resmap id. Not sure how to determine that without the index.

Switching to parsing the resmap locally should be straightforward to implement.

amoeba commented 3 years ago

Since it's impossible to know which resource map the user is talking about given only a metadata record, arcticdatautils actually produces a warning when its get_package function is called with a metadata id instead of the id of a resource map. It then goes on to guess what the user meant using the same logic MetacatUI uses to do the same task.

mbjones commented 3 years ago

> the same logic MetacatUI uses

And what is that logic? Does it depend on the index?

amoeba commented 3 years ago

Yep. IIRC the steps are like this:

mbjones commented 3 years ago

OK, well, that sounds like the same logic that is currently in the method. So, if we refactor to 1) retrieve the object and see if it is a package ORE, and if so, parse it and use it to load all of the objects, and 2) if not, fall back to using the index to look up the ORE pid, we should have parity in the methods. I think the key is that we should prioritize populating the package from the ORE file directly over the index. Would that solve this issue?
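The two-step lookup proposed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the rdataone implementation: `resolve_resource_map` and both callback arguments are hypothetical names, and the repository and index are passed in as functions so the example runs with stubs. The ORE format identifier is the one DataONE uses for resource maps.

```r
ORE_FORMAT_ID <- "http://www.openarchives.org/ore/terms"

# Hypothetical helper: decide which resource map to load for `pid`.
#   get_format_id(pid)         returns the object's formatId (from system metadata)
#   find_resmaps_in_index(pid) returns the resource map PIDs the index lists for
#                              a metadata PID, as a character vector
resolve_resource_map <- function(pid, get_format_id, find_resmaps_in_index) {
  # Step 1: the PID is itself an ORE resource map, so no index lookup is needed
  if (identical(get_format_id(pid), ORE_FORMAT_ID)) {
    return(list(resmap_id = pid, source = "ore"))
  }
  # Step 2: otherwise fall back to the index to find the ORE pid
  candidates <- find_resmaps_in_index(pid)
  if (length(candidates) == 0) {
    stop("No resource map found in the index for '", pid, "'")
  }
  if (length(candidates) > 1) {
    warning("Multiple resource maps reference '", pid, "'; using the first")
  }
  list(resmap_id = candidates[[1]], source = "index")
}

# Stubs standing in for the repository and the Solr index
formats <- c(resmap_1 = ORE_FORMAT_ID,
             metadata_1 = "eml://ecoinformatics.org/eml-2.1.1")
stub_format <- function(pid) formats[[pid]]
stub_index <- function(pid) if (pid == "metadata_1") "resmap_1" else character(0)

direct <- resolve_resource_map("resmap_1", stub_format, stub_index)
fallback <- resolve_resource_map("metadata_1", stub_format, stub_index)
```

In a real client, `get_format_id` might wrap something like `getSystemMetadata(node, pid)@formatId`, and the returned resource map would then be downloaded and parsed locally whichever branch produced it, so the package contents always come from the ORE file rather than the index.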