Editing dataset before resource map indexes

NCEAS / metacatui

MetacatUI: A client-side web interface for DataONE data repositories

https://nceas.github.io/metacatui

Apache License 2.0


Editing dataset before resource map indexes #610

Open maier-m opened 6 years ago

maier-m commented 6 years ago

It seems possible to edit a dataset before its resource map is indexed. None of the data show up during editing, but it is possible to push a new submission, which will then not have any data in the resource map. See https://test.arcticdata.io/#view/urn:uuid:4d5fb781-401b-4614-a815-988db6d6ec1d and the subsequent version of the dataset, which was updated before the resource map was indexed.

csjx commented 6 years ago

Thanks @maier-m, this is an interesting issue. The bottom line is that we need to process index tasks faster to avoid this situation, given that our calls are stateless. I can't think of another workaround, other than changing the index queue to be keyed off of date_received (not pid) and processing the index tasks in the order they are received. But even that seems super fragile, since the indexer would need to be aware of the obsolescence chain. We'll need to discuss this further.
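
To make the ordering idea concrete, here is a minimal sketch of a queue keyed on date_received rather than pid. It is not the Metacat indexer's actual queue: `IndexTask` and `drain_in_received_order` are hypothetical names, and the sketch does not address the obsolescence-chain concern raised above.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class IndexTask(NamedTuple):
    """Hypothetical index task; field order makes date_received the sort key."""
    date_received: datetime  # when the repository received the object
    pid: str                 # identifier of the object to index


def drain_in_received_order(tasks: Iterable[IndexTask]) -> Iterator[str]:
    """Yield PIDs oldest-first, regardless of the order tasks were queued."""
    heap = list(tasks)
    heapq.heapify(heap)  # tuples compare on date_received first, then pid
    while heap:
        yield heapq.heappop(heap).pid
```

With this ordering, a resource map received before an edit request would be indexed before anything received after it, but only if the queue is drained quickly enough to matter.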

amoeba commented 6 years ago

Another way around this is to switch the entry point into the Editor from a metadata PID to a resource map PID. We always know whether a given PID is a resource map and whether it's indexed. When the entrypoint is a metadata PID, it's not clear what the user wants to edit.

laurenwalker commented 6 years ago

MetacatUI is very metadata-centric rather than resource map-centric, so it may get tricky switching the editor pid to resource map pids.

I wish there were a way to get the resource map id for a given object other than through the Solr index. Even if we improve the indexing performance, a delay of only a second or two is enough for this bug to happen.

Currently, the editor will create a new resource map if one does not exist (or if the editor cannot find the resource map because it's not indexed). We could disable this feature for any resourcemap-less EML document created in the last X hours, since we can assume that the resource map just hasn't been indexed yet.

To make this more stable, we could develop a new API call to retrieve the number of objects in the index queue, so if there are no objects in the queue, we can enable editing.
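
The two ideas above could be combined roughly as follows. This is only a sketch under stated assumptions: `may_create_resource_map` is a hypothetical helper, `grace_period` stands in for the "X hours" above, and `index_queue_size` would come from the proposed API call, which does not exist yet.

```python
from datetime import datetime, timedelta
from typing import Optional


def may_create_resource_map(
    date_uploaded: datetime,
    now: datetime,
    grace_period: timedelta,
    index_queue_size: Optional[int] = None,
) -> bool:
    """Decide whether the editor may create a new resource map for a
    metadata document whose resource map was not found in the index."""
    # An empty queue (as reported by the hypothetical API call) means
    # nothing is waiting to be indexed, so the resource map is truly missing.
    if index_queue_size == 0:
        return True
    # Otherwise fall back to age: a recently uploaded document may simply
    # have a resource map that has not been indexed yet.
    return now - date_uploaded >= grace_period
```

When the queue size is unknown (`None`) or non-zero, the decision rests on the document's age alone, so the editor would still work without the new API call.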

amoeba commented 6 years ago

> MetacatUI is very metadata-centric rather than resource map-centric, so it may get tricky switching the editor pid to resource map pids.

For sure.

> Currently, the editor will create a new resource map if one does not exist

I feel like if we can get the indexer an order of magnitude faster this reliance won't be so troublesome.

mbjones commented 6 years ago

I agree. We mostly suffer from index backlogs, so Peter's work on parallelization should help us eventually with indexing.

However, this does raise the issue of whether there is a way to find the resource map without the index, as it generally exists on the MN in the Metacat object store and has a system metadata entry, but just isn't in Solr yet. Let's ponder whether there is another path to getting the identifier.

amoeba commented 6 years ago

> However, this does raise the issue of whether there is a way to find the resource map without the index

Isn't this impossible, given only a metadata PID? A given metadata object can be a member of multiple packages and can also be a member of multiple packages that aren't in the same version chain (separate packages).

Our tools make a guess using a pretty simple heuristic but it's only correct when the index is up to date and the metadata object's package membership is simple (only exists in one version chain).
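
To illustrate the ambiguity, here is a rough sketch of what a "simple case only" guess might look like. It is not the actual heuristic our tools use: `chain_head` and `guess_resource_map` are hypothetical names, and both inputs (the resource maps containing the metadata PID, and their obsoletedBy links) are assumed to come from an up-to-date index.

```python
from typing import Dict, Iterable, Optional


def chain_head(pid: str, obsoleted_by: Dict[str, Optional[str]]) -> str:
    """Follow obsoletedBy links from pid to the newest version in its chain."""
    seen = set()
    while obsoleted_by.get(pid) and pid not in seen:
        seen.add(pid)  # guard against a malformed, cyclic chain
        pid = obsoleted_by[pid]
    return pid


def guess_resource_map(
    containing_maps: Iterable[str],
    obsoleted_by: Dict[str, Optional[str]],
) -> Optional[str]:
    """Return the newest resource map only if every resource map containing
    the metadata object belongs to a single version chain; otherwise None."""
    heads = {chain_head(pid, obsoleted_by) for pid in containing_maps}
    return heads.pop() if len(heads) == 1 else None
```

A `None` result covers both failure modes: the index returned no resource map at all, or the metadata object sits in separate packages and there is no way to tell which one the user means.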

laurenwalker commented 5 years ago

Related to #845