NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0
26 stars 12 forks source link

Add `updateIdMetadata` method to MNAdmin API #1766

Closed taojing2002 closed 7 months ago

taojing2002 commented 8 months ago

We are removing the Metacat replication servlet, which has the methods to update the Datacite metadata documents for DOIs. So we need a new method in MNAdmin API to allow operator to do the same tasks.

Here is the proposed method:

updateDatacite(session, option, pid[], formatId[])

GET /updateDatacite?option={all | pids | formatIds} [&pid={pid}] [&formatId={formatId}]

This method triggers an updating task to ensure the all or partial (specified by either a list of pids or format ids) DOI objects' metadata is up-to-date. If the caller specify option all, both pid and formatId parameters will be ignored. If the caller specify option pids, the formatId parameter will be ignored. For many implementations, this method would only queue an asynchronous updating task, and not wait for the task to complete.

Parameters

artntek commented 8 months ago

The proposed REST call (GET /updateDatacite?option=...) doesn't comply with RESTful design principles:

From these descriptions it sounds like PATCH might fit best, but my understanding is that it's not widely supported by browsers etc. In that case, maybe PUT?

In summary - I don't think GET is right, but I don't think there's a clear "right answer" from the other options - so am totally open to discussion.

Once that's decided, the REST call might look like this (Assuming we use PUT - just for an example)

PUT /datacite/batchUpdate?option={all | pids | formatIds} [&pid={pid}] [&formatId={formatId}]

Followup Question

Is there any reason we can't accept a mix of pids and formatIds in the same REST call? Could we allow the caller to mix them, instead of having to define ?option=?

mbjones commented 8 months ago

I agree with much of what Matthew said. Let's discuss this before you implement.

In addition, REST is all about managing resources, rather than modeling remote procedure calls (RPC). So each API should be against a specific resource or collection of resources. Typically that is the objects resource collection in DataONE, but in this case, the resource of interest is the identifiers collection. In addition, DataCite is not the only identifier minting authority that we might use for registering identifier metadata -- we also use EZID and OSTI, and those may be extended to non-DOI types like ARKs in the future. So, a method that works on an identifiers collection and does a PUT to a specific identifier, and then based on that identifier knows which service to contact would be best. So, how about something like:

For a single identifier, the traditional REST structure would look like:

PUT /identifiers/{pid}

For the collection-wide operation, it might be more like:

PUT /identifiers?option={all | pids | formatIds} [&pid={pid}] [&formatId={formatId}]

Note there is no method like "batchUpdate" in a REST style API -- instead, we are modifying either a single resource or a whole collection. Let's discuss.

taojing2002 commented 8 months ago

In our today's backend developers' meeting, we revised the rest API:

PUT /identifiers/doi/?[all=true] | [&pid={pid1}] | [&formatId={formatId1}]

//examples
PUT /identifiers/doi/?all=true  //update everything
PUT /identifiers/doi/?pid={pid1}&pid={pid2}  //update pid1 and pid2
PUT /identifiers/doi/?formatId={formatId1}&formatId={formatId2} //update dois with formatId1 or formatId2
// process all elements in a mixed list:
PUT /identifiers/doi/?formatId={formatId1}&pid={pid2} 
// formatIds and pids ignored, since `all=true` updates everything anyway
PUT /identifiers/doi/?all=true&formatId={formatId}&pid={pid2}
PUT /identifiers/doi/?all=false&pid={pid2} //update pid2

Methods: updateIdMetadata(Session session, Enum type, String[] pids, String[] formatIds) Parameters: type a type of the identifier, e.g. DOI pid a repeatable list of identifiers of objects to be updated formatId a repeatable list of identifiers of object format. Any objects of the type with those format ids will be updated. Returns Boolean TRUE if the updating request is scheduled Return type: Types.Boolean Raises:

updateAllIdMetadata(Session session, Enum type) Parameters: type a type of the identifier, e.g. DOI Returns Boolean TRUE if the updating request is scheduled Return type: Types.Boolean Raises:

artntek commented 8 months ago

LGTM!

mbjones commented 8 months ago

Overall looks good, but I think you can omit the DOI part of the rest url. The service should be able to tell what kind of identifier it is.

taojing2002 commented 8 months ago

@mbjones @artntek If we don't have DOI part of the rest url, how we interpret those two requests:

PUT /identifiers/?all=true
PUT /identifiers/doi/?formatId=formatId1

Does the first one mean we update metadata for all identifiers of all supported types (e.g. doi, ark)? Does the second one mean we update metadata for all identifiers of all supported types which have the formatId1 format?

mbjones commented 7 months ago

Earlier, I proposed:

For a single identifier, the service will update that one identifier's metadata based on its type (e.g., DOIs go one place and arks go another):

PUT /identifiers/{pid}

For the collection-wide operation, we can update multiple at once, and for each PID, the right location to update is determined by the identifier type:

PUT /identifiers?option={all | pids | formatIds} [&pid={pid}] [&formatId={formatId}]

So, under that scenario, your two calls are below in the proposed REST syntax:

PUT /identifiers/?option=all
PUT /identifiers/?option=formatId&formatId=formatId1

Let's discuss on slack or in our meeting and summarize here when we reach a conclusion.

taojing2002 commented 7 months ago

Final Version:

Rest Calls:

PUT /identifiers/{pid1}  
PUT /identifiers[/]?[all=true] | [&pid={pid1}] | [&formatId={formatId1}]

PUT /identifiers/{pid1}     //update pid1
PUT /identifiers/?all=true  //update everything
PUT /identifiers/?pid={pid1}&pid={pid2}  //update pid1 and pid2
PUT /identifiers/?formatId={formatId1}&formatId={formatId2} //update dois with formatId1 or formatId2

PUT /identifiers?formatId={formatId1}&pid={pid2} //process all elements in a mixed list
PUT /identifiers?all=true&formatId={formatId}&pid={pid2} //formatIds and pids ignored, since `all=true` updates everything anyway
PUT /identifiers?all=false&pid={pid2} //update pid2

Two Methods: updateIdMetadata(Session session, String[] pids, String[] formatIds) Parameters: pid a repeatable list of identifiers of objects to be updated formatId a repeatable list of identifiers of object format. Any objects of the type with those format ids will be updated. Returns Boolean TRUE if the updating request is scheduled Return type: Types.Boolean Raises:

updateAllIdMetadata(Session session) Parameters: Returns Boolean TRUE if the updating request is scheduled Return type: Types.Boolean Raises:

artntek commented 7 months ago

need to add this to admin guide documentation too, since it's not part of D1 API docs