NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0
27 stars 13 forks source link

Add `reindex` method to MNAdmin API #1716

Closed taojing2002 closed 8 months ago

taojing2002 commented 12 months ago

Now, we disabled the reindex and reindexall methods on the olde Metacat API. They are very helpful. I am planning to add them into the Metacat Admin API. They are still only called by the Metacat administrators.

mbjones commented 11 months ago

This is good, Jing. I suggest it should be a single reindex method that is configurable to index a list of PIDs or all PIDs, and that we should put this in a new MNAdmin API service that we implement just for Metacat for now. Here's a proposed set of docs for a MNAdmin.reindex() method. Feedback and changes welcomed.

[Admin API]()

Administer repository deployments.

The admin API methods are used to administer a repository deployment, and the API MUST be restricted to authenticated subjects with administrative permissions. While the API overall may have utility across the DataONE network, the implementation of these methods will vary across different software implementations of the DataONE API, and some implementations may not need all (or any) of these facilities. The Admin service and all of its API methods are therefore optional and can be omitted from implementation, or may return NotImplemented exceptions as appropriate.

reindex(session, pid, all=FALSE)

GET /reindex[?pid={pid}, all=FALSE]]

For repositories that uses a separate index for metadata search and discovery, this method triggers a reindexing task to ensure the index is up-to-date, either for one or more pid values passed in as parameters, or for all PIDs on the node if all=TRUE. If the caller does not provide a list of PIDs to be processed and does not set all=TRUE, then the method should do nothing, to prevent accidental requests to reindex the entire corpus. For many implementations, this method would only queue an asynchronous indexing task, and not wait for the task to complete.

Parameters

taojing2002 commented 11 months ago

Looks good! I like the idea to use an explicit parameter to indicate if Metacat indexes all pids. This can avoid accidental time-consuming processes.

I have two things needed to be clarified:

  1. The command for multiple pids. Does it looks like:
    GET /reindex[?pid={pid}, pid={pid1}, pid={pid2}]
  2. What should we consider this command?
    GET /reindex[?pid={pid}, all=TRUE]
mbjones commented 11 months ago

Yeah, I thought about your point 1 a bit. Seems like there are two options:

For the other question, I think all=TRUE should always indicate that you reindex all pids, and the pid parameter is irrelvant as a subset of the whole.

taojing2002 commented 11 months ago

I think the first form is better since some clients, e.g. ess-dive folks, have scripts based on this form. So it can minimize their changes if we adopt the first form.

taojing2002 commented 11 months ago

I added a NotAuthorized exception for this mn method.

artntek commented 10 months ago

see PR #1738

taojing2002 commented 9 months ago

Based on our discussion about this ticket, we revised the rest api call:

PUT /query/solr/index?[all=true] | [&pid={pid}]

PUT /query/solr/index?all=true //Reindex everything
PUT /query/solr/index?all=true&pid={pid1}  //pid1 will be ignored. Reindex everything
PUT /query/solr/index?pid={pid1}&pid={pid2} //Reindex pid1 and pid2
PUT /query/solr/index?all=false&pid={pid1}&pid={pid2} //Reindex pid1 and pid2

Methods: reindex(session, pids[]) Parameters pids a repeatable list of identifiers of objects to be reindexed Returns Boolean TRUE if the reindex request is scheduled Return type: Types.Boolean Raises:

reindexAll(session) Returns Boolean TRUE if the reindexall request is scheduled Return type: Types.Boolean Raises:

artntek commented 9 months ago

Important to note we are treating index as a noun/resource here - it is not a verb

mbjones commented 9 months ago

Even as a noun, it then treats the index as a subcollection of the SOLR query engine, which already tries to pass its contents off as a SOLR query string. Is there a potential conflict there?

Also, do we want reindex to be scoped only to SOLR? What if we switch from SOLR to elasticsearch in the future? Reinfecting would still be needed. Can we separate the API from the implementation? I'd prefer that.

artntek commented 9 months ago

Good points. Would this simpler form work, in that case?

PUT /index?[all=true] | [&pid={pid}]
mbjones commented 8 months ago

I think I like that better. It treats the index as a resource (which it is), and makes the service implementation independent. Let's discuss during tomorrow's backend call to finalize.

taojing2002 commented 8 months ago

Final version:

Rest Calls:

PUT /index/{pid1}
PUT /index[/]?[all=true] | [&pid={pid}]

PUT /index?all=true //Reindex everything
PUT /index?all=true&pid={pid1}  //pid1 will be ignored. Reindex everything
PUT /index/?pid={pid1}&pid={pid2} //Reindex pid1 and pid2
PUT /index/?all=false&pid={pid1}&pid={pid2} //Reindex pid1 and pid2

Two Methods: reindex(session, pids[]) Parameters: pids a repeatable list of identifiers of objects to be reindexed Returns Boolean TRUE if the reindex request is scheduled Return type: Types.Boolean Raises:

reindexAll(session) Returns Boolean TRUE if the reindexall request is scheduled Return type: Types.Boolean Raises:

artntek commented 8 months ago

Did we decide not to support the following example? Or should we allow it as valid?

PUT /index/{pid1}
taojing2002 commented 8 months ago

We didn't mention to support this example.

Did we decide not to support the following example? Or should we allow it as valid?

PUT /index/{pid1}
mbjones commented 8 months ago

I requested that we support the standard REST syntax. We discussed that on the call today @taojing2002

PUT /index/{pid1}
taojing2002 commented 8 months ago

I just remembered we discussed the issue on /identifier/{pid}. Okay, I just added it.