Closed taojing2002 closed 8 months ago
This is good, Jing. I suggest it should be a single reindex
method that is configurable to index a list of PIDs or all PIDs, and that we should put this in a new MNAdmin
API service that we implement just for Metacat for now. Here's a proposed set of docs for a MNAdmin.reindex()
method. Feedback and changes welcomed.
Administer repository deployments.
The admin API methods are used to administer a repository deployment, and the API MUST be restricted to authenticated subjects with administrative permissions. While the API overall may have utility across the DataONE network, the implementation of these methods will vary across different software implementations of the DataONE API, and some implementations may not need all (or any) of these facilities. The Admin service and all of its API methods are therefore optional and can be omitted from implementation, or may return NotImplemented
exceptions as appropriate.
reindex(session, pid, all=FALSE)
GET /reindex[?pid={pid}, all=FALSE]]
For repositories that uses a separate index for metadata search and discovery, this method triggers a reindexing task to ensure the index is up-to-date, either for one or more pid
values passed in as parameters, or for all PIDs on the node if all=TRUE
. If the caller does not provide a list of PIDs to be processed and does not set all=TRUE
, then the method should do nothing, to prevent accidental requests to reindex the entire corpus. For many implementations, this method would only queue an asynchronous indexing task, and not wait for the task to complete.
TRUE
will cause all objects in the system to be reindexed ( a potentially very expensive operation)Looks good! I like the idea to use an explicit parameter to indicate if Metacat indexes all pids. This can avoid accidental time-consuming processes.
I have two things needed to be clarified:
GET /reindex[?pid={pid}, pid={pid1}, pid={pid2}]
GET /reindex[?pid={pid}, all=TRUE]
Yeah, I thought about your point 1 a bit. Seems like there are two options:
For the other question, I think all=TRUE
should always indicate that you reindex all pids, and the pid
parameter is irrelvant as a subset of the whole.
I think the first form is better since some clients, e.g. ess-dive folks, have scripts based on this form. So it can minimize their changes if we adopt the first form.
I added a NotAuthorized exception for this mn method.
see PR #1738
Based on our discussion about this ticket, we revised the rest api call:
PUT /query/solr/index?[all=true] | [&pid={pid}]
PUT /query/solr/index?all=true //Reindex everything
PUT /query/solr/index?all=true&pid={pid1} //pid1 will be ignored. Reindex everything
PUT /query/solr/index?pid={pid1}&pid={pid2} //Reindex pid1 and pid2
PUT /query/solr/index?all=false&pid={pid1}&pid={pid2} //Reindex pid1 and pid2
Methods: reindex(session, pids[]) Parameters pids a repeatable list of identifiers of objects to be reindexed Returns Boolean TRUE if the reindex request is scheduled Return type: Types.Boolean Raises:
reindexAll(session) Returns Boolean TRUE if the reindexall request is scheduled Return type: Types.Boolean Raises:
Important to note we are treating index
as a noun/resource here - it is not a verb
Even as a noun, it then treats the index as a subcollection of the SOLR query engine, which already tries to pass its contents off as a SOLR query string. Is there a potential conflict there?
Also, do we want reindex to be scoped only to SOLR? What if we switch from SOLR to elasticsearch in the future? Reinfecting would still be needed. Can we separate the API from the implementation? I'd prefer that.
Good points. Would this simpler form work, in that case?
PUT /index?[all=true] | [&pid={pid}]
I think I like that better. It treats the index as a resource (which it is), and makes the service implementation independent. Let's discuss during tomorrow's backend call to finalize.
Rest Calls:
PUT /index/{pid1}
PUT /index[/]?[all=true] | [&pid={pid}]
PUT /index?all=true //Reindex everything
PUT /index?all=true&pid={pid1} //pid1 will be ignored. Reindex everything
PUT /index/?pid={pid1}&pid={pid2} //Reindex pid1 and pid2
PUT /index/?all=false&pid={pid1}&pid={pid2} //Reindex pid1 and pid2
Two Methods: reindex(session, pids[]) Parameters: pids a repeatable list of identifiers of objects to be reindexed Returns Boolean TRUE if the reindex request is scheduled Return type: Types.Boolean Raises:
reindexAll(session) Returns Boolean TRUE if the reindexall request is scheduled Return type: Types.Boolean Raises:
Did we decide not to support the following example? Or should we allow it as valid?
PUT /index/{pid1}
We didn't mention to support this example.
Did we decide not to support the following example? Or should we allow it as valid?
PUT /index/{pid1}
I requested that we support the standard REST syntax. We discussed that on the call today @taojing2002
PUT /index/{pid1}
I just remembered we discussed the issue on /identifier/{pid}
. Okay, I just added it.
Now, we disabled the
reindex
andreindexall
methods on the olde Metacat API. They are very helpful. I am planning to add them into the Metacat Admin API. They are still only called by the Metacat administrators.