VIVO-1752: Instantiate Solr core via API

chenejac commented 4 years ago

Benjamin Gross (Migrated from VIVO-1752) said:

Solr cores can be created and configured via RESTful API calls. Documentation is here: [https://lucene.apache.org/solr/guide/7_3/coreadmin-api.html#coreadmin-api]. Document the collection of calls that would replicate the vivocore directory currently provided at [https://github.com/vivo-community/vivo-solr/tree/vivo-solr-1.11.0/vivocore].

Advantages of this:

API calls are platform independent
API calls can be triggered via a script separate from VIVO, or from within VIVO itself during a first-time method
Less confusion compared to the current process (copy the vivocore schema directory into the correct directory, ensure permissions are set correctly, then delete the directory afterwards because it's no longer used once loaded)

chenejac commented 4 years ago

Benjamin Gross said:

It seems that Solr will require a solrconfig.xml file regardless of how the core is created, so there may not be a huge advantage to programmatically creating the schema via API if we still need to copy the config. As an iterative step, I wrote a Python script that, provided a valid Solr URL, determines the correct solr.home value and copies the vivo-solr conf directory into the right place. Run it from the vivo-solr directory.

import os
import requests
from shutil import copytree

core = "vivocore"
solr_url = "http://localhost:8983/solr"

# Ensure the core doesn't already exist
r = requests.get(solr_url + "/admin/cores",
                 params={"action": "STATUS", "core": core})
if r.status_code != 200:
    raise ValueError("Unable to connect to Solr. Is the solr_url value correct?")

# Default to an empty dict so a missing "status" key does not break the lookup
if r.json().get("status", {}).get(core):
    raise ValueError("A core named \"{}\" already exists!".format(core))

# Determine solr.home
r = requests.get(solr_url + "/admin/info/system")
solr_home = r.json().get("solr_home")
if not solr_home:
    raise ValueError("Unable to determine solr.home from the system info response.")

# Copy files and instantiate the core. The CoreAdmin call has the form:
#   admin/cores?
#   action=CREATE&
#   name=core-name&
#   instanceDir=path/to/dir&
#   config=solrconfig.xml&dataDir=data

# Copy the configuration directory into a new directory in solr.home
instance_dir = os.path.join(solr_home, core)
copytree("vivocore/conf", os.path.join(instance_dir, "conf"))

# Create the new core, using the VIVO configuration we just copied.
# Passing params lets requests URL-encode the instanceDir path.
r = requests.get(solr_url + "/admin/cores",
                 params={"action": "CREATE",
                         "name": core,
                         "instanceDir": instance_dir,
                         "config": "solrconfig.xml",
                         "dataDir": "data"})

result = r.json()
if "error" in result:
    raise RuntimeError("Solr returned an error: {}".format(result["error"]))
elif "core" in result:
    print("Successfully created core.")

chenejac commented 4 years ago

Benjamin Gross said:

Looks like Solr is delivered with some basic configurations located at [solr.home]/server/solr/configsets/_default/conf. One solution would be to point the new core at one of these default configurations, then apply any necessary changes using the Config API, which overlays the .xml file defaults.
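
A minimal sketch of that idea, assuming the CoreAdmin `configSet` parameter and the Config API `set-property` command behave as described in the Solr reference guide. The helper names and the example property are illustrative only and are not part of vivo-solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SOLR_URL = "http://localhost:8983/solr"  # assumption: local standalone Solr


def create_core_url(core, config_set="_default", solr_url=SOLR_URL):
    """Build a CoreAdmin CREATE URL that bases the new core on a configset."""
    query = urlencode({"action": "CREATE", "name": core, "configSet": config_set})
    return "{}/admin/cores?{}".format(solr_url, query)


def config_overlay_request(core, command, solr_url=SOLR_URL):
    """Build (but do not send) a Config API POST carrying one overlay command."""
    return Request(
        "{}/{}/config".format(solr_url, core),
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_and_tweak(core):
    """Send both requests. Needs a running Solr; the property is only an example."""
    urlopen(create_core_url(core))
    urlopen(config_overlay_request(
        core, {"set-property": {"updateHandler.autoCommit.maxTime": 15000}}))
```

Keeping the URL and payload construction separate from the sending step makes it easy to inspect exactly what would be sent before touching a real Solr instance.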

chenejac commented 4 years ago

Andrew Woods said:

Potentially helpful description of migrating Solr 4 configuration to Solr 7: https://library.brown.edu/DigitalTechnologies/upgrading-from-solr-4-to-solr-7/

Thanks!

chenejac commented 4 years ago

Benjamin Gross said:

At first glance, it seems that not everything in solrconfig.xml can be set via API. A diff of the default config against the one delivered with VIVO shows some changes that can't be made via API. However, there is a second example titled "sample_techproducts_configs" that includes almost everything in the vivo-solr solrconfig.xml.

What will still need to be configured are the search defaults for a select query and the etag generation bit.


<requestHandler name="/select" class="solr.SearchHandler">
  <!--requestHandler name="search" class="solr.SearchHandler" default="true"-->
  <!-- default values for query parameters can be specified, these
       will be overridden by parameters in the request
    -->
  <!--  Copying defaults from the old Vitro's solrconfig -->
  <lst name="defaults">
    <!-- Adding q.op here -->
    <str name="q.op">AND</str>
    <str name="defType">edismax</str>
    <!-- nameText added for NIHVIVO-3701 -->
    <str name="qf">ALLTEXT ALLTEXTUNSTEMMED nameText^2.0 nameUnstemmed^2.0 nameStemmed^2.0 nameLowercase</str>
    <str name="echoParams">explicit</str>
    <str name="qs">2</str>
    <int name="rows">10</int>
    <str name="q.alt">*:*</str>
    <str name="fl">*,score</str>
    <str name="hl">true</str>
    <str name="hl.fl">ALLTEXT</str>
    <str name="hl.fragsize">160</str>
    <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
    <str name="hl.simple.post"><![CDATA[</strong>]]></str>
    <!--  Default value of mm is 100% which should result in AND behavior, still setting it here
          https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser -->
    <str name="mm">100%</str>
  </lst>
</requestHandler>


  <!-- Update Request Handler.
      http://wiki.apache.org/solr/UpdateXmlMessages
      The canonical Request Handler for Modifying the Index through
      commands specified using XML, JSON, CSV, or JAVABIN
      Note: Since solr1.1 requestHandlers requires a valid content
      type header if posted in the body. For example, curl now
      requires: -H 'Content-type:text/xml; charset=utf-8'
      To override the request content type and force a specific
      Content-type, use the request parameter:
        ?update.contentType=text/csv
      This handler will pick a response format to match the input
      if the 'wt' parameter is not explicit
   -->

 <requestHandler name="/update" class="solr.UpdateRequestHandler">
   <!-- See below for information on defining
        updateRequestProcessorChains that can be used by name
        on each Update Request
     -->
     <lst name="defaults">
        <str name="update.chain">etag</str>
      </lst>
   <!--
      <lst name="defaults">
        <str name="update.chain">dedupe</str>
      </lst>
      -->
 </requestHandler>


<!-- ETag generation
     Creates the "etag" field on the fly based on a hash of all other
     fields.
  -->
   <updateRequestProcessorChain name="etag">
     <processor class="solr.processor.SignatureUpdateProcessorFactory">
       <bool name="enabled">true</bool>
       <str name="signatureField">etag</str>
       <bool name="overwriteDupes">false</bool>
       <str name="signatureClass">solr.processor.Lookup3Signature</str>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
   </updateRequestProcessorChain>
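
For the /select defaults above, a Config API overlay might look roughly like the following sketch. It only builds and prints the JSON command body. Whether `update-requesthandler` (rather than `add-requesthandler`) is the right command for a handler already declared in solrconfig.xml is an assumption that would need checking against a real core.

```python
import json

# Defaults copied from the /select handler above.
SELECT_DEFAULTS = {
    "q.op": "AND",
    "defType": "edismax",
    "qf": "ALLTEXT ALLTEXTUNSTEMMED nameText^2.0 nameUnstemmed^2.0 "
          "nameStemmed^2.0 nameLowercase",
    "echoParams": "explicit",
    "qs": "2",
    "rows": 10,
    "q.alt": "*:*",
    "fl": "*,score",
    "hl": "true",
    "hl.fl": "ALLTEXT",
    "hl.fragsize": "160",
    "hl.simple.pre": "<strong>",
    "hl.simple.post": "</strong>",
    "mm": "100%",
}


def select_handler_command(defaults=SELECT_DEFAULTS):
    """Return a Config API command body that redefines the /select handler."""
    return {
        "update-requesthandler": {
            "name": "/select",
            "class": "solr.SearchHandler",
            "defaults": dict(defaults),
        }
    }


# The body would be POSTed to <solr_url>/<core>/config.
print(json.dumps(select_handler_command(), indent=2))
```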

chenejac commented 4 years ago

Benjamin Gross said:

According to the documentation, Solr's Config API does not support updateRequestProcessorChain, which we use for creating etags. We can create the etag processor, but I'm not sure whether the other parts of the processor chain (solr.LogUpdateProcessorFactory and solr.RunUpdateProcessorFactory) will run automatically if we set up 'etag' as a default processor for any updates...
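
A sketch of what creating the etag processor on its own could look like, using the Config API `add-updateprocessor` command and the `processor` request parameter on updates. The settings mirror the updateRequestProcessorChain shown earlier. Whether the Log and Run processors then run as part of the default chain is untested here and should be treated as an open question.

```python
from urllib.parse import urlencode


def etag_processor_command():
    """Config API command body that registers a standalone 'etag' processor."""
    return {
        "add-updateprocessor": {
            "name": "etag",
            "class": "solr.processor.SignatureUpdateProcessorFactory",
            "enabled": True,
            "signatureField": "etag",
            "overwriteDupes": False,
            "signatureClass": "solr.processor.Lookup3Signature",
        }
    }


def update_url(core, solr_url="http://localhost:8983/solr"):
    """Build an update URL that names the processor explicitly per request."""
    return "{}/{}/update?{}".format(solr_url, core, urlencode({"processor": "etag"}))
```

Naming the processor per request avoids needing a default chain in solrconfig.xml, at the cost of every client having to pass the parameter.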

chenejac commented 4 years ago

Benjamin Gross said:

Another problem... The Solr Schema API does not allow us to set the uniqueKey, and for some reason VIVO uses a custom 'DocID' field instead of the default 'id'. [https://issues.apache.org/jira/browse/SOLR-7242]

Vitro will have to be modified to use the default Solr id. Doesn't seem too rough: https://github.com/vivo-project/Vitro/search?q=docid
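
Although the Schema API cannot set the uniqueKey, it can report it, so a setup script could at least verify which key a core uses. A small sketch, assuming the read-only `/schema/uniquekey` endpoint returns a JSON body with a `uniqueKey` entry. The function names are hypothetical.

```python
import json


def unique_key_url(core, solr_url="http://localhost:8983/solr"):
    """Build the Schema API URL that reports a core's uniqueKey field."""
    return "{}/{}/schema/uniquekey".format(solr_url, core)


def unique_key_matches(response_body, expected="DocID"):
    """Return True if the JSON response names the expected uniqueKey field."""
    return json.loads(response_body).get("uniqueKey") == expected
```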

chenejac commented 4 years ago

Benjamin Gross said:

As discussed during the dev meeting today ([https://wiki.lyrasis.org/display/VIVO/2020-10-06+-+VIVO+Development+IG]), we will likely take a first step by having VIVO or Vitro copy the solrconfig.xml and schema.xml files into the right spot and create the core via API using those files.

chenejac commented 4 years ago

Andrew Woods said:

Notes, Using LukeRequestHandler (https://cwiki.apache.org/confluence/display/SOLR/LukeRequestHandler):

Update server/solr/vivocore/conf/solrconfig.xml with:
Get schema info with: curl http://localhost:8983/solr/vivocore/admin/luke?numTerms=0
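
The same Luke request can be made from Python. This is a sketch under the assumption that the JSON response carries a top-level `fields` object keyed by field name, each with a `type` entry. The helper names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def luke_url(core, num_terms=0, solr_url="http://localhost:8983/solr"):
    """Build the LukeRequestHandler URL for a core, asking for JSON output."""
    query = urlencode({"numTerms": num_terms, "wt": "json"})
    return "{}/{}/admin/luke?{}".format(solr_url, core, query)


def field_types(luke_response):
    """Map each field name in a parsed Luke response to its reported type."""
    fields = luke_response.get("fields", {})
    return {name: info.get("type") for name, info in fields.items()}


def fetch_field_types(core):
    """Query a running Solr and return its field-to-type mapping."""
    with urlopen(luke_url(core)) as resp:
        return field_types(json.load(resp))
```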

chenejac commented 4 years ago

Benjamin Gross said:

A question that came up during the call today: if a user has an existing hardened Solr installation, will VIVO be able to determine the location of Solr.home? How about if they install using this script? https://lucene.apache.org/solr/guide/8_0/taking-solr-to-production.html#taking-solr-to-production

chenejac commented 4 years ago

Andrew Woods said:

Closing this ticket due to:

Solr API for standalone Solr does not provide for all of the configuration required by VIVO
Having VIVO copy configuration into Solr's directories while also ensuring that the Solr user/application can write to those directories requires as much (if not more) sysadmin work than to simply copy the VIVO-specific Solr configuration files by hand.

The solution here is to stay with the pattern of suggesting sysadmins configure VIVO's Solr by copying the configuration found in https://github.com/vivo-project/vivo-solr per the associated instructions.

We may revisit this ticket in the context of supporting SolrCloud, which supports a more complete API.

VIVO-1752: Instantiate Solr core via API #1638