collective / collective.solr

Solr search engine integration for Plone
https://pypi.org/project/collective.solr/

[Plone5.2-rc2/Python3.6/c.solr8.0.0a1] Indexing file gives 404 #238

Closed NicolasGoeddel closed 2 years ago

NicolasGoeddel commented 4 years ago

Hi, when I add a PDF file to Plone, it cannot be indexed. I get this error in the log:

2019-09-13 13:09:00,244 WARNING [collective.solr.indexer:164][waitress] Error HTTP code=404, reason=Not Found @ /plone/testpdf.pdf

The PDF can be viewed without problems when browsing to its URL. However, the stream.file parameter within postdata contains the local path to the file inside the blobcache. I don't know if this is an issue at all, but I was thinking that this cannot work if Solr does not have access to the same filesystem as the blobcache. Shouldn't stream.file be a URL to Plone like http://localhost:8080/plone/testpdf.pdf or something like that?
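
For illustration, here is roughly the difference I mean (a sketch; the exact postdata c.solr builds contains more keys, and the paths here are placeholders):

postdata = {}
# What c.solr seems to send: a path on Plone's filesystem, which can only
# work if the Solr process can read that same filesystem.
postdata["stream.file"] = "/path/to/var/cacheblob/469/6.blob"
# What I would have expected instead: a URL that Solr fetches itself.
postdata["stream.url"] = "http://localhost:8080/plone/testpdf.pdf"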

Edit 1: I am using the configuration from https://github.com/kitconcept/kitconcept.recipe.solr/tree/master/config which @tisto mentioned here: https://github.com/collective/collective.solr/issues/237#issuecomment-530912876

After a bit of research I now think the ExtractingRequestHandler is missing. Is there a working example for Plone available? Or can I just copy and paste the configuration from this example: http://makble.com/how-to-extract-text-from-pdf-and-post-into-solr

Edit 2: I added this to my solrconfig.xml:

  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>
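
To check whether the handler is registered at all, a quick test from Python helps (a sketch; the core name "plone" and the file name are placeholders):

import requests

# POST a PDF to the extract handler: a 404 means the handler is not
# registered; a 200 with extracted text means extraction itself works.
resp = requests.post(
    "http://localhost:8983/solr/plone/update/extract",
    params={"extractOnly": "true", "extractFormat": "text", "wt": "xml"},
    files={"myfile": open("testpdf.pdf", "rb")},
)
print(resp.status_code)
print(resp.text[:500])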

Now I am getting this error:

2019-09-13 14:57:24,181 WARNING [collective.solr.indexer:164][waitress] Error HTTP code=400, reason=Bad Request @ /bfd-db/testpdf.pdf

However, Solr itself does not show any warning or error in its log. I am not sure what the problem is now. I just want to be able to index PDF files. What did I miss?

Edit 3: I found out that I have to enable remote streaming by adding these lines to solrconfig.xml:

    <requestDispatcher>
        <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="-1" formdataUploadLimitInKB="-1" addHttpRequestToContext="false" />
    </requestDispatcher>
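
As I understand it, remote streaming means Solr itself opens whatever stream.file points to, so the request is roughly equivalent to this (a sketch; the path is a placeholder):

import requests

# With enableRemoteStreaming=true, Solr opens the file named in
# stream.file itself, so the path must be readable by the Solr process.
resp = requests.post(
    "http://localhost:8983/solr/plone/update/extract",
    params={
        "stream.file": "/path/to/var/cacheblob/some.blob",
        "extractOnly": "true",
        "wt": "xml",
    },
)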

But now there is a new issue: because Solr runs as a different user than Plone, it has no access to the blob file. Snippet from the Solr log:

2019-09-13 14:38:53.503 ERROR (qtp1126185196-24) [   x:bfd-db] o.a.s.s.HttpSolrCall null:java.io.FileNotFoundException: /home/ploneuser/plone/zinstance/var/cacheblob/469/6.03d2688374835688.blob (Permission denied)

Is there another way to send a blob file to Solr without giving access to the blobcache?

Edit 4: I was able to give Solr access to the blobcache by running chmod +rx on the var directory.

But every solution brings a new error:

2019-09-13 17:40:27,907 WARNING [collective.solr.indexer:168][waitress] Parsing error Start tag expected, '<' not found, line 1, column 1 (<string>, line 1) @ /plone/testpdf.pdf.

I think the reason is that Solr does not answer with an XML response, but indexer.BinaryAdder expects one. On its side, Solr logs this exception:

2019-09-13 15:40:27.917 INFO  (qtp1126185196-20) [   x:bfd-db] o.a.s.s.HttpSolrCall Unable to write response, client closed connection or we are shutting down => org.eclipse.jetty.io.EofException: Closed

Did I configure something wrong or is this an open issue?

Thank you.

NicolasGoeddel commented 4 years ago

The solution seems to be adding

postdata["wt"] = "xml"

to collective.solr.indexer.BinaryAdder.__call__(). Then Solr answers with a proper XML response and c.solr can parse it. After that, the PDF was indexed and I could search for everything in it.
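
For reference, the change sits in the request construction, roughly like this (a fragment, not the full method):

def __call__(self, conn, **data):
    postdata = {}
    # ... existing code that fills postdata ...
    postdata["wt"] = "xml"  # ask Solr for XML so c.solr's parser can read it
    # ... existing code that posts to /update/extract ...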

sauzher commented 4 years ago

Hi, take a look here: https://github.com/collective/collective.solr/issues/28

NicolasGoeddel commented 4 years ago

That does not help me. The documentation link mentioned there is offline.

This is a nice idea, but it does not work when the object to be indexed is not publicly available:

postdata['stream.url'] = self.context.absolute_url()

The best approach would be to upload the binary blob to Solr instead of sending references like paths or URLs to the binary data.

Maybe I can find a way to change the extraction method so it works in all cases.

sauzher commented 4 years ago

I apologize. The point was: it is better to use a distributed file system like NFS and mount the blobstorage at the same path on the zeoserver, the zeoclient, and the solr-instance.

NicolasGoeddel commented 4 years ago

I don't like that idea. It is very easy to extract content from a binary file with Solr by uploading the file with curl:

curl 'http://localhost:8983/solr/mycore/update/extract?extractFormat=text&extractOnly=true&wt=xml' -F 'myfile=@testpdf.pdf'

I think that would be the better approach here. It would be completely independent of the filesystem, domain, and port Plone is running on.

It should be easy to do within the collective.solr.indexer.BinaryAdder class. At the moment I am struggling with the http_client used in collective.solr.solr, because it does not seem to be prepared for POST requests with form data and binary uploads. Maybe I will find a way around that tomorrow.

NicolasGoeddel commented 4 years ago

I was able to change the extraction process using an additional library: https://pypi.org/project/requests-toolbelt/ These are the changes: in collective.solr/solr.py, add the import

from requests_toolbelt import MultipartEncoder

and change the first line of SolrConnection.doGetOrPost() to this:

        if not isinstance(body, (six.binary_type, MultipartEncoder)):

Then in collective.solr/indexer.py also add the same import:

from requests_toolbelt import MultipartEncoder

and finally change the BinaryAdder class to this:

class BinaryAdder(DefaultAdder):
    """ Add binary content to index via tika
    """

    def getblob(self):
        field = self.context.getPrimaryField()
        return field.get(self.context).blob

    def getpath(self):
        blob = self.getblob()
        if blob is None:
            return None
        try:
            path = blob.committed()
        except BlobError:
            path = blob._p_blob_committed or blob._p_blob_uncommitted 
        logger.debug("Indexing BLOB from path %s", path)
        return path

    def __call__(self, conn, **data):
        postdata = {}
        path = self.getpath()
        if path is None:
            # No blob to extract from: fall back to the default adder.
            return super(BinaryAdder, self).__call__(conn, **data)

        openedBlob = self.getblob().open()

        postdata["extractFormat"] = "text"
        postdata["extractOnly"] = "true"
        postdata["wt"] = "xml"
        postdata['myfile'] = (data['id'], openedBlob, data.get("content_type", "application/octet-stream"))

        encodedPost = MultipartEncoder(fields = postdata)

        headers = conn.formheaders
        headers['Content-Type'] = encodedPost.content_type

        url = "%s/update/extract" % conn.solrBase

        try:
            response = conn.doPost(
                url, encodedPost.to_string(), headers
            )
            root = etree.parse(response)
            data["SearchableText"] = root.find(".//str").text.strip()
        except SolrConnectionException as e:
            logger.warn("Error %s @ %s", e, data["path_string"])
            data["SearchableText"] = ""
        except etree.XMLSyntaxError as e:
            logger.warn("Parsing error %s @ %s.", e, data["path_string"])
            data["SearchableText"] = ""
        finally:
            openedBlob.close()

        super(BinaryAdder, self).__call__(conn, **data)

After this change I no longer have to worry about the visibility of objects, or about the port, hostname, or file system Plone is running on.

It would be nice to see this change in a future version of collective.solr, or at least an option to use the upload feature of Solr.

Edit 1: It seems that conn.doPost() does not always work correctly with a MultipartEncoder object. Therefore I changed the second parameter to encodedPost.to_string(), which returns bytes.
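
To make that concrete (a sketch using the requests-toolbelt API linked above):

from requests_toolbelt import MultipartEncoder

encoded = MultipartEncoder(
    fields={"myfile": ("testpdf.pdf", open("testpdf.pdf", "rb"), "application/pdf")}
)
# to_string() reads the whole multipart stream into memory and returns
# bytes, which the existing doPost() code path handles without surprises.
body = encoded.to_string()
headers = {"Content-Type": encoded.content_type}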

tisto commented 4 years ago

@NicolasGoeddel would you mind creating a pull request adding the xml param? I recall that I had to add this param when working with newer versions of Solr. Could be that I missed adding it to the binary adder.

tisto commented 4 years ago

@NicolasGoeddel I investigated this a bit, and it seems that for whatever reason the BinaryAdder is used in the latest c.solr master. This shouldn't happen, since indexing binary files with Tika was always optional and never the default AFAIK. I could fix the problem by just disabling the BinaryAdder zcml registrations.

Do you have a full Tika config working now? Would you mind sharing it? I could imagine moving to Tika for indexing, though so far I have never felt the need to do so.

tisto commented 4 years ago

@NicolasGoeddel I created a fix for the BinaryAdder to make sure c.solr sends XML:

https://github.com/collective/collective.solr/pull/251

I also have a Solr config that works with Tika now. I will try to find time to push it to kitconcept.recipe.solr.

1letter commented 3 years ago

I have played a little bit with a configuration of Solr 8.6.2 and collective.solr, including indexing of binary files. I get many Unicode warnings, but the indexing runs. Old Excel files (.xls) produce log errors, but I can live with this.

<!-- schema.xml -->
<?xml version="1.0" encoding="UTF-8" ?>

<schema name="plone"
  version="1.6">

  <uniqueKey>UID</uniqueKey>

  <types>
    <!-- Default Field Types -->
    <fieldType name="long"
      class="solr.LongPointField"
      positionIncrementGap="0"
      docValues="true"/>
    <fieldType name="boolean"
      class="solr.BoolField"
      sortMissingLast="true"
      multiValued="true"/>
    <fieldType name="date"
      class="solr.DatePointField"
      positionIncrementGap="0"
      docValues="true"/>
    <fieldType name="tfloat"
      class="solr.FloatPointField"
      positionIncrementGap="0"
      docValues="true"/>
    <fieldType name="tfloats"
      class="solr.FloatPointField"
      positionIncrementGap="0"
      multiValued="true"/>
    <fieldType name="tint"
      class="solr.IntPointField"
      positionIncrementGap="0"
      docValues="true"/>
    <fieldType name="tints"
      class="solr.IntPointField"
      positionIncrementGap="0"
      multiValued="true" />
    <fieldType name="tlong"
      class="solr.LongPointField"
      positionIncrementGap="0"
      docValues="true" />
    <fieldType name="tlongs"
      class="solr.LongPointField"
      positionIncrementGap="0"
      multiValued="true" />
    <fieldType name="point"
      class="solr.PointType"
      subFieldSuffix="_d"
      dimension="2"/>
    <fieldType name="random"
      class="solr.RandomSortField"
      indexed="true"/>
    <fieldType name="string"
      class="solr.StrField"
      sortMissingLast="true"/>
    <fieldType name="strings"
      class="solr.StrField"
      sortMissingLast="true"
      multiValued="true"/>
    <fieldType name="tdate"
      class="solr.DatePointField"
      positionIncrementGap="0"
      docValues="true" />
    <fieldType name="tdates"
      class="solr.DatePointField"
      positionIncrementGap="0"
      multiValued="true" />
    <fieldType name="tdouble"
      class="solr.DoublePointField"
      positionIncrementGap="0"
      docValues="true"/>
    <fieldType name="tdoubles"
      class="solr.DoublePointField"
      positionIncrementGap="0"
      multiValued="true" />

    <!-- A general text field that has reasonable, generic
         cross-language defaults: it tokenizes with StandardTokenizer,
           removes stop words from case-insensitive "stopwords.txt"
           (empty by default), and down cases.  At query time only, it
           also applies synonyms.
      -->
    <fieldType name="text_general"
      class="solr.TextField"
      positionIncrementGap="100"
      multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="stopwords.txt" />
        <filter class="solr.SynonymGraphFilterFactory"
          synonyms="synonyms.txt"
          ignoreCase="true"
          expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text"
      class="solr.TextField"
      positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory"
          mapping="mapping-FoldToASCII.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymGraphFilterFactory"
          synonyms="synonyms.txt"
          ignoreCase="true"
          expand="false"/>
        <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="stopwords.txt" />
        <filter class="solr.WordDelimiterGraphFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory"
          withOriginal="true"
          maxPosAsterisk="2"
          maxPosQuestion="1"
          minTrailing="2"
          maxFractionAsterisk="0"/>
        <filter class="solr.FlattenGraphFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory"
          mapping="mapping-FoldToASCII.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymGraphFilterFactory"
          synonyms="synonyms.txt"
          ignoreCase="true"
          expand="true"/>
        <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="stopwords.txt" />
        <filter class="solr.WordDelimiterGraphFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- lowercases the entire field value, keeping it as a single token.  -->
    <fieldType name="lowercase"
      class="solr.TextField"
      positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

    <!-- ignore unknown fields extracted from binaries -->
    <fieldType name="ignored"
      indexed="false"
      stored="false"
      class="solr.StrField" />

  </types>

  <fields>
    <field name="id"
      type="string"
      indexed="true"
      stored="true"
      required="false" />
    <field name="_version_"
      type="long"
      indexed="true"
      stored="true"/>

    <!-- Plone Core Fields -->
    <!-- name:allowedRolesAndUsers   type:string stored:true multivalued:true -->
    <field name="allowedRolesAndUsers"
      type="string"
      indexed="true"
      stored="true"
      multiValued="true" />
    <!-- name:created                type:date stored:true -->
    <field name="created"
      type="date"
      indexed="true"
      stored="true" />
    <!-- name:Creator                type:string stored:true -->
    <field name="Creator"
      type="string"
      indexed="true"
      stored="true" />
    <!-- name:Date                   type:date stored:true -->
    <field name="Date"
      type="date"
      indexed="true"
      stored="true" />
    <!-- name:default                type:text indexed:true stored:false multivalued:true -->
    <field name="default"
      type="text"
      indexed="true"
      stored="false"
      multiValued="true" />
    <!-- name:Description            type:text copyfield:default stored:true -->
    <field name="Description"
      type="text"
      indexed="true"
      stored="true" />
    <!-- name:effective              type:date stored:true -->
    <field name="effective"
      type="date"
      indexed="true"
      stored="true" />
    <!-- name:exclude_from_nav       type:boolean indexed:false stored:true -->
    <field name="exclude_from_nav"
      type="boolean"
      indexed="false"
      stored="true" />
    <!-- name:expires                type:date stored:true -->
    <field name="expires"
      type="date"
      indexed="true"
      stored="true" />
    <!-- name:getIcon                type:string indexed:false stored:true -->
    <field name="getIcon"
      type="string"
      indexed="false"
      stored="true" />
    <!-- name:getId                  type:string indexed:false stored:true -->
    <field name="getId"
      type="string"
      indexed="false"
      stored="true" />
    <!-- name:getRemoteUrl           type:string indexed:false stored:true -->
    <field name="getRemoteUrl"
      type="string"
      indexed="false"
      stored="true" />
    <!-- name:is_folderish           type:boolean stored:true -->
    <field name="is_folderish"
      type="boolean"
      indexed="true"
      stored="true" />
    <!-- name:Language               type:string stored:true -->
    <field name="Language"
      type="string"
      indexed="true"
      stored="true" />
    <!-- name:modified               type:date stored:true -->
    <field name="modified"
      type="date"
      indexed="true"
      stored="true" />
    <!-- name:object_provides        type:string stored:true multivalued:true -->
    <field name="object_provides"
      type="string"
      indexed="true"
      stored="true"
      multiValued="true" />
    <!-- name:path_depth             type:integer indexed:true stored:true -->
    <field name="path_depth"
      type="tint"
      indexed="true"
      stored="true" />
    <!-- name:path_parents           type:string indexed:true stored:true multivalued:true -->
    <field name="path_parents"
      type="string"
      indexed="true"
      stored="true"
      multiValued="true" />
    <!-- name:path_string            type:string indexed:false stored:true -->
    <field name="path_string"
      type="string"
      indexed="false"
      stored="true" />
    <!-- name:portal_type            type:string stored:true -->
    <field name="portal_type"
      type="string"
      indexed="true"
      stored="true" />
    <!-- name:review_state           type:string stored:true -->
    <field name="review_state"
      type="string"
      indexed="true"
      stored="true" />
    <!-- name:SearchableText         type:text copyfield:default stored:true -->
    <field name="SearchableText"
      type="text"
      indexed="true"
      stored="true"
      multiValued="true" />
    <!-- name:searchwords            type:string stored:true multivalued:true -->
    <field name="searchwords"
      type="string"
      indexed="true"
      stored="true"
      multiValued="true" />
    <!-- name:showinsearch           type:boolean stored:true -->
    <field name="showinsearch"
      type="boolean"
      indexed="true"
      stored="true" />
    <field name="sortable_title"
      type="string"
      indexed="true"
      stored="true" />
    <!-- name:Subject                type:string copyfield:default stored:true multivalued:true -->
    <field name="Subject"
      type="string"
      indexed="true"
      stored="true"
      multiValued="true" />
    <!-- name:Title                  type:text copyfield:default stored:true -->
    <field name="Title"
      type="text"
      indexed="true"
      stored="true" />
    <!-- name:Type                   type:string stored:true -->
    <field name="Type"
      type="string"
      indexed="true"
      stored="true" />
    <!-- name:UID                    type:string stored:true required:true -->
    <field name="UID"
      type="string"
      indexed="true"
      stored="true"
      required="false" />

    <copyField source="Title"
      dest="default"/>
    <copyField source="Description"
      dest="default"/>
    <copyField source="Subject"
      dest="default"/>
    <copyField source="default"
      dest="SearchableText"/>

  </fields>

</schema>
<!--solrconfig.xml-->
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>8.6.2</luceneMatchVersion>

  <dataDir>${solr.data.dir:}</dataDir>

  <directoryFactory name="DirectoryFactory"
    class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

  <!--codecFactory class="solr.SchemaCodecFactory"/-->
  <schemaFactory class="ClassicIndexSchemaFactory"/>

  <!-- The default high-performance update handler -->
  <updateHandler class="solr.DirectUpdateHandler2">

    <!-- Enables a transaction log, used for real-time get, durability, and
         and solr cloud replica recovery.  The log can grow as big as
         uncommitted changes to the index, so use of a hard autoCommit
         is recommended (see below).
         "dir" - the target directory for transaction logs, defaults to the
                solr data directory.
         "numVersionBuckets" - sets the number of buckets used to keep
                track of max version values when checking for re-ordered
                updates; increase this value to reduce the cost of
                synchronizing access to version buckets during high-volume
                indexing, this requires 8 bytes (long) * numVersionBuckets
                of heap space per Solr core.
    -->
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
      <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
    </updateLog>

    <!-- AutoCommit

         Perform a hard commit automatically under certain conditions.
         Instead of enabling autoCommit, consider using "commitWithin"
         when adding documents.

         http://wiki.apache.org/solr/UpdateXmlMessages

         maxDocs - Maximum number of documents to add since the last
                   commit before automatically triggering a new commit.

         maxTime - Maximum amount of time in ms that is allowed to pass
                   since a document was added before automatically
                   triggering a new commit.
         openSearcher - if false, the commit causes recent index changes
           to be flushed to stable storage, but does not cause a new
           searcher to be opened to make those changes visible.

         If the updateLog is enabled, then it's highly recommended to
         have some sort of hard autoCommit to limit the log size.
      -->
    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

    <!-- softAutoCommit is like autoCommit except it causes a
         'soft' commit which only ensures that changes are visible
         but does not ensure that data is synced to disk.  This is
         faster and more near-realtime friendly than a hard commit.
      -->

    <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
    </autoSoftCommit>

    <!-- Update Related Event Listeners

         Various IndexWriter related events can trigger Listeners to
         take actions.

         postCommit - fired after every commit or optimize command
         postOptimize - fired after every optimize command
      -->

  </updateHandler>

  <requestHandler name="/select"
    class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="df">SearchableText</str>
      <str name="wt">xml</str>
    </lst>
  </requestHandler>

  <requestHandler name="/update"
    class="solr.UpdateRequestHandler" />

  <requestHandler name="/admin/ping"
    class="solr.PingRequestHandler">
    <lst name="invariants">
      <str name="q">solrpingquery</str>
    </lst>
    <lst name="defaults">
      <str name="echoParams">all</str>
    </lst>
  </requestHandler>

  <!-- enable file indexing use solr cell-->
  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib"
    regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/"
    regex="solr-cell-\d.*\.jar" />

  <requestHandler name="/update/extract"
    startup="lazy"
    class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
      <str name="uprefix">ignored_</str>
    </lst>
  </requestHandler>

  <!-- enable remote stream parsing -->
  <requestDispatcher>
    <requestParsers enableRemoteStreaming="true"
      multipartUploadLimitInKB="-1"
      formdataUploadLimitInKB="-1"
      addHttpRequestToContext="false" />
  </requestDispatcher>

</config>
fogstand commented 3 years ago

Ravishers-MacBook-Air:solr-8.6.3 ravishersingh$ bin/post -c mycore -filetypes html https://www.cdc.gov/lgbthealth/Transgender.htm
java -classpath /Users/ravishersingh/downloads/solr-8.6.3/dist/solr-core-8.6.3.jar -Dauto=yes -Dfiletypes=html -Dc=mycore -Ddata=web org.apache.solr.util.SimplePostTool https://www.cdc.gov/lgbthealth/Transgender.htm
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/mycore/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings html
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: The URL https://www.cdc.gov/lgbthealth/Transgender.htm is disallowed by robots.txt and will not be crawled.
SimplePostTool: WARNING: The URL https://www.cdc.gov/lgbthealth/Transgender.htm returned a HTTP result status of 403
0 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/mycore/update/extract...
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/mycore/update/extract?commit=true
SimplePostTool: WARNING: Response:

Error 404 Not Found

HTTP ERROR 404 Not Found

URI:/solr/mycore/update/extract
STATUS:404
MESSAGE:Not Found
SERVLET:default
I keep getting the above error. Can anyone point me in the right direction? Thanks.
tisto commented 2 years ago

@NicolasGoeddel @sauzher @1letter I'd like to support the stream.url use case out of the box (we moved to RelStorage for one of our recent projects, and that's incompatible with reading blobs from the filesystem). I guess we could make that a checkbox option in the control panel that integrators can turn on and off. What do you think?
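
A rough sketch of what I have in mind (hypothetical; the registry record name and the helper function are made up for illustration):

from plone import api

def build_extract_postdata(context, blob_path):
    # Choose between stream.url and stream.file based on a control-panel
    # setting (hypothetical registry record).
    postdata = {"extractOnly": "true", "wt": "xml"}
    if api.portal.get_registry_record(
            "collective.solr.use_stream_url", default=False):
        # Solr fetches the object over HTTP; no shared filesystem needed,
        # but the object must be reachable from the Solr host.
        postdata["stream.url"] = context.absolute_url()
    else:
        # Default: pass the blob path; requires a shared filesystem.
        postdata["stream.file"] = blob_path
    return postdata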

sauzher commented 2 years ago

OK for me, as long as it works safely OOTB.