Norconex / committer-solr

Solr implementation of Norconex Committer. Should also work with any Solr-based products, such as LucidWorks.
https://opensource.norconex.com/committers/solr/
Apache License 2.0

Passing arguments to Solr update calls #3

Closed essiembre closed 9 years ago

essiembre commented 9 years ago

From @csaezl, originally posted on https://github.com/Norconex/collector-http/issues/74#issuecomment-90225426:

Talking again about /update parameters, is there a way of passing update.chain=langid to Solr in the HTTP Collector call?

essiembre commented 9 years ago

You can hard-code additional arguments to the update URLs used when invoking Solr by adding the solrUpdateURLParams tag, like this:

<committer class="com.norconex.committer.solr.SolrCommitter">

   ... your existing config ...

  <!-- add this: -->
  <solrUpdateURLParams>
     <param name="update.chain">langid</param>
  </solrUpdateURLParams>
</committer>

You can also add arguments for deletes. Refer to the SolrCommitter documentation for all options.
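As a rough illustration of what the configuration above is meant to achieve: assuming the committer forwards each param entry as a request parameter on the Solr update call, the effect should be comparable to appending name=value pairs to the update URL. The sketch below only builds such a URL. The helper name build_update_url, the host, and the core name are hypothetical placeholders, and values are not URL-encoded.

```shell
# Hypothetical helper (not part of Norconex or Solr): append name=value
# pairs to a base URL, using '?' before the first pair and '&' afterwards.
# Values are passed through as-is; no URL encoding is performed.
build_update_url() {
  base="$1"; shift
  sep='?'
  url="$base"
  for pair in "$@"; do
    url="${url}${sep}${pair}"
    sep='&'
  done
  printf '%s\n' "$url"
}

# Placeholder host and core name; adjust to your own Solr instance.
build_update_url 'http://localhost:8983/solr/MyRepository/update' 'update.chain=langid'
```

Sending a small test document to the resulting URL with curl is one way to confirm that Solr itself accepts the parameter before involving the committer.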

csaezl commented 9 years ago

I did try <param name="langid">update.chain</param> but, of course, it didn't work. Thank you, Carlos

csaezl commented 9 years ago

I've tried:

<committer class="com.norconex.committer.solr.SolrCommitter">
   ...
  <solrUpdateURLParams>
     <param name="update.chain">langid</param>
  </solrUpdateURLParams>
</committer>

and it doesn't work for me. This does work:

curl 'http://localhost:8983/solr/MyRepository/update?update.chain=langid' \
  --data-binary @data.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' \
  --data-binary '<commit/>' -H 'Content-type:application/xml'

essiembre commented 9 years ago

This issue has been fixed in a new 2.0.2 snapshot release. Please give it a try and confirm.

csaezl commented 9 years ago

It does work, but ... At the beginning of the run two errors appeared. From the log:

...
INFO - EuropeanUnion crawler: Initializing sitemap store...
INFO - EuropeanUnion crawler: Done initializing sitemap store.
INFO - Resolving sitemap: http://europa.eu/sitemap_index.xml
ERROR - Cannot fetch sitemap: http://europa.eu/sitemap_index.xml -- Likely an invalid sitemap XML format causing a parsing error (actual error: Unexpected character '-' (code 45) in external DTD subset; expected closing '>' after ENTITY declaration
 at [row,col,system-id]: [31,3,"http://www.w3.org/TR/html4/loose.dtd"]
 from [row,col {unknown-source}]: [1,1]).
INFO - Resolving sitemap: http://europa.eu/sitemap.xml
ERROR - Cannot fetch sitemap: http://europa.eu/sitemap.xml -- Likely an invalid sitemap XML format causing a parsing error (actual error: Unexpected character '-' (code 45) in external DTD subset; expected closing '>' after ENTITY declaration
 at [row,col,system-id]: [31,3,"http://www.w3.org/TR/html4/loose.dtd"]
 from [row,col {unknown-source}]: [1,1]).
INFO -           CRAWLER_STARTED (Subject: com.norconex.collector.http.crawler.HttpCrawler@5b785fd1)
INFO - EuropeanUnion crawler: Crawling references...
INFO -          DOCUMENT_FETCHED: http://europa.eu/index_en.htm (Subject: com.norconex.collector.http.fetch.impl.GenericDocumentFetcher@4581bd26)
...

From this point on, it seemed to work fine.

essiembre commented 9 years ago

Yes, those errors do not indicate a problem with the Solr Committer or the Collector you use. They occur because of a problem with the site you crawl. By default, the HTTP Collector checks whether a sitemap.xml file exists at the standard location. In this case, accessing http://europa.eu/sitemap.xml redirects to a non-sitemap page (something other than XML), hence the error. I recommend one of two things:

<sitemap ignore="true" />

(If you have follow-up questions about sitemaps, please open them here.)
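To check by hand whether a URL really serves a sitemap, one option is to look for one of the two root elements defined by the sitemap protocol, urlset or sitemapindex. The sketch below is a crude textual check, not an XML parser, and the helper name looks_like_sitemap is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) if standard input contains a
# sitemap root element, either <urlset ...> or <sitemapindex ...>.
# This is a plain text match and does not validate the XML.
looks_like_sitemap() {
  grep -Eq '<(urlset|sitemapindex)([[:space:]>])'
}

# Local demonstration with inline data (no network access needed).
if printf '%s\n' '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>' | looks_like_sitemap; then
  echo 'looks like a sitemap'
fi
if ! printf '%s\n' '<!DOCTYPE html><html><body>Home</body></html>' | looks_like_sitemap; then
  echo 'not a sitemap'
fi
```

Against a live site, the response body could be piped in, for example `curl -sL 'http://europa.eu/sitemap.xml' | looks_like_sitemap`. An HTML page returned after a redirect would fail the check, which matches the situation described above.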

essiembre commented 9 years ago

2.0.2 was just released with this fix. Closing.