Closed by essiembre 9 years ago
You can hard-code additional arguments on the update URLs used when invoking Solr by adding the solrUpdateURLParams
tag, like this:
<committer class="com.norconex.committer.solr.SolrCommitter">
... your existing config ...
<!-- add this: -->
<solrUpdateURLParams>
<param name="update.chain">langid</param>
</solrUpdateURLParams>
</committer>
You can also add arguments for deletes. Refer to the SolrCommitter for all options.
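As a rough illustration of what this configuration amounts to, the sketch below appends a parameter to an update URL as a query string. It is not the committer's actual code: `build_update_url` is a hypothetical helper, and the host and core name are placeholders.

```python
from urllib.parse import urlencode


def build_update_url(base_url, params):
    """Append hard-coded parameters to a Solr update URL as a query string.

    Hypothetical helper for illustration only; not part of the Solr Committer.
    """
    if not params:
        return base_url
    # Use '&' if the URL already carries a query string, '?' otherwise.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(params)


# Placeholder Solr URL; the parameter matches the <param> element above.
url = build_update_url(
    "http://localhost:8983/solr/MyRepository/update",
    {"update.chain": "langid"},
)
print(url)
```

Note that the `name` attribute carries the parameter name (`update.chain`) and the element body carries its value (`langid`).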
I did try <param name="langid">update.chain</param>
but, of course, it didn't work.
Thank you,
Carlos
I've tried:
<committer class="com.norconex.committer.solr.SolrCommitter">
...
<solrUpdateURLParams>
<param name="update.chain">langid</param>
</solrUpdateURLParams>
</committer>
and it doesn't work for me. This does work:
curl 'http://localhost:8983/solr/MyRepository/update?update.chain=langid' \
  --data-binary @data.xml -H 'Content-type:application/xml'
curl 'http://localhost:8983/solr/update?update.chain=langid' \
  --data-binary '<commit/>' -H 'Content-type:application/xml'
This issue has been fixed in a new 2.0.2 snapshot release. Please give it a try and confirm.
It does work, but ... At the beginning of the run two errors appeared. From the log:
...
INFO - EuropeanUnion crawler: Initializing sitemap store...
INFO - EuropeanUnion crawler: Done initializing sitemap store.
INFO - Resolving sitemap: http://europa.eu/sitemap_index.xml
ERROR - Cannot fetch sitemap: http://europa.eu/sitemap_index.xml -- Likely an invalid sitemap XML format causing a parsing error (actual error: Unexpected character '-' (code 45) in external DTD subset; expected closing '>' after ENTITY declaration
at [row,col,system-id]: [31,3,"http://www.w3.org/TR/html4/loose.dtd"]
from [row,col {unknown-source}]: [1,1]).
INFO - Resolving sitemap: http://europa.eu/sitemap.xml
ERROR - Cannot fetch sitemap: http://europa.eu/sitemap.xml -- Likely an invalid sitemap XML format causing a parsing error (actual error: Unexpected character '-' (code 45) in external DTD subset; expected closing '>' after ENTITY declaration
at [row,col,system-id]: [31,3,"http://www.w3.org/TR/html4/loose.dtd"]
from [row,col {unknown-source}]: [1,1]).
INFO - CRAWLER_STARTED (Subject: com.norconex.collector.http.crawler.HttpCrawler@5b785fd1)
INFO - EuropeanUnion crawler: Crawling references...
INFO - DOCUMENT_FETCHED: http://europa.eu/index_en.htm (Subject: com.norconex.collector.http.fetch.impl.GenericDocumentFetcher@4581bd26)
...
From this point on, it seemed to work fine.
Yes, those errors do not represent an issue with the Solr Committer or the Collector you use. They occur because of something wrong with the site you crawl. By default the HTTP Collector will check if a sitemap.xml file exists at the standard location. In this case, accessing http://europa.eu/sitemap.xml generates a redirect to a non-sitemap page (something other than XML). Hence the error. I recommend one of two things:
<sitemap ignore="true" />
(If you have follow-up questions about sitemaps, please open them here.)
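To make the failure mode concrete, here is a minimal sketch of how one might check whether a response body is a sitemap at all. It is not the Collector's own logic: `looks_like_sitemap` is a hypothetical helper, and it only assumes the standard sitemap root elements (`urlset` and `sitemapindex`).

```python
import xml.etree.ElementTree as ET

# Root element names defined by the sitemap protocol.
SITEMAP_ROOTS = {"urlset", "sitemapindex"}


def looks_like_sitemap(body):
    """Return True if body parses as XML with a sitemap root element.

    Hypothetical helper for illustration only; not part of the HTTP Collector.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # An HTML page served in place of the sitemap usually ends up here.
        return False
    # Drop the "{namespace}" prefix ElementTree puts on qualified tag names.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in SITEMAP_ROOTS
```

A page reached through a redirect, like the one described above, would typically fail either the parse or the root-element check.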
2.0.2 was just released with this fix. Closing.
From @csaezl, originally posted on https://github.com/Norconex/collector-http/issues/74#issuecomment-90225426: