-
While loading documents into Solr with HTTP Collector, the computer restarted due to an issue. Just to be sure, what is the official advice for resuming the process where HTTP Collector was interrupted?
Aft…
-
After running a crawler with `3` and just one URL, I analysed the log and noticed that several URLs are processed several times via the events: `DOCUMENT_FETCHED, CREATED_ROBOTS_META, URLS_EXTRAC…
-
I'm using the latest Norconex HTTP Collector. By default the importer removes HTML elements and keeps just the body text.
How do I configure it to keep specific HTML elements? For example, I would lik…
-
Is it possible to remove only documents with a 404 status code?
(and also log the broken link)
-
Documents from other simultaneously running jobs are added as ICommitOperation to the job's committer when using a committer based on AbstractMappedCommitter.
Using a simple committer like this will log ou…
-
From @leonardsaers: Java 8 was required to make the latest snapshot work. See ticket https://github.com/Norconex/collector-http/issues/66#issuecomment-85087299
Java 7 should be supported.
-
I see a strange behaviour where a page is added for indexing if it is new and deleted if it has been crawled before.
The expected behaviour should be to skip indexing if the page is unmodified or index if …
-
Since it is not unusual that such types of files don't have a title, author, subject, etc., I'm wondering if there is a way of capturing (say) 100 characters or so from the beginning of the docume…
-
From @csaezl, originally posted on https://github.com/Norconex/collector-http/issues/74#issuecomment-90225426:
> Talking again about /update parameters, is a way of passing update.chain=langid to So…
-