-
This error happens with the seed URL for the site, so no document in the site is processed. What can I do?
```
MC(crawler): 2015-05-05 18:57:27 ERROR - Cannot fetch sitemap: http://valitsus.ee/sitema…
-
Here is an example of the error message the collector generates when it encounters a link that points to a page that no longer exists.
ERROR [AbstractCrawler] Norconex Minimum Test Page: Could not pr…
-
While loading documents into Solr with HTTP Collector, the computer restarted due to an issue. Just to be sure, what is the official advice for resuming the process from where HTTP Collector was interrupted?
Aft…
-
It seems like a library is missing for MP4 parsing:
Exception in thread "pool-1-thread-1" INFO [FilesystemCrawler] Projects: Re-processing orphan Files (if any)...
java.lang.NoClassDefFoundError: org…
-
After running a crawler with `3` and just one URL, I analysed the log and noticed that several URLs are processed several times via the events: `DOCUMENT_FETCHED, CREATED_ROBOTS_META, URLS_EXTRAC…
-
The regex in `.*test.*` is never passed to the importerhandler. Only the field value is.
-
Since it is not unusual for such types of files to lack a title, author, subject, etc., I'm wondering if there is a way of capturing about (say) 100 characters or so from the beginning of the docume…
-
Almost all documents crawled by HTTP Collector have information about their language, but some PDF, DOC, etc. files may not have this metadata because the authors don't register such information.
In this ca…
-
Hi, I'm trying to gather information about links: the text near the anchor.
I'm using:
norconex-collector-http-2.0.2.zip with openjdk-7
I have this definition:
```
text/htm…
-
I see strange behaviour where a page is added for indexing if it's new and deleted if it has been crawled before.
The expected behaviour should be to skip indexing if the page is unmodified or index if …