-
Some weird stuff happens when I crawl more than 1,000 URLs.
Originally, I set it up with 440,000 URLs and a single crawler, then started it. But no INFO messages appear like "DOCUMENT_IMPORTED" or "REJECTED_FIL…
-
My Crawler Name: 2016-04-20 14:41:50 ERROR - My Crawler Name: Could not mark reference as processed: URL (can't serialize class com.norconex.commons.lang.file.ContentType)
java.lang.IllegalArgumentExc…
-
Hi!
Some PDFs still cannot be parsed:
```
www.db.com: 2016-01-18 11:54:17 WARN - Could not import https://www.db.com/ir/en/download/DB_Interim_Report_1Q2015.pdf
com.norconex.importer.parser.DocumentP…
-
I'm using the Norconex HTTP Collector to crawl HTML files and send certain meta-fields and the content text to a Solr server.
What I now want to do is to only send text from the content to the Solr s…
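Not a definitive answer, since the question is cut off, but a possible direction: the Norconex Importer module can restrict which metadata fields survive to the committer, and the Solr Committer lets you choose the Solr field that receives the extracted content text. A minimal sketch, assuming Importer/Committer 2.x class names; the field names (`title`, `keywords`, `content`) and the Solr URL are placeholders:

```xml
<importer>
  <postParseHandlers>
    <!-- Keep only the listed metadata fields; all other meta-fields are
         dropped before the document reaches the committer.
         Field names are examples only. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
            fields="title,keywords,document.reference"/>
  </postParseHandlers>
</importer>

<committer class="com.norconex.committer.solr.SolrCommitter">
  <solrURL>http://localhost:8983/solr/mycore</solrURL>
  <!-- Solr field that should receive the extracted text content
       (assumed parameter name; check your committer version). -->
  <targetContentField>content</targetContentField>
</committer>
```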
-
We are using the Norconex HTTP Collector to crawl HTML & binary files and send certain meta-fields and the content text to a Solr server.
Overall the extraction of text from binary files (PDF, Powerp…
-
Hi Pascal,
I've got a lot of PDFs which cannot be imported because of a NullPointerException in EnhancedPDFParser, e.g.:
```
test: 2016-01-11 10:45:23 DEBUG - Could not import https://japan.db.co…
-
How do I collect only .html files and skip all others?
It should be doable with ExtensionReferenceFilter, but there is no documentation about it.
TIA
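In case it helps others landing here: one way this is commonly configured, assuming Collector Core 1.x class names, is an include-mode reference filter listing the accepted extensions. A sketch:

```xml
<referenceFilters>
  <!-- Accept only URLs ending in .html or .htm.
       onMatch="include" means references NOT matching are rejected. -->
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
          onMatch="include">html,htm</filter>
</referenceFilters>
```

One caveat worth checking in your version: an include filter this strict may also reject extension-less URLs (e.g. directory pages) that the crawler needs to traverse to reach the .html files, so it may need to be combined with other filters.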
-
We have already resolved most of the issues with garbled text, but one still remains. Here is my configuration:
```
./www.hngzzx.com/progress
./www.hngzzx.com/logs
…
-
I believe the path "./" in the URL has to be treated as "/", i.e., the dot "." removed, because otherwise the crawler can go into an infinite loop under specific conditions, just like happens when cra…
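If the problem is dot segments surviving in discovered URLs, the HTTP Collector's URL normalizer may already cover this: a sketch, assuming HTTP Collector 2.x class names and that `removeDotSegments` (RFC 3986 dot-segment removal) is among the supported normalization rules in your version:

```xml
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <!-- Resolve "." and ".." path segments before a URL is queued,
       so ./foo and /foo are treated as the same reference. -->
  <normalizations>removeDotSegments</normalizations>
</urlNormalizer>
```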
-
Hi,
is there any way to crawl WordPress pages? I used the minimal XML file without results.
THX