-
Some weird stuff happens when I crawl more than 1,000 URLs.
Originally, I set it up with 440,000 URLs and a single crawler, then started it. But no INFO messages appear like "DOCUMENT_IMPORTED" or "REJECTED_FIL…
-
My Crawler Name: 2016-04-20 14:41:50 ERROR - My Crawler Name: Could not mark reference as processed: URL (can't serialize class com.norconex.commons.lang.file.ContentType)
java.lang.IllegalArgumentExc…
-
Hi!
Some PDFs still cannot be parsed:
```
www.db.com: 2016-01-18 11:54:17 WARN - Could not import https://www.db.com/ir/en/download/DB_Interim_Report_1Q2015.pdf
com.norconex.importer.parser.DocumentP…
-
I'm using the Norconex HTTP Collector to crawl HTML files and send certain meta-fields and the content text to a Solr server.
What I now want to do is to only send text from the content to the Solr s…
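Not a definitive answer, since the question is cut off, but a possible direction: the Norconex Importer module can restrict which metadata fields survive to the committer, and the Solr Committer lets you choose the Solr field that receives the extracted content text. A minimal sketch, assuming Importer/Committer 2.x class names; the field names (`title`, `keywords`, `content`) and the Solr URL are placeholders:

```xml
<importer>
  <postParseHandlers>
    <!-- Keep only the listed metadata fields; all other meta-fields are
         dropped before the document reaches the committer.
         Field names are examples only. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
            fields="title,keywords,document.reference"/>
  </postParseHandlers>
</importer>

<committer class="com.norconex.committer.solr.SolrCommitter">
  <solrURL>http://localhost:8983/solr/mycore</solrURL>
  <!-- Solr field that should receive the extracted text content
       (assumed parameter name; check your committer version). -->
  <targetContentField>content</targetContentField>
</committer>
```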
-
We are using the Norconex HTTP Collector to crawl HTML & binary files and send certain meta-fields and the content text to a Solr server.
Overall the extraction of text from binary files (PDF, Powerp…
-
Hi Pascal,
I've got a lot of PDFs which cannot be imported because of a NullPointerException in EnhancedPDFParser, e.g.:
```
test: 2016-01-11 10:45:23 DEBUG - Could not import https://japan.db.co…
-
How do I collect only .html files and skip all others?
It should be doable with ExtensionReferenceFilter, but there is no documentation about it.
TIA
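In case it helps others landing here: one way this is commonly configured, assuming Collector Core 1.x class names, is an include-mode reference filter listing the accepted extensions. A sketch:

```xml
<referenceFilters>
  <!-- Accept only URLs ending in .html or .htm.
       onMatch="include" means references NOT matching are rejected. -->
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
          onMatch="include">html,htm</filter>
</referenceFilters>
```

One caveat worth checking in your version: an include filter this strict may also reject extension-less URLs (e.g. directory pages) that the crawler needs to traverse to reach the .html files, so it may need to be combined with other filters.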
-
We have already resolved most of the issues with garbled text, but one still remains. Here is my configuration:
```
./www.hngzzx.com/progress
./www.hngzzx.com/logs
…
-
I believe the path "./" in the URL has to be treated as "/", i.e., the dot "." removed, because otherwise the crawler can go into an infinite loop under specific conditions, just like happens when cra…
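If the problem is dot segments surviving in discovered URLs, the HTTP Collector's URL normalizer may already cover this: a sketch, assuming HTTP Collector 2.x class names and that `removeDotSegments` (RFC 3986 dot-segment removal) is among the supported normalization rules in your version:

```xml
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <!-- Resolve "." and ".." path segments before a URL is queued,
       so ./foo and /foo are treated as the same reference. -->
  <normalizations>removeDotSegments</normalizations>
</urlNormalizer>
```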
-
Hi,
is there any way to crawl WordPress pages? I used the minimal XML file without results.
THX