-
Here you can see some exceptions I got:
```
MC (crawler): 2015-08-05 17:57:10 WARN - Could not queue extracted URL "http://www.feccoo-extremadura.org/ensenanzaextremadura/Areas_Comunes:Salud_Laboral_…
-
What is the best approach to fix this?
Exception in thread "pool-1-thread-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.fontbox.ttf.TTFDataStream.readString(TTFDataStre…
-
Hi,
I want to collect pages from an RSS feed.
This is my crawler, but I get no results.
Please help me.
``` xml
./examples-output/minimum/progress
./examples-output/minimum/logs
4
1
-1
…
-
_Post from @csaezl, moved from https://github.com/Norconex/collector-http/issues/100#issuecomment-100172544_:
I've got an error, perhaps not related to the issue itself:
From the log:
```
INFO - Sen…
-
Does the crawler look only at the HttpMetadataChecksummer, or also at the documentChecksummer, to decide whether to redownload pages?
A combination of content and modified date would give a better indication wh…
-
Hi,
I would like to know how to configure the collector to collect only the images referenced in a URL.
I am writing a custom committer that needs to send to a REST API all URLs in a website that have images in t…
-
I have a collector with `` that processes the URL `http://www.fexb.es/`. At some point it processes the `https://www.facebook.com/r.php?locale=es_ES` web page and others from the Facebook site.
There is anoth…
-
I need the crawler to reject some URLs, those that include `/../`. I use the filter:
```
#set($filterRegexRef = "com.norconex.collector.core.filter.impl.RegexReferenceFilter")
...
*/\.\./*
```…
-
How do I open MapDB files?
-
This issue derailed the OOM discussion, so a new issue was created.
I added some logging to the ReplaceTransformer and found out that certain PDFs have a concatenated string of the first line (7 tim…