-
Hey guys, could you provide me with a config that filters out all content except email addresses? I can't seem to figure this one out. Thanks!
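For what it's worth, the core of "keep only the email addresses" is a regular-expression match over the extracted text. The snippet below is a minimal sketch of that idea in plain Java, not a Norconex configuration: the class name `EmailExtractor` and the pattern are assumptions made for illustration, and the pattern is deliberately simple rather than a full RFC 5322 matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any Norconex library.
public class EmailExtractor {

    // Simplified email pattern (an assumption): it covers common addresses
    // but not every form that the email RFCs allow.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns every substring of the text that matches the pattern,
    // in order of appearance. Everything else is discarded.
    public static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(
                extract("Contact alice@example.com or bob@example.org."));
    }
}
```

The same pattern could presumably be reused wherever the crawler accepts a regular expression, but how to wire that into a collector config is not shown here.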
-
Hello there!
I have been looking for a simple web crawler, found this project, and like it very much. The problem is that I can't find any useful tutorials for dummies and don't know how to …
-
I am attempting to use the httpClientFactory in a complex-config.xml, but I am not seeing any attempt by the crawler to authenticate with the server. I have Wireshark running, and it is just doing the …
-
Hi there,
After trying to work with your collector, which is very nice by the way, I am getting some errors that I can trace to the Tika jar, according to this post:
[(https://issues.apache.org/jira/browse/…
-
Hi,
I've been using DOMTagger and DOMSplitter a lot in my crawlers, as I'm used to this way of simply extracting data from web pages (note: I come from the Heritrix world, using XPaths...).
In the do…
-
```
java.lang.OutOfMemoryError: Java heap space
    at com.norconex.commons.lang.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:357)
    at com.norconex.commons.lang.io.Cached…
```
-
I've got the following as my crawler configuration:
```
http://wiki.linaro.org/
#parse("shared/importer-config.xml")
…
```
-
Hi, I encountered the following problem:
I had 2 crawlers:
1) The vnexpress crawler will commit data into the vnexpress type in ElasticSearch.
2) The dantri crawler will commit data into the dantri type in ElasticSe…
-
I'm not sure it's worth opening a ticket for this, but I have a remark on the way the various components of the Norconex "solution" are distributed:
I recently had to use a collector-filesystem rather…
-
Following your suggestion in #26, I'm opening this new ticket to suggest adding a "fromField" to the DOMTagger, so that the user could split the original page into pieces with a first D…