-
Using Norconex HTTP Collector + Elasticsearch committer.
```
[non-job]: 2018-05-22 14:58:15 INFO - Version: Norconex HTTP Collector 2.7.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2018-05-22 14:58:15 INFO…
```
-
Hi,
I would like to extract experts' contact information from a site which dynamically generates a list of available experts.
I saved these dynamically created pages into a webpages-list containing fo…
-
Hi,
If I need a reference to the original input file in my own committer, how can I get that?
The add method contains a reference to the input stream after Tika extraction. What I want is the original input fil…
-
I have a question regarding continuous crawling (or scheduling, for that matter). I've read your post regarding similar topics here: https://github.com/Norconex/collector-http/issues/93. But it doe…
-
Hello! In reference to [#370](https://github.com/Norconex/collector-http/issues/370), I am trying to eliminate the MENU section of my HTML code; however, I am experiencing issues using the example pr…
-
I need to extract only certain types of files from a repository, for example .pdf, .ppt, ... I am using this configuration, but it does not work.
```xml
#set($http = "com.norconex.collect…
```
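One hedged sketch of restricting a crawl by file extension, assuming Norconex HTTP Collector 2.x: a reference filter of class `ExtensionReferenceFilter` (class path taken from Norconex Collector Core; verify it against your version) with `onMatch="include"` and a comma-separated extension list:

```xml
<!-- Sketch only: include-only reference filter for PDF and PowerPoint files.
     Class path and attribute names should be checked against your
     Norconex Collector Core version. -->
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
          onMatch="include">pdf,ppt,pptx</filter>
</referenceFilters>
```

Note that an include-only filter can also reject the HTML pages the crawler must traverse to discover those documents, so seemingly missing results may be the filter working as configured rather than a bug.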
-
Tried this:
```java
OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outputPath), Charset.forName("UTF-8").newEncoder());
```
After the crawler stops, tried saving ...
`FilesystemCollec…
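For reference, a minimal, self-contained sketch of writing and reading a file as UTF-8; the `outputPath` here is a temporary file standing in for the poster's path, and `StandardCharsets.UTF_8` replaces the `Charset.forName("UTF-8")` lookup with a compile-time constant:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8WriteDemo {
    public static void main(String[] args) throws IOException {
        // Stand-in for the poster's outputPath (any writable file works).
        Path outputPath = Files.createTempFile("crawl-", ".txt");
        String content = "caf\u00e9 in UTF-8";
        // try-with-resources flushes and closes the writer even on errors,
        // which is a common cause of "empty file after crawler stops" issues.
        try (Writer osw = new OutputStreamWriter(
                new FileOutputStream(outputPath.toFile()),
                StandardCharsets.UTF_8)) {
            osw.write(content);
        }
        // Read back with the same charset to confirm the round trip.
        String back = new String(
                Files.readAllBytes(outputPath), StandardCharsets.UTF_8);
        System.out.println(back.equals(content));
    }
}
```

The try-with-resources block matters here: if the writer is not closed before the JVM exits, buffered bytes may never reach disk.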
-
I have a workflow problem. I want to "resume" my crawler every day, let it run most of the day, and then "stop" it.
However, the collector JVM is no longer executing when it is d…
-
With the following configuration (see below), the crawler handles some URLs just fine. But with a few common URLs it unexpectedly gives an error message (the real URL is intentionally chan…
-
I am having issues isolating different crawlers to different document types so I can commit to Elasticsearch. I want to be able to utilize different crawlers for PDF, XML, HTML, images, etc. What I wou…
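One pattern, sketched under assumptions: define one `<crawler>` per document type inside `<crawlers>`, each with its own extension filter and its own Elasticsearch committer pointing at a separate index. The class paths follow Norconex naming conventions, but the committer element names (`indexName` in particular) and the filter class path are assumptions to verify against your committer and collector versions:

```xml
<!-- Sketch: separate crawlers per document type, each committing to its
     own Elasticsearch index. Element names should be checked against
     your Norconex Elasticsearch Committer version. -->
<crawlers>
  <crawler id="pdf-crawler">
    <referenceFilters>
      <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
              onMatch="include">pdf</filter>
    </referenceFilters>
    <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
      <indexName>docs-pdf</indexName>
    </committer>
  </crawler>
  <crawler id="html-crawler">
    <referenceFilters>
      <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
              onMatch="include">html,htm</filter>
    </referenceFilters>
    <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
      <indexName>docs-html</indexName>
    </committer>
  </crawler>
</crawlers>
```

Each crawler keeps its own queue and its own committer, so documents of each type land in their own index without post-filtering on the Elasticsearch side.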