-
I just ran the following simple crawler:
```
./tests-output/testattribute/progress
./tests-output/testattribute/logs
http://avax.news/fact/The_Day_in_Photos_Ju…
```
-
I got it into an infinite loop with these rules:
``` xml
jpg,gif,png,ico,css,js,gz,bz,tgz
http://.*nz/.*
http://.*nz
```
This is the log file:
``` xml
[niko@dev1 norconex-collector-…
```
-
I've:
- downloaded latest code from: http://www.norconex.com/collectors/collector-http/download
- unzipped the file
- gone to the root of the unzipped dir
- run: . ./collector-http.sh -a start -c exam…
-
I would like to parse the Last-Modified date so it fits the format expected by the Solr TrieDateField class:
```
YYYY-MM-DDThh:mm:ssZ
```
(https://cwiki.apache.org/confluence/display/solr/Working+wi…
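
For reference, the conversion itself can be sketched in plain Java, assuming the header arrives in the usual HTTP date form (RFC 1123, e.g. `Wed, 21 Oct 2015 07:28:00 GMT`). The class and method names below are hypothetical and this does not use any Norconex API. It only shows the date reformatting with `java.time`:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Norconex or Solr.
final class LastModifiedToSolr {

    // Target layout: YYYY-MM-DDThh:mm:ssZ, always expressed in UTC.
    private static final DateTimeFormatter SOLR_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
                    .withZone(ZoneOffset.UTC);

    // Parses an RFC 1123 date (assumed input format) and re-renders it in UTC.
    static String toSolrDate(String httpDate) {
        ZonedDateTime parsed =
                ZonedDateTime.parse(httpDate, DateTimeFormatter.RFC_1123_DATE_TIME);
        Instant instant = parsed.toInstant();
        return SOLR_FORMAT.format(instant);
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate("Wed, 21 Oct 2015 07:28:00 GMT"));
    }
}
```

A header value in any other layout would make `ZonedDateTime.parse` throw a `DateTimeParseException`, so real input probably needs a fallback.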
-
Hi,
I've used Heritrix for a while, so I understand how to crawl websites. But since I'm not satisfied with Heritrix, I'm currently looking at alternatives.
Norconex's API docs are good and the XML c…
-
Hello,
I have the following configuration (file is attached):
[config.txt](https://github.com/Norconex/collector-http/files/335445/config.txt)
Now, when I run the job, I see the following result:
INFO [CrawlerEventMana…
-
Hi Pascal,
I'm doing a small project with the Norconex HTTP Collector that fetches news from a big news website when my city appears in the keywords metadata field.
The fetching works well but the …
-
I wonder if there is a way to get the URL of the page that contains the fetched URL.
-
Hi Pascal,
you won't believe it, but I just found another encoding issue :smile:
Source code:
``` html
```
The parser cannot recognize the content correctly (for HTML entities is used UTF-8 an…
-
I want the importer to accept only pages whose URL matches a regexp from the config.
I believe the Java class `RegexMetadataFilter` does that. The question is: which metadata field matches the page's URL, and does it exist at …
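
To make the intended behaviour concrete, here is a minimal sketch of "accept only URLs matching a regexp" in plain Java. It uses only `java.util.regex` and does not call `RegexMetadataFilter` or any other Norconex class. The class name, method name, and example URLs are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration of the desired filtering, independent of Norconex.
final class UrlRegexAccept {

    // Keeps only the URLs that match the regex in full.
    static List<String> accept(List<String> urls, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return urls.stream()
                .filter(url -> pattern.matcher(url).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/news/1.html",
                "http://example.com/about");
        System.out.println(accept(urls, "http://example\\.com/news/.*"));
    }
}
```

Note that `matches()` requires the whole URL to match, so a partial pattern needs a leading or trailing `.*`.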