-
Using the stripBetween transformer to delete headers and footers from documents in preParseHandlers
For most documents this works fine, but on some specific pages the footer is not removed. On all pages the markup…
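A common cause of "works on most pages, not on some" is markup that varies slightly between pages (extra attributes, whitespace, or casing), so the configured start/end strings no longer match. A minimal sketch of the handler in question, assuming Norconex 2.x and hypothetical marker strings (adjust `start`/`end` to what actually appears in your pages):

```xml
<importer>
  <preParseHandlers>
    <!-- Strips everything between the markers, including the markers
         themselves (inclusive="true"); case-insensitive matching helps
         when the markup varies in casing between pages. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer"
        inclusive="true" caseSensitive="false">
      <stripBetween>
        <start><![CDATA[<div id="footer">]]></start>
        <end><![CDATA[</div>]]></end>
      </stripBetween>
    </transformer>
  </preParseHandlers>
</importer>
```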
-
I have a database of URLs relevant to one or more health topics. I am indexing these existing health topics, for which I've written:
* A URL provider that returns them from a database
* A tagger …
-
Hi, I have ~4.7M files already indexed and re-ran the crawler to see how long a second crawl would take. The first (initial) crawl took 1 day 10 hours. The second attempt I started la…
-
Hello all,
While crawling a huge website, I would sometimes run into trouble with the ID of my document being too large (in the case of CloudSearch, for example).
I wanted to know if it's pos…
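One workaround sketch, assuming Norconex 2.x (the field name is illustrative and the committer class is elided): generate a short, fixed-length identifier with `UUIDTagger` and point the committer's `sourceReferenceField` at it, so the target ID is the UUID rather than the possibly very long URL:

```xml
<importer>
  <postParseHandlers>
    <!-- Adds a fixed-length UUID to every document under "generated_id". -->
    <tagger class="com.norconex.importer.handler.tagger.impl.UUIDTagger"
        field="generated_id" overwrite="true" />
  </postParseHandlers>
</importer>

<committer class="...">
  <!-- Use the short UUID as the document ID instead of the URL. -->
  <sourceReferenceField keep="false">generated_id</sourceReferenceField>
</committer>
```

One caveat: a random UUID changes on every crawl, so incremental updates and deletions will not line up with previously committed documents; a deterministic hash of the URL would be safer for recurring crawls.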
-
I'm attempting to crawl a password protected wiki that we use for internal documentation and I'm struggling with getting authentication to work. I've tried to use form authentication as well as basic…
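For reference, a form-authentication sketch using the 2.x `GenericHttpClientFactory` (the URL and form field names below are hypothetical; they must match the wiki's actual login form):

```xml
<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
  <!-- "form" posts the credentials to authURL before crawling starts. -->
  <authMethod>form</authMethod>
  <authURL>https://wiki.example.com/login</authURL>
  <authUsernameField>username</authUsernameField>
  <authUsername>crawler</authUsername>
  <authPasswordField>password</authPasswordField>
  <authPassword>secret</authPassword>
</httpClientFactory>
```

A frequent gotcha is the field names: they must be the `name` attributes of the HTML `<input>` elements on the login page, not their visible labels.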
-
I have 4 requirements to configure specific aspects of ES:
1. Set field limit to 2000
2. Create custom analyzers and tokenizers
3. Create nested fields
4. Set specific field properties (document…
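All four requirements are Elasticsearch index settings/mappings rather than crawler settings, so they are typically applied when creating the index, before the committer first writes to it. A sketch of an index-creation body covering each point (field, analyzer, and tokenizer names are illustrative; exact syntax depends on your ES version):

```json
{
  "settings": {
    "index.mapping.total_fields.limit": 2000,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_tokenizer": { "type": "ngram", "min_gram": 3, "max_gram": 4 }
      }
    }
  },
  "mappings": {
    "properties": {
      "title":   { "type": "text", "analyzer": "my_analyzer" },
      "authors": {
        "type": "nested",
        "properties": { "name": { "type": "keyword" } }
      }
    }
  }
}
```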
-
Hi Pascal,
I was taking a look at the XMLFileCommitter in this webpage: https://www.norconex.com/collectors/committer-core/latest/apidocs/com/norconex/committer/core/impl/XMLFileCommitter.html
a…
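For context, `XMLFileCommitter` writes additions and deletions to local XML files instead of a remote repository, which is handy for inspecting what the crawler would commit. A minimal configuration sketch (the directory is illustrative):

```xml
<committer class="com.norconex.committer.core.impl.XMLFileCommitter">
  <!-- Where the generated XML files are written. -->
  <directory>/tmp/committed-xml</directory>
</committer>
```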
-
Hello community,
I'm new to Norconex and ended up doing this to try to optimize my website-crawling scenario:
```
java -server -Xms2048m -Xmx2048m -XX:NewSize=512m -XX:MaxNewSize=512m -XX:P…
-
I'm using the Norconex crawler on the Facebook Graph API /events/ endpoint and it is crawling the data, but when it commits it to Elasticsearch, Kibana sees the data as one block, so it cannot "index" it.
As I kn…
-
I'm very new to Norconex and am trying to configure it to crawl a site and add it to an existing Solr index. I've got a lot of issues, but I'll start with this one. When I run the crawler, it is inclu…
dkh7m updated 7 years ago