Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

http collector index bleed over #650

Closed jacksonp2008 closed 4 years ago

jacksonp2008 commented 4 years ago

Using /norconex-collector-http-2.8.1

Directory structure is:

root@es-airflow:~/norconex-collector-http-2.8.1# ls
apidocs  collector-http.bat  committer-queue  customer  LICENSE.txt       NOTICE.txt  scripts
classes  collector-http.sh   examples         lib       log4j.properties  progress    third-party

The customer directory structure:

site1 --> site1-config & site1-output
site2 --> site2-config & site2-output
site3 --> site3-config & site3-output

Each config directory holds that site's crawler config (site1.xml, site2.xml, ...), and each config uses a different index (site1 index, ...).

When I run several of these crawlers at the same time, I see index bleed from one to another. That is, the site1 index ends up with references from the site2 index that should not be there. I should add that this is going to Elasticsearch, and I am using the same field names in every crawler config.

I am suspicious of the committer-queue. Should I be using a fresh copy of the norconex-collector-http-2.8.1 directory structure for each site? (That would perhaps give each site its own committer-queue.)

essiembre commented 4 years ago

It is indeed not rare to run into issues when you share the same committer-queue directory but want to send the data to different targets. Make sure to specify different queueDir for each of your committers and that should resolve your issue. Please confirm.
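
For illustration, a per-site committer section might look roughly like the sketch below. It assumes the Elasticsearch Committer 4.x element names (nodes, indexName, typeName); the node URL, index name, type name, and queue path are placeholders and are not taken from the reporter's actual config. The only part that matters here is that each site's config points queueDir at its own directory.

```xml
<!-- site1.xml (sketch): committer section only.
     Class and element names assume the Elasticsearch Committer 4.x;
     check them against the committer version you have installed. -->
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <!-- Placeholder values, not from the original thread -->
  <nodes>http://localhost:9200</nodes>
  <indexName>site1</indexName>
  <typeName>doc</typeName>
  <!-- Unique per site, so documents queued for site1 are never
       picked up by the site2 or site3 committers -->
  <queueDir>./customer/site1/site1-output/committer-queue</queueDir>
</committer>
```

site2.xml and site3.xml would follow the same pattern with their own indexName and their own queueDir (for example under site2-output and site3-output).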

jacksonp2008 commented 4 years ago

I bet that's it! I am specifying logsDir & workDir, but not queueDir. I just pulled them into different directories, and that also solved the issue. Thanks.