Closed: jacksonp2008 closed this 4 years ago
It is indeed common to run into issues when crawlers share the same committer-queue directory but send their data to different targets. Make sure to specify a different queueDir for each of your committers; that should resolve your issue. Please confirm.
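As a rough sketch of what that could look like, the fragments below give each crawler's committer its own queueDir. The paths and index names are assumptions based on the directory layout described in this thread, the committer class is assumed to be the Norconex Elasticsearch committer, and every other committer setting is omitted.

```xml
<!-- site1.xml (fragment): committer section only, other settings omitted -->
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <indexName>site1</indexName>
  <!-- Assumed path: any directory used by this crawler alone will do -->
  <queueDir>./customer/site1/site1-output/committer-queue</queueDir>
</committer>

<!-- site2.xml (fragment): same structure, different index and queue -->
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <indexName>site2</indexName>
  <!-- Assumed path: must not be shared with site1 -->
  <queueDir>./customer/site2/site2-output/committer-queue</queueDir>
</committer>
```

With separate queue directories, documents queued by one crawler should no longer be picked up and committed by another crawler's committer.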
I bet that's it! I am specifying logsDir & workDir, but not queueDir. I just pulled them into different directories and that also solved the issue. Thanks.
Using /norconex-collector-http-2.8.1
Directory structure is:
root@es-airflow:~/norconex-collector-http-2.8.1# ls
apidocs  collector-http.bat  committer-queue  customer  LICENSE.txt       NOTICE.txt  scripts
classes  collector-http.sh   examples         lib       log4j.properties  progress    third-party
customer directory structure:
site1 --> site1-config & site1-output
site2 --> site2-config & site2-output
site3 --> site3-config & site3-output
Each config directory contains a crawler config (site1.xml, site2.xml, ...), each using a different index (site1 index, ...).
When I run several of these crawlers at the same time, I am seeing index bleed from one to another. That is, the site1 index will have references from the site2 index that should not be there. I might also add that this is going to Elasticsearch, and I am using the same field names in every crawler config.
I am suspicious of the committer-queue. Should I be using a fresh copy of the norconex-collector-http-2.8.1 directory structure for each site? (This would perhaps allow for a unique committer-queue.)