Norconex / importer

Norconex Importer is a Java library and command-line application for parsing and extracting content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it lets you perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Push index to AWS OpenSearch through tunneling #117

Open caoyang1211 opened 2 years ago

caoyang1211 commented 2 years ago

I created an OpenSearch domain on AWS inside a VPC. To read from or write to the domain from my laptop, I have to run SecureCRT, create a session to a Bastion server that runs on AWS and has access to the VPC, and set up port forwarding to redirect traffic from https://127.0.0.1:60443 to the OpenSearch domain https://vpc-*****-us-east-1-1-1-yzdjblkpyhbyaytxcnbzrxavpm.us-east-1.es.amazonaws.com.

I verified that port forwarding was working by running a curl command that indexes multiple JSON files into that OpenSearch domain, using the --insecure option to skip certificate checks:

    curl -H "Content-Type:application/json" --insecure -XPOST "https://127.0.0.1:60443/_bulk" --data-binary "@TutorialVideoDbRecords.json"

To use the Norconex crawler to index web pages into the OpenSearch domain, I set https://localhost:60443 in the Norconex config file and ran the crawler. It reported "Failure occured on node: "null"" and "Host name 'localhost' does not match the certificate subject provided by the peer (CN=*.us-east-1.es.amazonaws.com)".

So it looks like the problem is a certificate validation failure. Is there an option in the config file to ignore certificate checks, like the --insecure option in curl? My configuration is as follows; no user credentials are required to access the OpenSearch domain inside the VPC:

        <committer class="ElasticsearchCommitter">
            <nodes>https://localhost:60443</nodes>
            <indexName>tutorials_videos</indexName>
        </committer>

The error messages I got after running the crawler are as follows:

    10:42:22.020 [tutorial video#1] ERROR ElasticsearchCommitter - Failure occured on node: "null". Check node logs.
    10:42:22.022 [tutorial video#1] ERROR COMMITTER_BATCH_ERROR - CommitterEvent[connectionTimeout=1000,credentials=Credentials[username=,password=****,passwordKey=],discoverNodes=false,dotReplacement=,fixBadIds=false,ignoreResponseErrors=false,indexName=tutorials_videos,jsonFieldsPattern=,socketTimeout=30000,sourceIdField=,targetContentField=content,typeName=,queue=FSQueue[batchSize=20,commitLeftoversOnInit=false,ignoreErrors=false,maxPerFolder=500,retrier=Retrier[exceptionFilter=,maxCauses=10,maxRetries=0,retryDelay=0],splitBatch=OFF],committerContext=CommitterContext[eventManager=com.norconex.commons.lang.event.EventManager@6a5e167a,streamFactory=com.norconex.commons.lang.io.CachedStreamFactory@60e06f7d,workDir=.\tutorial_video\MP_32_Collector\tutorial_32_video\committer\0],fieldMappings=com.norconex.committer.core3.CommitterException: Could not commit JSON batch to Elasticsearch.,restrictions=[],request=]
    10:42:22.127 [tutorial video#2] INFO DOCUMENT_COMMITTED_UPSERT - https://kidshealth.org/en/teens/center/concussions-ctr.html - Committers: ElasticsearchCommitter
    10:42:22.129 [tutorial video#1] ERROR COMMITTER_UPSERT_ERROR - CommitterEvent[connectionTimeout=1000,credentials=Credentials[username=,password=****,passwordKey=],discoverNodes=false,dotReplacement=,fixBadIds=false,ignoreResponseErrors=false,indexName=tutorials_videos,jsonFieldsPattern=,socketTimeout=30000,sourceIdField=,targetContentField=content,typeName=,queue=FSQueue[batchSize=20,commitLeftoversOnInit=false,ignoreErrors=false,maxPerFolder=500,retrier=Retrier[exceptionFilter=,maxCauses=10,maxRetries=0,retryDelay=0],splitBatch=OFF],committerContext=CommitterContext[eventManager=com.norconex.commons.lang.event.EventManager@6a5e167a,streamFactory=com.norconex.commons.lang.io.CachedStreamFactory@60e06f7d,workDir=.\tutorial_video\MP_32_Collector\tutorial_32_video\committer\0],fieldMappings=com.norconex.committer.core3.batch.queue.CommitterQueueException: Could not process one or more files form committer batch located at C:\Users\pantr\eclipse-workspace\medlineplus-crawler-http.\tutorial_video\MP_32_Collector\tutorial_32_video\committer\0\queue\batch-1656600137136000000. Moved them to error directory: C:\Users\pantr\eclipse-workspace\medlineplus-crawler-http.\tutorial_video\MP_32_Collector\tutorial_32_video\committer\0\error,restrictions=[],request=UpsertRequest[reference=https://kidshealth.org/en/parents/pilonidal_gips_animation.html]]
    10:42:22.129 [tutorial video#1] ERROR CrawlerCommitterService - Could not execute "upsert" on committer: ElasticsearchCommitter[connectionTimeout=1000,credentials=Credentials[username=,password=****,passwordKey=],discoverNodes=false,dotReplacement=,fixBadIds=false,ignoreResponseErrors=false,indexName=tutorials_videos,jsonFieldsPattern=,socketTimeout=30000,sourceIdField=,targetContentField=content,typeName=,queue=FSQueue[batchSize=20,commitLeftoversOnInit=false,ignoreErrors=false,maxPerFolder=500,retrier=Retrier[exceptionFilter=,maxCauses=10,maxRetries=0,retryDelay=0],splitBatch=OFF],committerContext=CommitterContext[eventManager=com.norconex.commons.lang.event.EventManager@6a5e167a,streamFactory=com.norconex.commons.lang.io.CachedStreamFactory@60e06f7d,workDir=.\tutorial_video\MP_32_Collector\tutorial_32_video\committer\0],fieldMappings={},restrictions=[]]
    com.norconex.committer.core3.batch.queue.CommitterQueueException: Could not process one or more files form committer batch located at C:\Users\pantr\eclipse-workspace\medlineplus-crawler-http.\tutorial_video\MP_32_Collector\tutorial_32_video\committer\0\queue\batch-1656600137136000000. Moved them to error directory: C:\Users\pantr\eclipse-workspace\medlineplus-crawler-http.\tutorial_video\MP_32_Collector\tutorial_32_video\committer\0\error
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.moveUnrecoverableBatchError(FSQueue.java:429) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeSplitableBatchDirectory(FSQueue.java:364) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeBatchDirectory(FSQueue.java:338) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.queue(FSQueue.java:331) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.committer.core3.batch.AbstractBatchCommitter.doUpsert(AbstractBatchCommitter.java:87) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.committer.core3.AbstractCommitter.upsert(AbstractCommitter.java:215) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.collector.core.crawler.CrawlerCommitterService.lambda$upsert$1(CrawlerCommitterService.java:84) ~[norconex-collector-core-2.0.0.jar:2.0.0]
        at com.norconex.collector.core.crawler.CrawlerCommitterService.executeAll(CrawlerCommitterService.java:129) [norconex-collector-core-2.0.0.jar:2.0.0]
        at com.norconex.collector.core.crawler.CrawlerCommitterService.upsert(CrawlerCommitterService.java:80) [norconex-collector-core-2.0.0.jar:2.0.0]
        at com.norconex.collector.core.pipeline.committer.CommitModuleStage.execute(CommitModuleStage.java:30) [norconex-collector-core-2.0.0.jar:2.0.0]
        at com.norconex.collector.core.pipeline.committer.CommitModuleStage.execute(CommitModuleStage.java:24) [norconex-collector-core-2.0.0.jar:2.0.0]
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91) [norconex-commons-lang-2.0.0.jar:2.0.0]
        at com.norconex.collector.http.crawler.HttpCrawler.executeCommitterPipeline(HttpCrawler.java:388) [norconex-collector-http-3.0.0.jar:3.0.0]
        at com.norconex.collector.core.crawler.Crawler.processImportResponse(Crawler.java:681) [norconex-collector-core-2.0.0.jar:2.0.0]
        at com.norconex.collector.core.crawler.Crawler.processNextQueuedCrawlData(Crawler.java:614) [norconex-collector-core-2.0.0.jar:2.0.0]
        at com.norconex.collector.core.crawler.Crawler.processNextReference(Crawler.java:556) [norconex-collector-core-2.0.0.jar:2.0.0]
        at com.norconex.collector.core.crawler.Crawler$ProcessReferencesRunnable.run(Crawler.java:923) [norconex-collector-core-2.0.0.jar:2.0.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
    Caused by: com.norconex.committer.core3.batch.queue.CommitterQueueException: Could not consume batch. Number of attempts: 1
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeRetriableBatch(FSQueue.java:407) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeSplitableBatchDirectory(FSQueue.java:356) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        ... 18 more
    Caused by: com.norconex.commons.lang.exec.RetriableException: Execution failed, maximum number of retries reached.
        at com.norconex.commons.lang.exec.Retrier.execute(Retrier.java:204) ~[norconex-commons-lang-2.0.0.jar:2.0.0]
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeRetriableBatch(FSQueue.java:395) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeSplitableBatchDirectory(FSQueue.java:356) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        ... 18 more
    Caused by: com.norconex.committer.core3.CommitterException: Could not commit JSON batch to Elasticsearch.
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:543) ~[norconex-committer-elasticsearch-5.0.0.jar:5.0.0]
        at com.norconex.committer.core3.batch.AbstractBatchCommitter.consume(AbstractBatchCommitter.java:112) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.lambda$consumeRetriableBatch$1(FSQueue.java:398) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.commons.lang.exec.Retrier.execute(Retrier.java:177) ~[norconex-commons-lang-2.0.0.jar:2.0.0]
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeRetriableBatch(FSQueue.java:395) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeSplitableBatchDirectory(FSQueue.java:356) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        ... 18 more
    Caused by: java.io.IOException: Host name 'localhost' does not match the certificate subject provided by the peer (CN=*.us-east-1.es.amazonaws.com)
        at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:901) ~[elasticsearch-rest-client-7.16.2.jar:7.16.2]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:288) ~[elasticsearch-rest-client-7.16.2.jar:7.16.2]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:276) ~[elasticsearch-rest-client-7.16.2.jar:7.16.2]
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:537) ~[norconex-committer-elasticsearch-5.0.0.jar:5.0.0]
        at com.norconex.committer.core3.batch.AbstractBatchCommitter.consume(AbstractBatchCommitter.java:112) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.lambda$consumeRetriableBatch$1(FSQueue.java:398) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.commons.lang.exec.Retrier.execute(Retrier.java:177) ~[norconex-commons-lang-2.0.0.jar:2.0.0]
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeRetriableBatch(FSQueue.java:395) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        at com.norconex.committer.core3.batch.queue.impl.FSQueue.consumeSplitableBatchDirectory(FSQueue.java:356) ~[norconex-committer-core-3.0.0.jar:3.0.0]
        ... 18 more
    Caused by: javax.net.ssl.SSLPeerUnverifiedException: Host name 'localhost' does not match the certificate subject provided by the peer (CN=*.us-east-1.es.amazonaws.com)
        at org.apache.http.nio.conn.ssl.SSLIOSessionStrategy.verifySession(SSLIOSessionStrategy.java:209) ~[httpasyncclient-4.1.4.jar:4.1.4]
        at org.apache.http.nio.conn.ssl.SSLIOSessionStrategy$1.verify(SSLIOSessionStrategy.java:188) ~[httpasyncclient-4.1.4.jar:4.1.4]
        at org.apache.http.nio.reactor.ssl.SSLIOSession.doHandshake(SSLIOSession.java:360) ~[httpcore-nio-4.4.12.jar:4.4.12]
        at org.apache.http.nio.reactor.ssl.SSLIOSession.isAppInputReady(SSLIOSession.java:523) ~[httpcore-nio-4.4.12.jar:4.4.12]
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:120) ~[httpcore-nio-4.4.12.jar:4.4.12]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[httpcore-nio-4.4.12.jar:4.4.12]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[httpcore-nio-4.4.12.jar:4.4.12]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[httpcore-nio-4.4.12.jar:4.4.12]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[httpcore-nio-4.4.12.jar:4.4.12]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.12.jar:4.4.12]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[httpcore-nio-4.4.12.jar:4.4.12]
        ... 1 more
    10:42:22.149 [tutorial video#1] INFO Crawler - Could not process document: https://kidshealth.org/en/parents/pilonidal_gips_animation.html (Could not execute "upsert" on 1 committer(s): "ElasticsearchCommitter". Check the logs for more details.)
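For what it's worth, the root cause in the trace above is ordinary TLS hostname verification: a wildcard certificate for `*.us-east-1.es.amazonaws.com` only matches names with one extra label under that domain, so `localhost` is rejected during the handshake, before any request is sent. Below is a toy sketch of the wildcard rule, assuming a simplified RFC 2818-style match (this is illustrative only, not the actual verifier code in Apache HttpClient):

```java
// Simplified wildcard hostname check, for illustration only.
// Real verifiers are stricter (e.g. they also consult subjectAltName entries).
public class HostnameCheck {

    /** Returns true if hostname matches the certificate name (supports a leading "*."). */
    public static boolean matches(String hostname, String certName) {
        if (certName.startsWith("*.")) {
            // The wildcard covers exactly one left-most label of the hostname.
            int firstDot = hostname.indexOf('.');
            return firstDot > 0
                && hostname.substring(firstDot + 1).equalsIgnoreCase(certName.substring(2));
        }
        return hostname.equalsIgnoreCase(certName);
    }

    public static void main(String[] args) {
        // "localhost" has no label under the wildcard domain -> rejected.
        System.out.println(matches("localhost", "*.us-east-1.es.amazonaws.com"));          // false
        // The real VPC endpoint name would match.
        System.out.println(matches("vpc-example.us-east-1.es.amazonaws.com",
                                   "*.us-east-1.es.amazonaws.com"));                       // true
    }
}
```

This is also why curl's --insecure flag "fixes" it: it skips verification entirely. An alternative that keeps verification intact is to make the name the client connects with match the certificate, e.g. by mapping the real endpoint hostname to 127.0.0.1 in the local hosts file and tunneling port 443, then using the real hostname in the committer config.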

essiembre commented 2 years ago

Do you still have the issue? I do not think there is a way to ignore certificates for now. I think your best bet would be to give direct access from your crawler server to your OpenSearch instance (shared VPC, IP whitelisting, etc.).