VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.
http://ache.readthedocs.io
Apache License 2.0

ACHE Tor crawler cannot create index in Elasticsearch, please help. #203

Closed: chanwitkepha closed this issue 3 years ago

chanwitkepha commented 3 years ago

I have ACHE, the Tor proxy, and Elasticsearch on a single VM server.

The Tor proxy and Elasticsearch work fine, but when I run ACHE from Docker Compose, the following error appears in the docker compose logs:

ache    | [2021-05-13 02:11:51,930] INFO [main] (TargetRepositoryFactory.java:57) - Loading repository with data_format=ELASTICSEARCH from /data/default/data_pages
ache    | [2021-05-13 02:11:52,546] INFO [main] (ElasticSearchClientFactory.java:42) - Initialized Elasticsearch REST client for hosts: [http://127.0.0.1:9200]
ache    | [2021-05-13 02:11:52,774] INFO [main] (ElasticSearchClientFactory.java:71) - [Content-Length: 548,Chunked: false]
ache    | [2021-05-13 02:11:52,789] INFO [main] (ElasticSearchRestTargetRepository.java:65) - Elasticsearch version: 7
ache    | [2021-05-13 02:11:52,815]ERROR [main] (Main.java:260) - Crawler execution failed: Failed to create index in Elasticsearch.
ache    |
ache    | java.lang.RuntimeException: Failed to create index in Elasticsearch.
ache    |       at achecrawler.target.repository.ElasticSearchRestTargetRepository.createIndexMapping(ElasticSearchRestTargetRepository.java:122)
ache    |       at achecrawler.target.repository.ElasticSearchRestTargetRepository.<init>(ElasticSearchRestTargetRepository.java:47)
ache    |       at achecrawler.target.TargetRepositoryFactory.createRepository(TargetRepositoryFactory.java:87)
ache    |       at achecrawler.target.TargetRepositoryFactory.create(TargetRepositoryFactory.java:34)
ache    |       at achecrawler.target.TargetStorage.create(TargetStorage.java:131)
ache    |       at achecrawler.crawler.async.AsyncCrawler.create(AsyncCrawler.java:117)
ache    |       at achecrawler.crawler.CrawlersManager.createCrawler(CrawlersManager.java:104)
ache    |       at achecrawler.Main$StartCrawl.run(Main.java:246)
ache    |       at achecrawler.Main.main(Main.java:59)
ache    | Caused by: org.elasticsearch.client.ResponseException: PUT http://127.0.0.1:9200/tor: HTTP/1.1 400 Bad Request
ache    | {"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"Root mapping definition has unsupported parameters:  [page : {properties={isRelevant={index=true, type=keyword}, crawlerId={index=true, type=keyword}, domain={index=true, type=keyword}, words={index=true, type=keyword}, wordsMeta={index=true, type=keyword}, retrieved={format=dateOptionalTime, type=date}, text={type=text}, title={type=text}, url={index=true, type=keyword}, relevance={type=double}, topPrivateDomain={index=true, type=keyword}}}]"}],"type":"mapper_parsing_exception","reason":"Failed to parse mapping [_doc]: Root mapping definition has unsupported parameters:  [page : {properties={isRelevant={index=true, type=keyword}, crawlerId={index=true, type=keyword}, domain={index=true, type=keyword}, words={index=true, type=keyword}, wordsMeta={index=true, type=keyword}, retrieved={format=dateOptionalTime, type=date}, text={type=text}, title={type=text}, url={index=true, type=keyword}, relevance={type=double}, topPrivateDomain={index=true, type=keyword}}}]","caused_by":{"type":"mapper_parsing_exception","reason":"Root mapping definition has unsupported parameters:  [page : {properties={isRelevant={index=true, type=keyword}, crawlerId={index=true, type=keyword}, domain={index=true, type=keyword}, words={index=true, type=keyword}, wordsMeta={index=true, type=keyword}, retrieved={format=dateOptionalTime, type=date}, text={type=text}, title={type=text}, url={index=true, type=keyword}, relevance={type=double}, topPrivateDomain={index=true, type=keyword}}}]"}},"status":400}
ache    |       at org.elasticsearch.client.RestClient$1.completed(RestClient.java:354)
ache    |       at org.elasticsearch.client.RestClient$1.completed(RestClient.java:343)
ache    |       at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:123)
ache    |       at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
ache    |       at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
ache    |       at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
ache    |       at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
ache    |       at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
ache    |       at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
ache    |       at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
ache    |       at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
ache    |       at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
ache    |       at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
ache    |       at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
ache    |       at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
ache    |       at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
ache    |       at java.base/java.lang.Thread.run(Unknown Source)

My ACHE config file (ache.yml)

# Configure ELASTICSEARCH and FILES data formats
target_storage.data_formats:
  - ELASTICSEARCH
#  - FILES
#   - FILESYSTEM_JSON

target_storage.data_format.elasticsearch.rest.hosts:
  - http://127.0.0.1:9200

target_storage.data_format.elasticsearch.rest.connect_timeout: 30000
target_storage.data_format.elasticsearch.rest.socket_timeout: 30000
target_storage.data_format.elasticsearch.rest.max_retry_timeout_millis: 90000

# Basic configuration in-depth web site crawling
link_storage.link_strategy.use_scope: true
link_storage.link_strategy.outlinks: true
link_storage.scheduler.host_min_access_interval: 1000

# Configure ACHE to download .onion URLs through the TOR proxy container
crawler_manager.downloader.torproxy: http://127.0.0.1:8118

I use tor.seeds with some example .onion sites from the git repo:

http://zqktlwi4fecvo6ri.onion/wiki/index.php/Main_Page
http://e266al32vpuorbyg.onion/bookmarks.php
http://kpynyvym6xqi7wz2.onion/links.html
http://3fyb44wdhnd2ghhl.onion/wiki/index.php?title=Main_Page
http://3g2upl4pq6kufc4m.onion/
http://xmh57jrzrnw6insl.onion/
http://32rfckwuorlf4dlv.onion/
http://5plvrsgydwy2sgce.onion/
http://2vlqpcqpjlhmd5r2.onion/
http://nlmymchrmnlmbnii.onion/
http://wiki5kauuihowqi5.onion/
http://kpvz7ki2v5agwt35.onion/
http://idnxcnkne4qt76tg.onion/
http://torlinkbgs6aabns.onion/
http://jh32yv5zgayyyts3.onion/
http://wikitjerrta4qgz4.onion/
http://xdagknwjc7aaytzh.onion/
http://3fyb44wdhnd2ghhl.onion/
http://j6im4v42ur6dpic3.onion/
http://p3igkncehackjtib.onion/
http://kbhpodhnfxl3clb4.onion/
http://cipollatnumrrahd.onion/
http://dppmfxaacucguzpc.onion/

My docker-compose.yml file for ache.

services:
  ache:
    image: vidanyu/ache
    container_name: ache
    command: startCrawl -c /config/ -s /config/tor.seeds -o /data --elasticIndex tor
    ports:
      - "8080:8080"
    volumes:
      # mounts /config and /data directories to paths relative to path where this file is located
      - /tor-crawler-data:/data
      - ./:/config
    tty: true
    network_mode: host

My docker-compose file for ElasticSearch and Kibana.

version: '3'

services:
  elasticsearch-node1:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.12.0
    container_name: es-node1
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
     - "9200:9200"
    tty: true
    network_mode: host
    volumes:
      - elasticsearch-node1:/usr/share/elasticsearch/data

  kibana-node1:
    image: docker.elastic.co/kibana/kibana:7.12.0
    container_name: kibana-node1
    environment:
       - SERVER_NAME=kibana
       - SERVER_HOST=192.168.11.230
       - ELASTICSEARCH_HOSTS=http://127.0.0.1:9200
    ports:
     - "5601:5601"
    tty: true
    network_mode: host
volumes:
    elasticsearch-node1:

aecio commented 3 years ago

Does it work on older versions, or did you test only with Elasticsearch 7.12.0? I haven't tested ACHE with recent versions of Elasticsearch, so this might be caused by changes in ES v7. If that is the case, some changes to the index mappings might be necessary to make it work. The Elasticsearch index mappings are currently hard-coded in this file: https://github.com/VIDA-NYU/ache/blob/dev/ache/src/main/java/achecrawler/target/repository/ElasticSearchRestTargetRepository.java.
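
For context, the error in the log ("Root mapping definition has unsupported parameters: [page : {properties=...}]") suggests that the create-index request wraps the field definitions in a mapping type named page. Elasticsearch 7 no longer accepts a type name in mappings by default and expects properties directly under mappings. The snippet below is only a sketch of that difference, not ACHE's actual code: the field list is abbreviated from the error message, and to_typeless is a hypothetical helper introduced for illustration.

```python
import json

# Typed create-index body in the Elasticsearch 6.x style. The "page" key is
# the mapping type name that Elasticsearch 7 rejects by default.
# Field list abbreviated from the error message in this issue.
typed_body = {
    "mappings": {
        "page": {
            "properties": {
                "url": {"type": "keyword", "index": True},
                "title": {"type": "text"},
                "relevance": {"type": "double"},
            }
        }
    }
}


def to_typeless(body):
    """Hypothetical helper: remove the single mapping-type level from a body."""
    mappings = body.get("mappings", {})
    if "properties" in mappings:
        return body  # already in the typeless form
    if len(mappings) != 1:
        raise ValueError("expected exactly one mapping type")
    # Unpack the only (type name, type mapping) pair and discard the name.
    (_type_name, type_mapping), = mappings.items()
    result = dict(body)
    result["mappings"] = type_mapping
    return result


typeless_body = to_typeless(typed_body)
print(json.dumps(typeless_body, indent=2))
```

The printed body has properties directly under mappings, which is the shape Elasticsearch 7 expects for PUT /tor. Whether the hard-coded mappings need other adjustments for 7.x is not something this sketch checks.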

chanwitkepha commented 3 years ago

Thank you for your help. I also tested with the older Elasticsearch and Kibana version 6.8.14, and it works fine. However, if possible, please fix this problem. Thank you very much.

aecio commented 3 years ago

Opened issue #206 to track Elasticsearch 7.x support.
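
As a rough illustration of what version-aware index creation could look like (this is an assumption, not the approach taken in #206): the log above shows that ACHE already detects the Elasticsearch major version, and Elasticsearch 7.x still accepts typed mappings when the deprecated include_type_name=true query parameter is passed on index creation. The function create_index_request below is hypothetical and only builds the request; it does not talk to a server.

```python
def create_index_request(index, typed_body, es_major_version):
    """Hypothetical helper: return (method, path, body) for creating an index
    from a typed (6.x-style) mapping body, depending on the server version."""
    if es_major_version <= 6:
        # 6.x accepts typed mappings without any extra parameter.
        return ("PUT", "/" + index, typed_body)
    if es_major_version == 7:
        # 7.x accepts typed mappings only with this deprecated switch.
        return ("PUT", "/" + index + "?include_type_name=true", typed_body)
    # Later versions dropped typed mappings, so the body must be rewritten.
    raise ValueError("typed mappings are not supported; use a typeless body")


# Abbreviated typed body; "page" is the type name seen in the error message.
body = {"mappings": {"page": {"properties": {"url": {"type": "keyword"}}}}}

method, path, _ = create_index_request("tor", body, 7)
print(method, path)
```

Because include_type_name is deprecated, rewriting the mappings into the typeless form is likely the more durable fix; the parameter is at best a stopgap for 7.x.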