VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.
http://ache.readthedocs.io
Apache License 2.0

Add support for Elasticsearch 7.x and 8.x indexing #282

Closed JuliusHenke closed 2 years ago

JuliusHenke commented 2 years ago

Hi, thanks for this cool project. I have added support for indexing pages in Elasticsearch 7.x and 8.x. I did not update the ACHE frontend search functionality, which likely also needs changes for full compatibility. Still, I think the indexing part is already useful on its own, since Elasticsearch versions >= 7.8.0 also ship Docker images for the linux/arm64/v8 platform (crucial for M1 MacBook users).

I have tested indexing with Elasticsearch version 7.17.0 and 8.2.2.

I used the following docker-compose.yml. Note that the xpack.security settings are important: they disable security for local development.

version: '2'
services:
  ache:
    image: vidanyu/ache
    command: startCrawl -c /config/ -s /config/docker.seeds -o /data -e crawl-data
    ports:
    - "8080:8080"
    volumes:
    - ./data-ache/:/data
    - ./:/config
    links:
    - elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.2.2
    environment:
      - discovery.type=single-node
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - xpack.security.enabled=false
      - xpack.security.transport.ssl.enabled=false
      - xpack.security.http.ssl.enabled=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ./data-es/:/usr/share/elasticsearch/data # elasticsearch data will be stored at ./data-es/
    ports:
      - "9200:9200"

Until this pull request is accepted, you may need to build your own image from my changes and reference it in the docker-compose.yml instead of vidanyu/ache.
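
To check that pages are actually reaching the index with a setup like the one above, something along these lines could be used. This is only a sketch, not part of the PR: it assumes Elasticsearch is reachable at http://localhost:9200 (the port mapping in the compose file) and that the `-e crawl-data` argument names the target index. The helper names (`count_url`, `parse_count`, `indexed_pages`) are hypothetical. It relies on the Elasticsearch `_count` API.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: port mapping from the compose file
INDEX = "crawl-data"              # assumption: index named by the -e argument


def count_url(base_url: str, index: str) -> str:
    """Build the URL of the Elasticsearch _count API for one index."""
    return f"{base_url.rstrip('/')}/{index}/_count"


def parse_count(body: str) -> int:
    """Extract the document count from a _count API JSON response."""
    return int(json.loads(body)["count"])


def indexed_pages(base_url: str = ES_URL, index: str = INDEX) -> int:
    """Ask Elasticsearch how many documents the index currently holds."""
    with urllib.request.urlopen(count_url(base_url, index), timeout=10) as resp:
        return parse_count(resp.read().decode("utf-8"))
```

Calling `indexed_pages()` a few times while the crawl is running should show the count growing if indexing works.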

I am open to changes, but I am not currently planning to do the frontend part as well. Cheers

aecio commented 2 years ago

Thanks for the contribution. Can you please rebase on the master branch and squash the commits? For some reason, GitHub is not showing the diff correctly.

aecio commented 2 years ago

Just noticed this is due to commit c97fde2, which changes the line separators. Do you happen to know what the line separators were before and after the commit?

JuliusHenke commented 2 years ago

Before the commit they were CRLF and after it LF. I changed it, since most of the other code I looked at is formatted with LF (Unix and macOS) and my IDE showed a warning. Should I revert the c97fde2 commit or do the squashing?
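
For anyone who wants to check which separators a file uses, a small sketch like the following could help. It is not taken from the PR; the function names are hypothetical, and it simply counts byte sequences in data that has already been read in binary mode.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR line separators in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    cr = data.count(b"\r") - crlf   # CR not followed by LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}


def to_lf(data: bytes) -> bytes:
    """Convert CRLF separators to LF, leaving everything else untouched."""
    return data.replace(b"\r\n", b"\n")
```

For example, `count_line_endings(open(path, "rb").read())` on a file before and after a commit would show whether the commit switched it from CRLF to LF.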

aecio commented 2 years ago

No need to change anything. GitHub can squash all commits during the merge. Thanks for the PR!