VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.
http://ache.readthedocs.io
Apache License 2.0

Question about custom crawls on startServer functionality #178

Open pkoloveas opened 5 years ago

pkoloveas commented 5 years ago

Is there a way to define specifically configured crawls while running ACHE in server mode? I am using a docker-compose file (below) structured like the one I use with the startCrawl command, but when I look into the data-ache folder, the config folder that has been created contains only the default ache.yml and no link-filters.yml file.

version: '2'
services:
  ache:
    image: vidanyu/ache
    entrypoint: sh -c 'sleep 10 && /ache/bin/ache startServer -d /data' -c /config/ 
    ports:
      - "8080:8080"
    volumes:
      - ./data-ache/:/data
      - ./:/config
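
Note: in the entrypoint above, the trailing -c /config/ falls outside the single-quoted sh -c string, so sh receives it as positional parameters and ache never sees the config path. Assuming startServer accepts the -c config-path option the command intends, a corrected entrypoint keeps everything inside the quotes:

entrypoint: sh -c 'sleep 10 && /ache/bin/ache startServer -d /data -c /config/'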

Also, after the server is up and I can create new crawls from the interface, how can I specify that, for example, a new focused crawl uses a different ache.yml file than the deep crawl from the initial configuration, without restarting the Docker container/server?

aecio commented 5 years ago

This is not supported yet. Currently, server mode can start only two crawl types (deep crawl and focused crawl), both of which use the ache.yml file configured during start-up, with a few minimal settings (those required for deep and focused crawls) overridden. The overridden settings come from these config files:

There is an ongoing pull request to support link filters in server mode via the REST API (#175). Supporting a custom ache.yml could be done in a similar way.
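
For illustration, a custom ache.yml for a focused crawl might override settings along these lines. The key names below are assumptions based on ACHE's sample configuration and may differ across versions, so verify them against the repository before use:

# Hypothetical focused-crawl overrides; check key names against the
# sample ache.yml shipped with your ACHE version.
target_storage.use_classifier: true
link_storage.online_learning.enabled: true
link_storage.online_learning.type: FORWARD_CLASSIFIER_BINARY
link_storage.scheduler.host_min_access_interval: 5000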

pkoloveas commented 5 years ago

I assume that if I build from source instead of using Docker, I can change the two files you referenced to be my default configs for deep and focused crawls, correct?

Also, is there a way to run more than one startCrawl command but have them on the same port, so I can monitor them on the same dashboard? (Right now I get an error from Docker that the port is busy.)

aecio commented 5 years ago

Correct, if you change these files and rebuild, it should work (you can rebuild the Docker image as well). Regarding multiple startCrawls on the same port/process: that is not possible at this time.
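
For reference, a minimal build-from-source sketch (the Gradle task is the one used in the ACHE README; the Docker rebuild assumes you run it from the repository root):

git clone https://github.com/ViDA-NYU/ache.git
cd ache
# edit the default config files referenced above, then:
./gradlew installDist          # binaries land under build/install/ache
# or, to rebuild the Docker image instead:
docker build -t ache .

As a workaround for the busy-port error, you can run each crawl in its own container mapped to a different host port. This gives one dashboard per crawl rather than a single shared dashboard; the service names, paths, seeds file, and the assumption that startCrawl serves its dashboard on container port 8080 are illustrative:

version: '2'
services:
  ache-deep:
    image: vidanyu/ache
    entrypoint: sh -c '/ache/bin/ache startCrawl -c /config-deep/ -s /config-deep/seeds.txt -o /data'
    ports:
      - "8080:8080"
    volumes:
      - ./data-deep/:/data
      - ./config-deep/:/config-deep
  ache-focused:
    image: vidanyu/ache
    entrypoint: sh -c '/ache/bin/ache startCrawl -c /config-focused/ -s /config-focused/seeds.txt -o /data'
    ports:
      - "8081:8080"   # different host port avoids the conflict
    volumes:
      - ./data-focused/:/data
      - ./config-focused/:/config-focused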