Could you share the file named file-server service.sh?
Oh that's just our service start/stop script.
#!/bin/bash
#
# This is designed to be called from a systemd service on system bootup and shutdown.
# It is not suitable for development use.
#
# Don't forget to set up log rotation for /var/log/printpath*.log

start() {
    exec docker-compose -f /home/orphans/printpath/docker-compose.yml up
}

stop() {
    exec docker-compose -f /home/orphans/printpath/docker-compose.yml down
}

case "$1" in
    start|stop) "$1" ;;
esac
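The systemd side of it is just a thin unit that calls this script on start and stop; roughly something like the sketch below (the unit name printpath.service and the script path/filename are placeholders, not the exact names in use):

# Sketch only -- printpath.service and the script path are placeholders
sudo tee /etc/systemd/system/printpath.service >/dev/null <<'EOF'
[Unit]
Description=Printpath docker-compose stack
Requires=docker.service
After=docker.service

[Service]
Type=simple
ExecStart=/home/orphans/printpath/file-server-service.sh start
ExecStop=/home/orphans/printpath/file-server-service.sh stop
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now printpath.service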
And the docker-compose.yml file for reference:
#############
#
# For use in all environments
#
# Common components (NGINX, PHP and MySQL) are presumed to be installed
# on the host machine so this only takes care of the niche requirements.
#
#############

version: "3.1"

services:

  redis:
    image: redis:alpine
    container_name: printpath-redis
    ports:
      - "14235:6379"
    volumes:
      - "./docker/storage/redis:/data"

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.4.0
    container_name: printpath-elasticsearch
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - http.cors.enabled=true
      - http.cors.allow-origin=*
      - http.cors.allow-methods=OPTIONS, HEAD, GET, POST, PUT, DELETE
      - http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type, Content-Length, Authorization
      - network.host=0.0.0.0
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - ./docker/storage/elasticsearch:/usr/share/elasticsearch/data:cached
    ports:
      - "14236:9200"

  fscrawler_shared:
    build:
      context: ${PWD}/docker/build/fscrawler
    container_name: printpath-fscrawler-shared
    volumes:
      - ${SHARED_PATH}:/usr/share/fscrawler/data:ro
      - ./docker/storage/fscrawler/shared/:/usr/share/fscrawler/config/shared:cached
    depends_on:
      - elasticsearch
    environment:
      - WAIT_COMMAND=[ $$(curl --write-out %{http_code} --silent --output /dev/null http://elasticsearch:9200/_cat/health?h=st) = 200 ]
      - WAIT_START_CMD=fscrawler --trace --config_dir /usr/share/fscrawler/config shared
      - WAIT_SLEEP=2
      - WAIT_LOOPS=10
    command: bash wait_to_start.sh

  fscrawler_archive:
    build:
      context: ${PWD}/docker/build/fscrawler
    container_name: printpath-fscrawler-archive
    volumes:
      - ${ARCHIVE_PATH}:/usr/share/fscrawler/data:ro
      - ./docker/storage/fscrawler/archive:/usr/share/fscrawler/config/archive:cached
    depends_on:
      - elasticsearch
    environment:
      - WAIT_COMMAND=[ $$(curl --write-out %{http_code} --silent --output /dev/null http://elasticsearch:9200/_cat/health?h=st) = 200 ]
      - WAIT_START_CMD=fscrawler --trace --config_dir /usr/share/fscrawler/config archive
      - WAIT_SLEEP=2
      - WAIT_LOOPS=10
    command: bash wait_to_start.sh

  fscrawler_library:
    build:
      context: ${PWD}/docker/build/fscrawler
    container_name: printpath-fscrawler-library
    volumes:
      - ${LIBRARY_PATH}:/usr/share/fscrawler/data:ro
      - ./docker/storage/fscrawler/library:/usr/share/fscrawler/config/library:cached
    depends_on:
      - elasticsearch
    environment:
      - WAIT_COMMAND=[ $$(curl --write-out %{http_code} --silent --output /dev/null http://elasticsearch:9200/_cat/health?h=st) = 200 ]
      - WAIT_START_CMD=fscrawler --trace --config_dir /usr/share/fscrawler/config library
      - WAIT_SLEEP=2
      - WAIT_LOOPS=10
    command: bash wait_to_start.sh
The wait_to_start.sh script just checks the output of the curl request, waiting for ES to be ready, before starting FSCrawler.
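In essence it's just a poll-then-exec loop driven by the WAIT_* variables set in the compose file; a simplified sketch (not the exact file) looks like this:

#!/bin/bash
# Poll WAIT_COMMAND until it succeeds, then hand over to WAIT_START_CMD.
# WAIT_SLEEP is the pause between attempts, WAIT_LOOPS the maximum number of attempts.
i=0
until eval "$WAIT_COMMAND"; do
    i=$((i + 1))
    if [ "$i" -ge "${WAIT_LOOPS:-10}" ]; then
        echo "Timed out waiting for Elasticsearch" >&2
        exit 1
    fi
    sleep "${WAIT_SLEEP:-2}"
done

# Unquoted on purpose so the command string is split into words
exec $WAIT_START_CMD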
Sorry I meant Fiona-Cooke-Facebook-Amendments.pdf.
Sent via email, thanks!
Hi @Muffinman
Sorry for the delay.
I just tried to parse your file with Tika and this worked well, although it took several seconds, maybe 20s, to parse it. I only tried with a unit test though. I need to try it in an integration test context. Stay tuned.
I tried in the context of an integration test and everything worked well. I have no idea what is happening. I'm wondering if this could be a memory issue. Could you try with only this file?
I have a feeling it's something to do with running it in a docker container, but I'm not sure exactly why that's an issue.
Will keep looking into it and report back if I find the cause. Thanks for taking a look.
EDIT: My docker stats below; it looks like memory isn't an issue?
CONTAINER ID   NAME                          CPU %   MEM USAGE / LIMIT     MEM %   NET I/O           BLOCK I/O         PIDS
12a1172e28ff   printpath-fscrawler-shared    0.07%   1.381GiB / 31.3GiB    4.41%   5.1MB / 6.19MB    195MB / 9.42MB    24
19d272036ab8   printpath-fscrawler-library   0.07%   233.8MiB / 31.3GiB    0.73%   13.1MB / 16.2MB   3.06GB / 4.1kB    23
435f74e186c7   printpath-fscrawler-archive   0.07%   1.735GiB / 31.3GiB    5.54%   79.8MB / 96.6MB   12.8GB / 13.1MB   25
512fd36cd96a   printpath-elasticsearch       2.44%   915.9MiB / 31.3GiB    2.86%   119MB / 101MB     153MB / 71.6MB    65
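For anyone checking the same thing, Elasticsearch health and heap usage can also be queried from the host through the mapped port (14236 in the compose file above):

# Cluster health (same endpoint the wait_to_start.sh check polls)
curl "http://localhost:14236/_cat/health?v"

# Per-node heap and CPU usage
curl "http://localhost:14236/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu"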
Actually one thing occurred to me which may be an issue.
The directory indexing seems to be fast and completes within a few minutes and then shows this:
08:16:01,133 DEBUG [f.p.e.c.f.FsParser] Fs crawler is going to sleep for 1d
08:16:01,570 TRACE [f.p.e.c.f.c.ElasticsearchClientManager] Sending a bulk request of [79] requests
08:16:01,605 TRACE [f.p.e.c.f.c.ElasticsearchClientManager] Executed bulk request with [79] requests
Could it be the folder crawling is finishing and putting the FSCrawler processes to sleep before file crawling has finished? I'm not sure how the internals work.
look like memory isn't an issue?
Yeah right.
Executed bulk request with [79] requests
This means that all documents have been sent to Elasticsearch and have been accepted.
08:16:01,133 DEBUG [f.p.e.c.f.FsParser] Fs crawler is going to sleep for 1d
Could it be the folder crawling is finishing and putting the FSCrawler processes to sleep before file crawling has finished?
This means that FSCrawler has parsed everything needed and is going to put the crawling thread to sleep. But it does not stop the bulk processor thread, which is effectively indexing the pending documents.
Everything looks good here IMO.
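If you want to double-check that on your side, you can ask Elasticsearch directly for the document counts (by default FSCrawler indexes into an index named after the job, so shared, archive and library here):

# List all indices with their document counts
curl "http://localhost:14236/_cat/indices?v"

# Count documents in one of the FSCrawler indices
curl "http://localhost:14236/shared/_count?pretty"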
Not sure what else I can add for now on this. I'm going to close as everything looks good but feel free to reopen and/or add more comments.
Version: 2.6-SNAPSHOT
It seems to index directories fine; however, it only indexes a few files and then hangs.
I see the following in the output with --trace on (note I'm running this inside a docker container and as a systemd service):

That process never seems to finish and is not using any CPU time.
When I inspect the processes directly I see the following:
Seems like the problem may be with Tika never returning its metadata, but it doesn't seem to be logging or printing anything. Any ideas?
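If it helps, I can also try to grab a thread dump of the FSCrawler JVM inside the container to see where the parsing thread is stuck; something like the following, assuming pidof is available in the image (the dump is written to the JVM's stdout, i.e. the container logs):

# Send SIGQUIT to the Java process; the JVM prints a full thread dump to its stdout
docker exec printpath-fscrawler-shared sh -c 'kill -3 "$(pidof java)"'

# The dump then shows up in the container logs
docker logs --tail 300 printpath-fscrawler-shared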