dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0
1.36k stars 299 forks source link

FSCrawler hangs half way through indexing #598

Closed Muffinman closed 6 years ago

Muffinman commented 6 years ago

Version: 2.6-SNAPSHOT

It seems to index directories fine, however it seems to only index a few files and then hang.

I see the following in the output with --trace on (note I'm running this inside a docker container and as a systemd service):

sudo journalctl -u printpath | grep printpath-fscrawler-shared | grep "TikaDocParser"`
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,541 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [._Fiona-Cooke-Facebook-Amendments.pdf]
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,541 TRACE [f.p.e.c.f.t.TikaDocParser] indexed chars [10000]
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,541 TRACE [f.p.e.c.f.t.TikaDocParser] Generating hash with [MD5]
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,541 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,550 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,550 TRACE [f.p.e.c.f.t.TikaDocParser] Listing all available metadata:
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,550 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("X-Parsed-By", "org.apache.tika.parser.EmptyParser"));
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,550 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("resourceName", "._Fiona-Cooke-Facebook-Amendments.pdf"));
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,551 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Content-Type", "multipart/appledouble"));
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,551 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,561 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [Fiona-Cooke-Facebook-Amendments.pdf]
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,561 TRACE [f.p.e.c.f.t.TikaDocParser] indexed chars [10000]
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,561 TRACE [f.p.e.c.f.t.TikaDocParser] Generating hash with [MD5]
Sep 04 09:58:47 file-server service.sh[26394]: printpath-fscrawler-shared | 08:56:28,561 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction

That process never seems to finish and is not using any CPU time.

When I inspect the processes directly I see the following:

strace -ffp 32095    # strace the fscrawler master process and children
strace: Process 32095 attached with 23 threads
[pid 32198] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 32196] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 32197] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 32195] epoll_pwait(198,  <unfinished ...>
[pid 32194] epoll_pwait(195,  <unfinished ...>
[pid 32193] epoll_pwait(192,  <unfinished ...>
[pid 32192] epoll_pwait(189,  <unfinished ...>
[pid 32191] epoll_pwait(186,  <unfinished ...>
[pid 32190] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 32189] futex(0x7f2b1594b7a4, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 32188] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 32187] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 32186] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 32185] futex(0x7f2b3ac6c860, FUTEX_WAIT_PRIVATE, 4294967295, NULL <unfinished ...>
[pid 32184] futex(0x7f2b15e53284, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 32183] futex(0x7f2b15f54284, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 32182] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 32181] futex(0x7f2b2850d7e4, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 32180] futex(0x7f2b2860f7e4, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 32179] futex(0x7f2b287117e4, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 32178] futex(0x7f2b288137e4, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 32177] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 32095] futex(0x7f2b3b337b20, FUTEX_WAIT, 14, NULL <unfinished ...>
[pid 32190] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999658} <unfinished ...>
[pid 32177] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999714} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999725}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999694} <unfinished ...>
[pid 32195] <... epoll_pwait resumed> [], 8192, 141, NULL, 8) = 0
[pid 32195] epoll_pwait(198,  <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999722} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999692}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999723} <unfinished ...>
[pid 32194] <... epoll_pwait resumed> [], 8192, 205, NULL, 8) = 0
[pid 32194] epoll_pwait(195,  <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999360} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999371} <unfinished ...>
[pid 32192] <... epoll_pwait resumed> [], 8192, 284, NULL, 8) = 0
[pid 32191] <... epoll_pwait resumed> [], 8192, 284, NULL, 8) = 0
[pid 32192] epoll_pwait(189,  <unfinished ...>
[pid 32191] epoll_pwait(186,  <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999499} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999495} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999497}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999455} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999710} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999723}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999718} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999719} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999722}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999535} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999719} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999716} <unfinished ...>
[pid 32182] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 32182] futex(0x7f2b160566f4, FUTEX_WAIT_PRIVATE, 2, {0, 999999646} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999696} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999720} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999723} <unfinished ...>
[pid 32193] <... epoll_pwait resumed> [], 8192, 766, NULL, 8) = 0
[pid 32193] epoll_pwait(192,  <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999722} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999716} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999691} <unfinished ...>
[pid 32187] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 32188] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 32188] futex(0x7f2b15a4d6d4, FUTEX_WAIT_PRIVATE, 2, {4, 999999671} <unfinished ...>
[pid 32187] futex(0x7f2b15b4f6d4, FUTEX_WAIT_PRIVATE, 2, {4, 999999664} <unfinished ...>
[pid 32186] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 32186] futex(0x7f2b15c516d4, FUTEX_WAIT_PRIVATE, 2, {4, 999999662} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999716} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999714} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999697}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999646} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999595} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999598}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999723} <unfinished ...>
[pid 32195] <... epoll_pwait resumed> [], 8192, 1000, NULL, 8) = 0
[pid 32195] epoll_pwait(198,  <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999717} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999697} <unfinished ...>
[pid 32194] <... epoll_pwait resumed> [], 8192, 1000, NULL, 8) = 0
[pid 32194] epoll_pwait(195,  <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999725} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999715} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999723} <unfinished ...>
[pid 32192] <... epoll_pwait resumed> [], 8192, 1000, NULL, 8) = 0
[pid 32191] <... epoll_pwait resumed> [], 8192, 1000, NULL, 8) = 0
[pid 32192] epoll_pwait(189,  <unfinished ...>
[pid 32191] epoll_pwait(186,  <unfinished ...>
[pid 32197] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 32196] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 32197] futex(0x7f2b0f6aef64, FUTEX_WAIT_PRIVATE, 2, {4, 999992393} <unfinished ...>
[pid 32196] futex(0x7f2b0f7aff64, FUTEX_WAIT_PRIVATE, 2, {4, 999992238} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999703} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999692} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999688}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999725} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999696} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999702}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999727} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999694} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999698}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999722} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999701} <unfinished ...>
[pid 32182] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32182] futex(0x7f2b160566f4, FUTEX_WAIT_PRIVATE, 2, {0, 999999673} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999721}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999724} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999692} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999698} <unfinished ...>
[pid 32193] <... epoll_pwait resumed> [], 8192, 1000, NULL, 8) = 0
[pid 32193] epoll_pwait(192,  <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999727} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999714} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999717}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999715} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999714} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999719}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999716} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999715} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999719}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999654} <unfinished ...>
[pid 32195] <... epoll_pwait resumed> [], 8192, 1000, NULL, 8) = 0
[pid 32195] epoll_pwait(198,  <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999616} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999609} <unfinished ...>
[pid 32194] <... epoll_pwait resumed> [], 8192, 1000, NULL, 8) = 0
[pid 32194] epoll_pwait(195,  <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999719} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999719} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999717} <unfinished ...>
[pid 32192] <... epoll_pwait resumed> [], 8192, 1000, NULL, 8) = 0
[pid 32191] <... epoll_pwait resumed> [], 8192, 1000, NULL, 8) = 0
[pid 32192] epoll_pwait(189,  <unfinished ...>
[pid 32191] epoll_pwait(186,  <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999682} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999688} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999683}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999721} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999713} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999716}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999718} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999717} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999717}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999721} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999723} <unfinished ...>
[pid 32182] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32182] futex(0x7f2b160566f4, FUTEX_WAIT_PRIVATE, 2, {0, 999999674} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999718}) = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999717} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999717} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999716} <unfinished ...>
[pid 32193] <... epoll_pwait resumed> [], 8192, 1000, NULL, 8) = 0
[pid 32193] epoll_pwait(192,  <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32190] futex(0x7f2b1584a834, FUTEX_WAIT_PRIVATE, 2, {0, 49999714} <unfinished ...>
[pid 32177] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 32177] futex(0x7f2b3b337274, FUTEX_WAIT_PRIVATE, 2, {0, 99999536} <unfinished ...>
[pid 32190] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)

Seems like the problem may be with Tika never returning it's meta data, but it doesn't seem to be logging or printing anything. Any ideas?

dadoonet commented 6 years ago

Could you share the file named file-server service.sh?

Muffinman commented 6 years ago

Oh that's just our service start/stop script.

#!/bin/bash

# This is designed called from Systemd service on system bootup and shutdown. It is not suitable for development use.
# Don't forget to setup log rotation for /var/log/printpath*.log

start() {
  exec docker-compose -f /home/orphans/printpath/docker-compose.yml up
}

stop() {
  exec docker-compose -f /home/orphans/printpath/docker-compose.yml down
}

case $1 in
  start|stop) "$1" ;;
esac
Muffinman commented 6 years ago

And the docker-compose.yml file for reference:

#############
#
# For use in all environments
#
# Common components (NGINX, PHP and MySQL) are presumed to be installed
# on the host machine so this only takes care of the niche requirements.
#
#############
version: "3.1"
services:
  redis:
    image: redis:alpine
    container_name: printpath-redis
    ports:
      - "14235:6379"
    volumes:
      - "./docker/storage/redis:/data"

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.4.0
    container_name: printpath-elasticsearch
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - http.cors.enabled=true
      - http.cors.allow-origin=*
      - http.cors.allow-methods=OPTIONS, HEAD, GET, POST, PUT, DELETE
      - http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type, Content-Length, Authorization
      - network.host=0.0.0.0
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - ./docker/storage/elasticsearch:/usr/share/elasticsearch/data:cached
    ports:
      - "14236:9200"

  fscrawler_shared:
    build:
      context: ${PWD}/docker/build/fscrawler
    container_name: printpath-fscrawler-shared
    volumes:
      - ${SHARED_PATH}:/usr/share/fscrawler/data:ro
      - ./docker/storage/fscrawler/shared/:/usr/share/fscrawler/config/shared:cached
    depends_on:
      - elasticsearch
    environment:
      - WAIT_COMMAND=[ $$(curl --write-out %{http_code} --silent --output /dev/null http://elasticsearch:9200/_cat/health?h=st) = 200 ]
      - WAIT_START_CMD=fscrawler --trace --config_dir /usr/share/fscrawler/config shared
      - WAIT_SLEEP=2
      - WAIT_LOOPS=10
    command: bash wait_to_start.sh

  fscrawler_archive:
    build:
      context: ${PWD}/docker/build/fscrawler
    container_name: printpath-fscrawler-archive
    volumes:
      - ${ARCHIVE_PATH}:/usr/share/fscrawler/data:ro
      - ./docker/storage/fscrawler/archive:/usr/share/fscrawler/config/archive:cached
    depends_on:
      - elasticsearch
    environment:
      - WAIT_COMMAND=[ $$(curl --write-out %{http_code} --silent --output /dev/null http://elasticsearch:9200/_cat/health?h=st) = 200 ]
      - WAIT_START_CMD=fscrawler --trace --config_dir /usr/share/fscrawler/config archive
      - WAIT_SLEEP=2
      - WAIT_LOOPS=10
    command: bash wait_to_start.sh

  fscrawler_library:
    build:
      context: ${PWD}/docker/build/fscrawler
    container_name: printpath-fscrawler-library
    volumes:
      - ${LIBRARY_PATH}:/usr/share/fscrawler/data:ro
      - ./docker/storage/fscrawler/library:/usr/share/fscrawler/config/library:cached
    depends_on:
      - elasticsearch
    environment:
      - WAIT_COMMAND=[ $$(curl --write-out %{http_code} --silent --output /dev/null http://elasticsearch:9200/_cat/health?h=st) = 200 ]
      - WAIT_START_CMD=fscrawler --trace --config_dir /usr/share/fscrawler/config library
      - WAIT_SLEEP=2
      - WAIT_LOOPS=10
    command: bash wait_to_start.sh

The wait_to_start.sh script just checks the output of the curl request, waiting for ES to be ready, before starting FSCrawler.

dadoonet commented 6 years ago

Sorry I meant Fiona-Cooke-Facebook-Amendments.pdf.

Muffinman commented 6 years ago

Sent via email, thanks!

dadoonet commented 6 years ago

Hi @Muffinman

Sorry for the delay.

I just tried to parse your file with Tika and this worked well although it took several seconds, may be 20s, to parse it. I just tried with a Unit Test though. I need to try in a IT context. Stay tuned.

dadoonet commented 6 years ago

I tried in the context of an integration test and everything worked well. I have no idea of what is happening. Wondering if this could be a memory issue. Could you try with only this file?

Muffinman commented 6 years ago

I have a feeling it's something to do with running it in a docker container, but I'm not sure exactly why that's an issue.

Will keep looking into it and report back if I find the cause. Thanks for taking a look.

EDIT: My docker stats, look like memory isn't an issue?

CONTAINER ID        NAME                          CPU %               MEM USAGE / LIMIT    MEM %               NET I/O             BLOCK I/O           PIDS
12a1172e28ff        printpath-fscrawler-shared    0.07%               1.381GiB / 31.3GiB   4.41%               5.1MB / 6.19MB      195MB / 9.42MB      24
19d272036ab8        printpath-fscrawler-library   0.07%               233.8MiB / 31.3GiB   0.73%               13.1MB / 16.2MB     3.06GB / 4.1kB      23
435f74e186c7        printpath-fscrawler-archive   0.07%               1.735GiB / 31.3GiB   5.54%               79.8MB / 96.6MB     12.8GB / 13.1MB     25
512fd36cd96a        printpath-elasticsearch       2.44%               915.9MiB / 31.3GiB   2.86%               119MB / 101MB       153MB / 71.6MB      65
Muffinman commented 6 years ago

Actually one thing occurred to me which may be an issue.

The directory indexing seems to be fast and completes within a few minutes and then shows this:

08:16:01,133 DEBUG [f.p.e.c.f.FsParser] Fs crawler is going to sleep for 1d
08:16:01,570 TRACE [f.p.e.c.f.c.ElasticsearchClientManager] Sending a bulk request of [79] requests
08:16:01,605 TRACE [f.p.e.c.f.c.ElasticsearchClientManager] Executed bulk request with [79] requests

Could it be the folder crawling is finishing and putting the FSCrawler processes to sleep before file crawling has finished? I'm not sure how the internals work.

dadoonet commented 6 years ago

look like memory isn't an issue?

Yeah right.

Executed bulk request with [79] requests

Means that all documents have been sent to elasticsearch and have been accepted.

133 DEBUG [f.p.e.c.f.FsParser] Fs crawler is going to sleep for 1d

Could it be the folder crawling is finishing and putting the FSCrawler processes to sleep before file crawling has finished?

This means that FSCrawler has parsed everything needed and is going to sleep the Crawling thread. But it does no stop the bulk processor thread which is effectively indexing the pending documents.

Everything looks good here IMO.

dadoonet commented 6 years ago

Not sure what else I can add for now on this. I'm going to close as everything looks good but feel free to reopen and/or add more comments.