dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

documents.log is empty, but documents are getting sent to my index #1667

Open UltraSalem opened 1 year ago

UltraSalem commented 1 year ago

Describe the bug

Running via docker-compose, with the logging directory set in the docker-compose file. fscrawler.log gets populated and rotated, but documents.log in that same folder does not.

Job Settings

$ cat config/whitedwarfscryer/_settings.yaml


---
name: "whitedwarfscryer"
fs:
  indexed_chars: -1
  continue_on_error: true
  add_filesize: true
  store_source: false
  index_content: true
  filename_as_id: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "http://elasticsearch:9200"
  username: "[redacted]"
  password: "[redacted]"
  ssl_verification: false
  bulk_size: 200
  flush_interval: "5s"
  byte_size: "25mb"

$ cat docker-compose.yml

version: '3'
services:
  fscrawler:
    image: dadoonet/fscrawler
    container_name: fscrawler
    volumes:
      - "/zdata/zsalem/Downloads/death stuffs/Games/WhiteDwarfs/first200:/tmp/es:ro"
      - ${PWD}/config:/root/.fscrawler
      - ${PWD}/logs:/usr/share/fscrawler/logs
    command: fscrawler whitedwarfscryer --loop 1 
    networks:
      - es_network
networks:
  es_network:
    external:
      name: es_network

Logs

$ docker logs fscrawler

SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/share/fscrawler/lib/log4j-slf4j-impl-2.20.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.

$ cat fscrawler.log

03:57:45,482 INFO  [f.console] ,----------------------------------------------------------------------------------------------------.
|       ,---,.  .--.--.     ,----..                                     ,--,           2.10-SNAPSHOT |
|     ,'  .' | /  /    '.  /   /   \                                  ,--.'|                         |
|   ,---.'   ||  :  /`. / |   :     :  __  ,-.                   .---.|  | :               __  ,-.   |
|   |   |   .';  |  |--`  .   |  ;. /,' ,'/ /|                  /. ./|:  : '             ,' ,'/ /|   |
|   :   :  :  |  :  ;_    .   ; /--` '  | |' | ,--.--.       .-'-. ' ||  ' |      ,---.  '  | |' |   |
|   :   |  |-, \  \    `. ;   | ;    |  |   ,'/       \     /___/ \: |'  | |     /     \ |  |   ,'   |
|   |   :  ;/|  `----.   \|   : |    '  :  / .--.  .-. | .-'.. '   ' .|  | :    /    /  |'  :  /     |
|   |   |   .'  __ \  \  |.   | '___ |  | '   \__\/: . ./___/ \:     ''  : |__ .    ' / ||  | '      |
|   '   :  '   /  /`--'  /'   ; : .'|;  : |   ," .--.; |.   \  ' .\   |  | '.'|'   ;   /|;  : |      |
|   |   |  |  '--'.     / '   | '/  :|  , ;  /  /  ,.  | \   \   ' \ |;  :    ;'   |  / ||  , ;      |
|   |   :  \    `--'---'  |   :    /  ---'  ;  :   .'   \ \   \  |--" |  ,   / |   :    | ---'       |
|   |   | ,'               \   \ .'         |  ,     .-./  \   \ |     ---`-'   \   \  /             |
|   `----'                  `---`            `--`---'       '---"                `----'              |
+----------------------------------------------------------------------------------------------------+
|                                        You know, for Files!                                        |
|                                     Made from France with Love                                     |
|                           Source: https://github.com/dadoonet/fscrawler/                           |
|                          Documentation: https://fscrawler.readthedocs.io/                          |
`----------------------------------------------------------------------------------------------------'

03:57:45,497 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [500.2mb/7.8gb=6.25%], RAM [5.1gb/31.2gb=16.49%], Swap [7.9gb/7.9gb=100.0%].
03:57:45,723 WARN  [f.p.e.c.f.c.FsCrawlerCli] `url` is not set. Please define it. Falling back to default: [/tmp/es].
03:57:45,731 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
03:57:45,809 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
03:57:46,143 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.7.1
03:57:46,146 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
03:57:46,178 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.7.1
03:57:46,198 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [whitedwarfscryer] for [/tmp/es] every [15m]
03:57:58,595 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.

Expected behavior

I am expecting that, as fscrawler runs inside this container, documents.log would be populated. It seems like it got created the first time I ran this container a week ago, but it has never had any info in it, despite my index being populated successfully and fscrawler.log getting populated and rotated. Assuming the documents are being scanned in alphabetical order (I could not find any info in the docs, but Bard said it was alphabetical first...?), the first file scanned should be White Dwarf Magazine Issue 001 - Jun 1977 (UK)-001.pdf. Not all documents are going into my index, so I suspect some are erroring out, but I can't see which ones they are, and I don't want to manually check 13,000 documents. Hence looking for documents.log info.

-r--r--r-- 1 root  root     0 May 29 23:44 documents.log
-rw-r--r-- 1 root  root  1.1K May 30 00:21 fscrawler-2023-05-29-1.log.gz
-rw-r--r-- 1 root  root  1.1K May 30 00:31 fscrawler-2023-05-29-2.log.gz
-rw-r--r-- 1 root  root  1.1K May 30 00:32 fscrawler-2023-05-29-3.log.gz
-rw-r--r-- 1 root  root  1.1K May 30 00:39 fscrawler-2023-05-29-4.log.gz
-rw-r--r-- 1 root  root  1.2K May 30 01:56 fscrawler-2023-05-29-5.log.gz
-rw-r--r-- 1 root  root  1.1K May 30 02:05 fscrawler-2023-05-29-6.log.gz
-rw-r--r-- 1 root  root  1.1K May 31 03:06 fscrawler-2023-05-29-7.log.gz
-rw-r--r-- 1 root  root   154 Jun  3 14:38 fscrawler-2023-05-30-1.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  3 14:38 fscrawler-2023-06-03-1.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  3 14:39 fscrawler-2023-06-03-2.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  3 17:45 fscrawler-2023-06-03-3.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  3 18:01 fscrawler-2023-06-03-4.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  4 00:53 fscrawler-2023-06-03-5.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  4 13:25 fscrawler-2023-06-03-6.log.gz
-rw-r--r-- 1 root  root   829 Jun  4 13:47 fscrawler-2023-06-04-1.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  4 13:57 fscrawler-2023-06-04-2.log.gz
-rw-r--r-- 1 root  root  3.2K Jun  4 13:57 fscrawler.log
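
In the meantime, a rough way to find the failures without checking each file by hand might be to diff the filenames on disk against the _id values in the index. This is only a sketch: it assumes filename_as_id: true makes the filename the document _id, that the index is named after the job (whitedwarfscryer, the default), and that a plain search is enough (a single search caps out at 10,000 hits, so with 13,000+ documents you'd really need search_after or a scroll):

$ # list the ids Elasticsearch actually has (placeholder credentials)
$ curl -s -u user:pass "http://elasticsearch:9200/whitedwarfscryer/_search?size=10000&_source=false" \
    | jq -r '.hits.hits[]._id' | sort > indexed.txt
$ # list the files fscrawler should have seen
$ ls "/zdata/zsalem/Downloads/death stuffs/Games/WhiteDwarfs/first200" | sort > ondisk.txt
$ # files present on disk but missing from the index
$ comm -23 ondisk.txt indexed.txt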

Versions:

Attachment

Attempting to attach the document that should have been scanned first, but does not appear in my index.

UltraSalem commented 1 year ago

OK, that attachment got inserted somewhere other than where I was expecting, sorry! But it's still kind of a relevant spot at least :)

UltraSalem commented 1 year ago

OK, the document White.Dwarf.Magazine.Issue.001.-.Jun.1977.UK.-001.pdf is now in my index, now that the job has completed (13,425 documents). Which is weird, as it has the oldest created date and the first name in alphabetical order of all the documents, so it should have been the first document in there, not indexed after some 7,000 other documents.

documents.log is still empty.

$ cat logs/fscrawler.log

05:58:06,284 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
05:58:06,415 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [whitedwarfscryer] stopped
05:58:06,418 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [whitedwarfscryer] stopped
dadoonet commented 1 year ago

So what would be a good order, in your opinion?

For some use cases, I have the feeling that the most recent documents are more relevant than the oldest ones. What do you think?

UltraSalem commented 1 year ago

I think oldest file first, by the date/time it arrives in the scanned folder (last modified date, maybe?). Users will expect first in, first out for the index when they're using it. So if a set of files gets written into the monitored folders over the day, the user would expect to see the first ones that went in appear in the index first.

Those are my thoughts, anyway! I don't really mind, as long as I can find it documented somewhere. I can mess around with data prep to get the order I need if I have a particular requirement, as long as I know what I'm aiming for.

ScottCov commented 1 year ago

I am experiencing the same issue with documents.log using Docker, although in my case documents.log does record errors; it just isn't recording the documents indexed:

2023-08-02 08:38:31,003 [ERROR] [603.pdf][/23-90020/603.pdf] Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout
2023-08-02 16:22:08,837 [ERROR] [859-9.pdf][/23-90020/859-9.pdf] Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout
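
If I'm reading the fscrawler logging docs right, the bundled log4j2.xml exposes LOG_LEVEL, DOC_LEVEL and LOG_DIR properties, with DOC_LEVEL defaulting to info; if the per-document "indexed" lines are emitted at debug, that would explain why errors show up in documents.log while successfully indexed documents don't. A sketch of raising it for the Docker setup above, assuming FS_JAVA_OPTS is passed through to the JVM as the docs describe:

services:
  fscrawler:
    image: dadoonet/fscrawler
    environment:
      # hypothetical: raise the documents logger so indexed files are logged too
      - FS_JAVA_OPTS=-DDOC_LEVEL=debug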