Open vid-bin opened 1 year ago
No idea. Could you share the file you are trying to index?
Yes, attached.
After looking into this some more it does appear to index the file but it doesn't process OCR and the file isn't searchable.
Ok. So could you run fscrawler with the --debug --restart options and share the full logs here?
Please have only one file in the directory to avoid too many logs ;)
I will do this when able, probably won’t be until tomorrow.
From some research it looks like tesseract doesn't support heic. Is it possible to code fscrawler to generate a temporary jpeg file of the image so tesseract can run OCR on it, and then remove the temporary file?
FSCrawler is "just" using the OCR provided by Tika. So maybe you should open an issue in the Tika issue tracker for this?
Okay that probably isn’t the issue then. Please give me some time to get home and run the debug command before closing the issue.
jpeg images process correctly and can be searched, so it seems to be an issue specific to heic.
I think you are right with this:
From some research it looks like tesseract doesnt support heic. Is it possible to code fscrawler to generate a temporary jpeg file of the image so tesseract can run ocr on it and then remove the temporary file?
But I think the best place to support such a thing is in Tika...
My 2 cents
It is indexing the characters but it seems to be not saving them or something to Elasticsearch? I'm not sure. The content doesn't show up when querying the Elasticsearch backend directly or with samba+elasticsearch. However, when using a jpeg file it works as intended.
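One way to check this directly against Elasticsearch is to search for heic documents that have no extracted text at all. The sketch below only builds the search body; it assumes the extracted text lives in a field named `content` and that `file.extension` is mapped so a `term` query matches it. Both are assumptions about the index mapping, so adjust the names if your mapping differs.

```python
import json


def missing_text_query(extension="heic", content_field="content"):
    """Build a search body for documents of one file type with no extracted text.

    `content_field` is an assumed field name for the extracted/OCR text.
    """
    return {
        "query": {
            "bool": {
                # Only look at documents with the given file extension.
                "filter": [{"term": {"file.extension": extension}}],
                # Keep only those where the text field is absent.
                "must_not": [{"exists": {"field": content_field}}],
            }
        },
        # Return just enough to identify the files.
        "_source": ["file.filename", "file.content_type"],
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into a _search request.
    print(json.dumps(missing_text_query(), indent=2))
```

If this returns the heic files while the same query with `extension="jpeg"` returns nothing, the text is missing before indexing rather than being lost in Elasticsearch.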
16:17:39,184 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [1003.9mb/15.6gb=6.25%], RAM [11.2gb/62.7gb=18.0%], Swap [0b/0b=0.0].
16:17:39,185 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
16:17:39,185 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
16:17:39,185 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
16:17:39,186 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
16:17:39,186 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_wpsearch_settings.json] already exists
16:17:39,186 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [8/_settings.json] already exists
16:17:39,186 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [8/_settings_folder.json] already exists
16:17:39,186 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [8/_wpsearch_settings.json] already exists
16:17:39,186 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [icloud]...
16:17:39,187 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [icloud]...
16:17:39,347 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
16:17:39,409 WARN [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
16:17:39,427 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/home/thisuserhere/Desktop/fscrawler/fscrawler-distribution-2.10-SNAPSHOT/lib/log4j-slf4j-impl-2.21.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
16:17:39,652 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version returns 7.17.10 and 7 as the major version number
16:17:39,652 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 7.17.10
16:17:39,654 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service started
16:17:39,655 WARN [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
16:17:39,656 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version
16:17:39,680 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version returns 7.17.10 and 7 as the major version number
16:17:39,680 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 7.17.10
16:17:39,680 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Elasticsearch Document Service started
16:17:39,681 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [thisuserhere]
16:17:39,689 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Error while running PUT http://127.0.0.1:9200/thisuserhere: {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [thisuserhere/Wz3K5zsZQj6-Wz7Nxq3siA] already exists","index_uuid":"Wz3K5zsZQj6-Wz7Nxq3siA","index":"thisuserhere"}],"type":"resource_already_exists_exception","reason":"index [thisuserhere/Wz3K5zsZQj6-Wz7Nxq3siA] already exists","index_uuid":"Wz3K5zsZQj6-Wz7Nxq3siA","index":"thisuserhere"},"status":400}
16:17:39,689 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Response for create index [thisuserhere]: HTTP 400 Bad Request
16:17:39,689 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [thisuserhere_folder]
16:17:39,692 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Error while running PUT http://127.0.0.1:9200/thisuserhere_folder: {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [thisuserhere_folder/w6w1RObWQ_2XClZVw2bHWA] already exists","index_uuid":"w6w1RObWQ_2XClZVw2bHWA","index":"thisuserhere_folder"}],"type":"resource_already_exists_exception","reason":"index [thisuserhere_folder/w6w1RObWQ_2XClZVw2bHWA] already exists","index_uuid":"w6w1RObWQ_2XClZVw2bHWA","index":"thisuserhere_folder"},"status":400}
16:17:39,692 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Response for create index [thisuserhere_folder]: HTTP 400 Bad Request
16:17:39,693 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [thisuserhere] for [/home/thisuserhere/storage/iCloud/temp] every [3m]
16:17:39,693 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [thisuserhere] for [/home/thisuserhere/storage/iCloud/temp] every [3m]
16:17:39,694 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [thisuserhere] is now running. Run #1...
16:17:39,700 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [/home/thisuserhere/storage/iCloud/temp] content
16:17:39,700 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from /home/thisuserhere/storage/iCloud/temp
16:17:39,703 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 2 local files found
16:17:39,703 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/thisuserhere/storage/iCloud/temp, /home/thisuserhere/storage/iCloud/temp/.DS_Store) = /.DS_Store
16:17:39,703 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/.DS_Store], includes = [[/.doc, /.txt, /.pdf, /.jpeg, /.jpg, /.heic, /.png, /.tiff, /.mov, /.mp4]], excludes = [[/~]]
16:17:39,704 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/.DS_Store], excludes = [[/~]]
16:17:39,704 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/.DS_Store], includes = [[/.doc, /.txt, /.pdf, /.jpeg, /.jpg, /.heic, /.png, /.tiff, /.mov, /.mp4]]
16:17:39,704 DEBUG [f.p.e.c.f.FsParserAbstract] [/.DS_Store] can be indexed: [false]
16:17:39,704 DEBUG [f.p.e.c.f.FsParserAbstract] - ignored file/dir: .DS_Store
16:17:39,704 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/thisuserhere/storage/iCloud/temp, /home/thisuserhere/storage/iCloud/temp/IMG_9543.heic) = /IMG_9543.heic
16:17:39,705 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/IMG_9543.heic], includes = [[/.doc, /.txt, /.pdf, /.jpeg, /.jpg, /.heic, /.png, /.tiff, /.mov, /.mp4]], excludes = [[/~]]
16:17:39,705 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/IMG_9543.heic], excludes = [[/~]]
16:17:39,705 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/IMG_9543.heic], includes = [[/.doc, /.txt, /.pdf, /.jpeg, /.jpg, /.heic, /.png, /.tiff, /.mov, /.mp4]]
16:17:39,705 DEBUG [f.p.e.c.f.FsParserAbstract] [/IMG_9543.heic] can be indexed: [true]
16:17:39,705 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /IMG_9543.heic
16:17:39,705 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/home/thisuserhere/storage/iCloud/temp],[IMG_9543.heic]
16:17:39,706 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/thisuserhere/storage/iCloud/temp, /home/thisuserhere/storage/iCloud/temp/IMG_9543.heic) = /IMG_9543.heic
16:17:39,710 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
16:17:39,711 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
16:17:39,721 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
16:17:39,733 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [ocr_and_text] and tesseract was found.
16:17:39,734 INFO [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
16:17:40,587 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/thisuserhere/storage/iCloud/temp, /home/thisuserhere/storage/iCloud/temp/IMG_9543.heic) = /IMG_9543.heic
16:17:40,605 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Indexing thisuserhere/IMG_9543.heic?pipeline=null
16:17:40,606 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [/home/thisuserhere/storage/iCloud/temp]...
16:17:40,615 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/thisuserhere/storage/iCloud/temp, /home/thisuserhere/storage/iCloud/temp/IMG_9553.heic) = /IMG_9553.heic
16:17:40,615 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/IMG_9553.heic], includes = [[/.doc, /.txt, /.pdf, /.jpeg, /.jpg, /.heic, /.png, /.tiff, /.mov, /.mp4]], excludes = [[/~]]
16:17:40,615 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/IMG_9553.heic], excludes = [[/~]]
16:17:40,615 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/IMG_9553.heic], includes = [[/.doc, /.txt, /.pdf, /.jpeg, /.jpg, /.heic, /.png, /.tiff, /.mov, /.mp4]]
16:17:40,615 DEBUG [f.p.e.c.f.FsParserAbstract] Deleting thisuserhere/IMG_9553.heic
16:17:40,616 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Deleting thisuserhere/IMG_9553.heic
16:17:40,616 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [/home/thisuserhere/storage/iCloud/temp]...
16:17:40,621 INFO [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
16:17:40,696 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [thisuserhere]
16:17:40,697 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
16:17:40,697 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing Elasticsearch client manager
16:17:40,697 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Closing BulkProcessor
16:17:40,697 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] BulkProcessor is now closed
16:17:40,701 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service stopped
16:17:40,701 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing Elasticsearch client manager
16:17:40,701 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Closing BulkProcessor
16:17:40,701 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] BulkProcessor is now closed
16:17:40,702 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Executing [2] remaining actions
16:17:40,702 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Going to execute new bulk composed of 2 actions
16:17:40,708 DEBUG [f.p.e.c.f.c.ElasticsearchEngine] Sending a bulk request of [2] documents to the Elasticsearch service
16:17:40,708 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk a ndjson of 2687 characters
16:17:40,742 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Executed bulk composed of 2 actions
16:17:40,743 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Elasticsearch Document Service stopped
16:17:40,743 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
16:17:40,743 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [thisuserhere] stopped
16:17:40,744 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [thisuserhere]
16:17:40,745 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
16:17:40,745 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing Elasticsearch client manager
16:17:40,745 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service stopped
16:17:40,745 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing Elasticsearch client manager
16:17:40,745 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Elasticsearch Document Service stopped
16:17:40,745 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
16:17:40,745 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [thisuserhere] stopped
Great. Could you run the same thing again with --trace --restart and share the logs again?
You can just share what is between:
DEBUG [f.p.e.c.f.FsParserAbstract] [/IMG_9543.heic] can be indexed: [true]
and
DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [/home/thisuserhere/storage/iCloud/temp]...
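If the trace output is long, a small helper can cut out just that window. This is a rough sketch, not part of FSCrawler: `slice_log` is a hypothetical name, and it assumes the log has been saved to a file with one entry per line.

```python
def slice_log(lines, start_marker, end_marker):
    """Return lines from the first one containing start_marker
    through the next one containing end_marker, inclusive."""
    selected = []
    capturing = False
    for line in lines:
        # Start capturing at the first line that mentions the start marker.
        if not capturing and start_marker in line:
            capturing = True
        if capturing:
            selected.append(line)
            # Stop right after the line that mentions the end marker.
            if end_marker in line:
                break
    return selected


if __name__ == "__main__":
    sample = [
        "DEBUG [f.p.e.c.f.FsParserAbstract] - ignored file/dir: .DS_Store",
        "DEBUG [f.p.e.c.f.FsParserAbstract] [/IMG_9543.heic] can be indexed: [true]",
        "DEBUG [f.p.e.c.f.FsParserAbstract] - file: /IMG_9543.heic",
        "DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [/home/thisuserhere/storage/iCloud/temp]...",
        "INFO [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run",
    ]
    for entry in slice_log(sample, "can be indexed: [true]", "Looking for removed files"):
        print(entry)
```

For a real run, read the saved log with `open(path).read().splitlines()` and pass the result in place of `sample`.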
Okay so it's definitely not parsing OCR on the images (lang detected null, etc) but everything else seems to work.
01:42:33,973 DEBUG [f.p.e.c.f.FsParserAbstract] [/IMG_4000.heic] can be indexed: [true]
01:42:33,974 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /IMG_4000.heic
01:42:33,974 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/home/userhere/Array/temp],[IMG_4000.heic]
01:42:33,975 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/userhere/Array/temp, /home/userhere/Array/temp/IMG_4000.heic) = /IMG_4000.heic
01:42:33,997 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [/home/userhere/Array/temp/IMG_4000.heic]
01:42:34,020 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
01:42:34,029 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
01:42:34,030 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
01:42:34,077 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
01:42:34,204 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [ocr_and_text] and tesseract was found.
01:42:34,204 INFO [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
01:42:35,216 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
01:42:35,697 TRACE [f.p.e.c.f.t.TikaDocParser] Main detected language: [: NONE (0.000000)]
01:42:35,700 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
01:42:35,700 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Null or empty content always matches.
01:42:35,700 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/userhere/Array/temp, /home/userhere/Array/temp/IMG_4000.heic) = /IMG_4000.heic
01:42:35,713 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Indexing userhere/IMG_4000.heic?pipeline=null
01:42:35,714 TRACE [f.p.e.c.f.c.ElasticsearchClient] JSon indexed : {"meta":{"date":"2023-10-23T17:05:50.000+00:00","created":"2023-10-23T17:05:50.000+00:00","raw":{"ICC:Profile Connection Space":"XYZ","Minor Version":"0","ICC:Profile Copyright":"1 enUS(Copyright Apple Inc., 2016)","Exif SubIFD:Time Zone Digitized":"-07:00","X-TIKA:Parsed-By-Full-Set":"org.apache.tika.parser.DefaultParser","ICC:Class":"Input Device","ICC:Unknown tag (0x61617079)":"data (0x64617461): 14 bytes","ICC:Device manufacturer":"APPL","Exif SubIFD:Exif Image Width":"1290 pixels","ICC:Signature":"acsp","Exif SubIFD:User Comment":"Screenshot","ICC:Media White Point":"(0.9642, 1, 0.8251)","ICC:CMM Type":"appl","Exif SubIFD:Sub-Sec Time Original":"000","resourceName":"IMG_4000.heic","ICC:Version":"4.0.0","Exif IFD0:Orientation":"Top, left side (Horizontal / normal)","tiff:Orientation":"1","Major Brand":"heic","ICC:Profile Size":"30252","X-TIKA:Parsed-By":"org.apache.tika.parser.DefaultParser","Bits Per Channel":"8 8 8","ICC:Tag Count":"8","Exif IFD0:Date/Time":"2023:10:23 10:05:50","Exif SubIFD:Time Zone":"-07:00","tiff:ImageLength":"2796","dcterms:created":"2023-10-23T10:05:50","dcterms:modified":"2023-10-23T10:05:50","Exif SubIFD:Sub-Sec Time":"000","ICC:Profile Date/Time":"2016:01:01 00:00:00","Compatible Brands":"mif1 miaf MiHB heic","Exif SubIFD:Color Space":"sRGB","ICC:Profile Description":"1 enUS(Apple Wide Color Sharing Profile)","ICC:AToB 0":"mAB (0x6D414220): 29772 bytes","ICC:AToB 1":"mAB (0x6D414220): 29772 bytes","ICC:AToB 2":"mAB (0x6D414220): 29772 bytes","Height":"512 pixels","Width":"512 pixels","ICC:Color space":"RGB","Content-Type":"image/heic","Exif SubIFD:Date/Time Original":"2023:10:23 10:05:50","Exif SubIFD:Sub-Sec Time Digitized":"000","ICC:XYZ values":"0.964 1 0.825","exif:DateTimeOriginal":"2023-10-23T10:05:50","Rotation":"0 degrees","Exif SubIFD:Time Zone Original":"-07:00","Exif SubIFD:Exif Image Height":"2796 pixels","ICC:Primary Platform":"Apple Computer, Inc.","ICC:Chromatic Adaptation":"sf32 (0x73663332): 44 bytes","Exif SubIFD:Date/Time Digitized":"2023:10:23 10:05:50","tiff:ImageWidth":"1290"}},"file":{"extension":"heic","content_type":"image/heic","created":"2023-10-24T23:17:14.007+00:00","last_modified":"2023-10-24T23:17:14.007+00:00","last_accessed":"2023-10-25T07:15:32.774+00:00","indexing_date":"2023-10-25T08:42:33.975+00:00","filesize":243979,"filename":"IMG_4000.heic","url":"file:///home/userhere/Array/temp/IMG_4000.heic"},"path":{"root":"ead0d21913015c4a9d9472e67e9e2d","virtual":"/IMG_4000.heic","real":"/home/userhere/Array/temp/IMG_4000.heic"}}
01:42:35,714 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [/home/userhere/Array/temp]...
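The document in the "JSon indexed" trace entry above has `meta`, `file` and `path` but no text field, which matches the "Null or empty content" message. To check that for many files at once, something like the following could be used. It is only a sketch: the helper names are made up, and `content` as the name of the extracted-text field is an assumption.

```python
import json

MARKER = "JSon indexed : "


def indexed_document(trace_line):
    """Parse the JSON document from a 'JSon indexed' trace entry.

    Returns None when the line does not contain the marker.
    """
    _, found, payload = trace_line.partition(MARKER)
    if not found:
        return None
    # raw_decode stops at the end of the first JSON value,
    # so any log text that follows the document is ignored.
    doc, _ = json.JSONDecoder().raw_decode(payload.lstrip())
    return doc


def has_extracted_text(doc, field="content"):
    """True when the document carries a non-empty text field (assumed name: content)."""
    value = doc.get(field)
    return isinstance(value, str) and value.strip() != ""


if __name__ == "__main__":
    line = ('01:42:35,714 TRACE [f.p.e.c.f.c.ElasticsearchClient] JSon indexed : '
            '{"file":{"extension":"heic","filename":"IMG_4000.heic"}}')
    doc = indexed_document(line)
    print(doc["file"]["filename"], "has text:", has_extracted_text(doc))
```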
Not supported by tesseract.
Any chance you could do a band-aid solution on fscrawler by generating temporary jpeg versions to scan? This would be great for heic, jpeg-xl and avif formats.
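To make the idea concrete, here is roughly what I mean, as a standalone sketch outside fscrawler. It assumes two command-line tools are installed and on the PATH: `heif-convert` (from the libheif example tools; the binary name may differ on your system) and `tesseract`. The function names are made up for illustration, and this is not how fscrawler or Tika work internally.

```python
import subprocess
import tempfile
from pathlib import Path


def build_commands(heic_path, jpeg_path, language="eng"):
    """Return the two argv lists: HEIC -> JPEG conversion, then OCR to stdout."""
    # Assumes the libheif converter is available as `heif-convert`.
    convert = ["heif-convert", str(heic_path), str(jpeg_path)]
    # `stdout` as the output base makes tesseract print the recognized text.
    ocr = ["tesseract", str(jpeg_path), "stdout", "-l", language]
    return convert, ocr


def ocr_heic(heic_path, language="eng"):
    """OCR a HEIC image via a temporary JPEG that is removed afterwards."""
    # The temporary directory and the JPEG inside it are deleted on exit.
    with tempfile.TemporaryDirectory() as tmp:
        jpeg_path = Path(tmp) / (Path(heic_path).stem + ".jpg")
        convert, ocr = build_commands(heic_path, jpeg_path, language)
        subprocess.run(convert, check=True, capture_output=True)
        result = subprocess.run(ocr, check=True, capture_output=True, text=True)
        return result.stdout
```

Calling `ocr_heic("IMG_4000.heic")` would return the recognized text, or raise `subprocess.CalledProcessError` if either tool fails. The same approach should carry over to avif or jpeg-xl with a different converter.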
It's frustrating because the Mac generates OCR output locally with spotlight when using heic.
Alternatively is there any chance you could code fscrawler to use whatever the Mac is using to generate heic ocr? I wouldn't mind running fscrawler on my Mac and connecting remotely to the elasticsearch server.
Alternatively is there any chance you could code fscrawler to use whatever the Mac is using to generate heic ocr?
Are you aware of any library which would allow this?
I'm using the 2.10-snapshot and I'm scanning my library but it doesn't appear to be indexing heic files. I have heic added as an included filetype in the config file.
According to https://fscrawler.readthedocs.io/en/latest/user/formats.html it supports everything Tika supports.
Tika has heic/heif as a supported format. My system is also configured with support for heic/heif files.
Any idea why it's not working?