dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Using FSCrawler on a dockerized instance of Elasticsearch and Kibana #897

Closed richylyq closed 4 years ago

richylyq commented 4 years ago

Is your feature request related to a problem? Please describe.

This is more of a question than a feature request. I want to be able to retrieve the images from the documents I ingested into Elasticsearch. Or, even better, to be able to see a list of the images in Kibana when I build the visualization, though I am currently not sure whether that will work.

Describe the solution you'd like

After ingesting the documents I have, one thing that would work is having the images extracted to the job folder so I can use them whenever I want. Also, ingesting the images into Elasticsearch as well would enable me to use them in my Kibana dashboard, if it works that way.

Describe alternatives you've considered

Currently no alternatives from my side, although I have been trying to write a Python script to ingest the documents, and I am planning to include image extraction in it as well.

dadoonet commented 4 years ago

I have no idea how this could be done. Does Tika allow retrieving such content?

richylyq commented 4 years ago

Correct me if I am wrong: you are using Apache Tika 1.22 for FSCrawler? From what I see in the documentation, image formats are supported, so I wanted to try to retrieve the images, but I am not quite sure how that would work. http://tika.apache.org/1.22/formats.html#Image_formats It would be great if the images could also be used after extraction! Sorry for imposing.

dadoonet commented 4 years ago

Images are supported in the sense that Tika can extract metadata from them and run OCR on them, AFAIK. I don't think, but I might be wrong, that Tika gives access to the extracted images as temporary files. Do you know whether it does?

richylyq commented 4 years ago

I realized it is about extracting metadata after reading the documentation again. I guess I would want to try every other possibility to get at least some image data, to show that something can be extracted from the images, meaningful or not. Do you know how to go about retrieving the metadata and ingesting it into Elasticsearch?

dadoonet commented 4 years ago

If you have a JPEG or other image file, just drop it in the folder that is scanned by FSCrawler, or send it to the REST API, and the extraction will happen.
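
For illustration (not part of FSCrawler itself), here is a minimal Python sketch of sending a file to the FSCrawler REST service. It assumes the crawler was started with the --rest option and listens on its default endpoint http://127.0.0.1:8080/fscrawler; the URL and file name are placeholders to adapt to your setup.

import requests

# Assumed default upload endpoint of the FSCrawler REST service (started with --rest).
FSCRAWLER_UPLOAD_URL = "http://127.0.0.1:8080/fscrawler/_upload"

# Send a local image (placeholder name) as a multipart form upload; FSCrawler then
# runs the same Tika extraction it applies to files found on disk.
with open("photo.jpg", "rb") as f:
    response = requests.post(FSCRAWLER_UPLOAD_URL, files={"file": f})

print(response.status_code, response.text)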

richylyq commented 4 years ago

If you have a JPEG or other image file, just drop it in the folder that is scanned by FSCrawler, or send it to the REST API, and the extraction will happen.

So, am I right to say that I won't be able to extract the metadata of an image embedded in a document that I run through FSCrawler? Dropping the image in the folder scanned by FSCrawler would require the image to already be outside the document.

dadoonet commented 4 years ago

Correct, unless you extract the image yourself. What is the use case for this? What would be the value of indexing embedded image metadata?

richylyq commented 4 years ago

Correct, unless you extract the image yourself.

How do I go about doing this? Does it mean that I would have to run my own script to extract the images beforehand and save them into the folder that I declared in the settings.yaml file in the job folder?

What is the use case for this? What would be the value of indexing embedded image metadata?

I am actually using FSCrawler to ingest PDF and Word documents into Elasticsearch, then using Kibana to visualize the data I have. But since there will be images in the documents, I have to try to extract them as well, and conclude whether that is possible while using FSCrawler.

dadoonet commented 4 years ago

Does it mean that I would have to run my own script to extract the images beforehand and save them into the folder that I declared in the settings.yaml file in the job folder?

Correct.

But since there will be images in the documents, I have to try to extract them as well

I see. But this is not related to search, more to storage. I mean that storing images is a different use case from indexing the text that could be written in the image (OCR).

conclude whether that is possible while using FSCrawler

I don't think it is.
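
If you do go down the route discussed above (extracting the embedded images yourself before FSCrawler scans the folder), here is a minimal Python sketch. It assumes the PyMuPDF library (imported as fitz) and uses placeholder paths; point output_dir at the folder your FSCrawler job scans.

import pathlib

import fitz  # PyMuPDF (an assumption; any library that exposes embedded PDF images would do)

pdf_path = "report.pdf"                          # placeholder input document
output_dir = pathlib.Path("extracted_images")    # placeholder: the folder FSCrawler scans
output_dir.mkdir(parents=True, exist_ok=True)

doc = fitz.open(pdf_path)
for page_number, page in enumerate(doc, start=1):
    for index, img in enumerate(page.get_images(full=True), start=1):
        xref = img[0]                            # cross-reference id of the embedded image
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha >= 4:               # convert CMYK and similar to RGB before saving
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(str(output_dir / f"page{page_number}_img{index}.png"))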

richylyq commented 4 years ago

Does it mean that I would have to run my own script to extract the images beforehand and save them into the folder that I declared in the settings.yaml file in the job folder?

Correct.

But since there will be images in the documents, I have to try to extract them as well

I see. But this is not related to search, more to storage. I mean that storing images is a different use case from indexing the text that could be written in the image (OCR).

conclude whether that is possible while using FSCrawler

I don't think it is.

Thanks for the prompt reply and for clearing my doubts! I will be back with more questions when I encounter them. xD

dadoonet commented 4 years ago

I'm closing this then, as I don't believe we can do anything else. Feel free to add new comments or reopen.

richylyq commented 4 years ago

Hey @dadoonet

Looks like it's time for me to reopen this issue for another question today (:

Does the FSCrawler version affect whether I can ingest my documents into the Elasticsearch instance?

For example, I am using FSCrawler 2.7-SNAPSHOT (from fscrawler-es7-2.7-20200214.112132-78) with a dockerized Elasticsearch 7.6.0. But when I run the .bat file to ingest my documents, I get this error:

11:56:55,222 WARN  [f.p.e.c.f.c.v.ElasticsearchClientV7] failed to create elasticsearch client, disabling crawler...
11:56:55,223 FATAL [f.p.e.c.f.c.FsCrawlerCli] We can not start Elasticsearch Client. Exiting.
java.lang.IllegalArgumentException: Invalid HTTP host: 192.168.1.219:9200/
        at org.apache.http.HttpHost.create(HttpHost.java:122) ~[httpcore-4.4.12.jar:4.4.12]
        at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.lambda$buildRestClient$1(ElasticsearchClientV7.java:385) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
        at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_231]
        at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.buildRestClient(ElasticsearchClientV7.java:385) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.start(ElasticsearchClientV7.java:141) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:257) [fscrawler-cli-2.7-SNAPSHOT.jar:?]
11:56:55,227 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [server] stopped
11:56:55,229 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [server] stopped

Realised I cannot reopen the issue... LOL. Should I open a new issue instead? 🤔

dadoonet commented 4 years ago

Could you share your FSCrawler settings? If you set the host to 192.168.1.219:9200/, change it to 192.168.1.219:9200.
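
For reference, here is a sketch (an assumption on my part, not something confirmed in this thread) of what the elasticsearch section of the job's settings file could look like with the corrected host. It assumes the url-style node settings used by the 2.7 snapshots; older releases use separate host/port/scheme fields. The job name "server" is taken from the logs above.

name: "server"
elasticsearch:
  nodes:
    - url: "http://192.168.1.219:9200"   # note: no trailing slash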

richylyq commented 4 years ago

Could you share your FSCrawler settings? If you set the host to 192.168.1.219:9200/, change it to 192.168.1.219:9200.

Got it running already HAHA

WOW... that one slash....

But I think the version of FSCrawler matters as well. I was using the 20200113 version until Valentine's Day and just thought of downloading the latest version of FSCrawler, which is the 20200214 version, and IT WORKED... Nevertheless... thanks for the prompt reply!! (:

The 20200113 version creates the index in the Elasticsearch instance, but the documents do not get ingested.

richylyq commented 4 years ago

Here's my current scenario,

I have run the FSCrawler job once and ingested the documents into Elasticsearch, and they are reflected on the Indices page. The logs are as follows:

PS C:\Users\yuqua\Desktop\SIT\IWSP\libraries\fscrawler\fscrawler-es7-2.7-20200214.112132-78> bin\fscrawler.bat --config_dir ./RSAF server --loop 1
14:40:07,851 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [227.8mb/3.5gb=6.31%], RAM [6.9gb/15.8gb=43.58%], Swap [6.2gb/21.1gb=29.58%].
14:40:09,025 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.6.0
14:40:09,082 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
14:40:09,293 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [server] for [C:\Users\yuqua\Desktop\SIT\IWSP\libraries\fscrawler\fscrawler-es7-2.7-20200214.112132-78\RSAF\server\SAL] every [15m]
14:40:09,632 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

14:40:12,699 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
14:40:13,120 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [server] stopped
14:40:13,124 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [server] stopped

After seeing the documents inside the Elasticsearch instance, I decided to remove the indices for further testing. To see whether I am still able to ingest my documents into the Elasticsearch instance, I reran FSCrawler and got these logs:

PS C:\Users\yuqua\Desktop\SIT\IWSP\libraries\fscrawler\fscrawler-es7-2.7-20200214.112132-78> bin\fscrawler.bat --config_dir ./RSAF server --loop 1
14:42:22,651 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [227.8mb/3.5gb=6.31%], RAM [6.8gb/15.8gb=43.16%], Swap [6.3gb/21.1gb=29.93%].
14:42:23,813 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.6.0
14:42:23,873 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
14:42:28,766 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [server] for [C:\Users\yuqua\Desktop\SIT\IWSP\libraries\fscrawler\fscrawler-es7-2.7-20200214.112132-78\RSAF\server\SAL] every [15m]
14:42:28,871 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
14:42:28,967 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [server] stopped
14:42:28,970 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [server] stopped
PS C:\Users\yuqua\Desktop\SIT\IWSP\libraries\fscrawler\fscrawler-es7-2.7-20200214.112132-78>

The main difference is that my documents do not get ingested, as the second log shows. The current workaround is for me to delete the _status.json in the job folder, which I believe isn't a good workaround. O:
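
For context, FSCrawler records the date of its last run in the job's _status.json file and on the next run only indexes files added or modified since then, which is why deleting the index alone does not trigger re-ingestion. A possible alternative to deleting _status.json by hand, assuming the --restart command-line option is available in your snapshot, is to restart the job from scratch:

bin\fscrawler.bat --config_dir ./RSAF server --loop 1 --restart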

dadoonet commented 4 years ago

If you have any other issue, please open a new issue.

😉

richylyq commented 4 years ago

Alright! 👍