dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

FSCrawler 2.7 snapshot is not crawling new folder with documents with Elasticsearch 7.6.1 #934

Open rizwan-ahmad-ms opened 4 years ago

rizwan-ahmad-ms commented 4 years ago

FSCrawler runs and crawls new documents initially when the index is empty, but after some time it stops crawling any new documents or folders, and gives me the warning below:

```
05:54:00,431 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [\ServerIP\DocStorage\Folder1\BlobContainer\25107\25107\GL\20200408-PDF]...
05:54:00,431 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files in dir [path.root:91d0d9e1c12b40118d1c233be55e7b6f]
05:54:00,446 TRACE [f.p.e.c.f.FsParserAbstract] Response [fr.pilato.elasticsearch.crawler.fs.client.ESSearchResponse@452e241f]
05:54:00,446 WARN  [f.p.e.c.f.FsParserAbstract] Can't find stored field name to check existing filenames in path [\ServerIP\DocStorage\Folder1\BlobContainer\25107\25107\GL\20200408-PDF]. Please set store: true on field [file.filename]
05:54:00,462 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling \ServerIP\DocStorage\Folder1\BlobContainer: Mapping is incorrect: please set stored: true on field [file.filename].
05:54:00,462 WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace
java.lang.RuntimeException: Mapping is incorrect: please set stored: true on field [file.filename].
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.getFileDirectory(FsParserAbstract.java:374) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:309) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
05:54:00,477 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 1m
```
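The warning itself names the fix: FSCrawler's default mapping declares `file.filename` with `"store": true`, and the crawler reads that stored field back to detect files it has already indexed. A minimal sketch of the relevant mapping fragment (the index name `indextwi` is taken from the mapping shared later in this thread; note that the `store` setting cannot be changed on an existing field, so the index has to be recreated):

```json
PUT indextwi
{
  "mappings": {
    "properties": {
      "file": {
        "properties": {
          "filename": {
            "type": "keyword",
            "store": true
          }
        }
      }
    }
  }
}
```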

Versions:

dadoonet commented 4 years ago

Your mapping is incorrect. Could you share it? Please format code and logs with markdown so it's more readable.

rizwan-ahmad-ms commented 4 years ago

Here is the index mapping

```json
PUT indextwi
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "content":       { "type": "text" },
      "storageName":   { "type": "keyword", "null_value": "NULL" },
      "storagePath":   { "type": "keyword", "null_value": "NULL" },
      "folderPath":    { "type": "keyword", "null_value": "NULL" },
      "aSN":           { "type": "keyword", "null_value": "NULL" },
      "docType":       { "type": "keyword", "null_value": "NULL" },
      "referenceKey":  { "type": "keyword", "null_value": "NULL" },
      "docNo":         { "type": "keyword", "null_value": "NULL" },
      "tags":          { "type": "keyword", "null_value": "NULL" },
      "comments":      { "type": "text" },
      "status":        { "type": "integer" },
      "oCR":           { "type": "integer" },
      "source":        { "type": "integer" },
      "seqNo":         { "type": "keyword", "null_value": "NULL" },
      "docDate":       { "type": "date" },
      "history":       { "type": "keyword", "null_value": "NULL" },
      "modifiedBy":    { "type": "keyword", "null_value": "NULL" },
      "createdBy":     { "type": "keyword", "null_value": "NULL" },
      "modifiedOn":    { "type": "keyword", "null_value": "NULL" },
      "createdOn":     { "type": "keyword", "null_value": "NULL" },
      "container":     { "type": "keyword", "null_value": "NULL" },
      "containerType": { "type": "keyword", "null_value": "NULL" },
      "containerID":   { "type": "keyword", "null_value": "NULL" },
      "videoUrl":      { "type": "keyword", "null_value": "NULL" }
    }
  }
}
```

dadoonet commented 4 years ago

Please use the code button rather than the quote button.

The mapping you have is not coming from FSCrawler. Have a look at https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#mappings

rizwan-ahmad-ms commented 4 years ago

Thanks for assisting me. I've attached a zip file containing the mapping and config files. Please review it and let me know what the mistake is.

FSCrawlerHelp.zip

Kind Regards, Rizwan

dadoonet commented 4 years ago

I see that you are renaming some fields:

```json
    "field": "file.filename",
    "target_field": "storageName",
```

file.filename must be kept as is, as it is then used by the crawler. You can copy its content to storageName if you wish, but don't remove it.
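A sketch of doing that copy with a `set` processor instead of `rename`, so `file.filename` survives (the pipeline name `fscrawler-pipeline` is a placeholder; the mustache template pulls the existing field's value into the new field):

```json
PUT _ingest/pipeline/fscrawler-pipeline
{
  "description": "Copy file.filename into storageName, keeping the original",
  "processors": [
    {
      "set": {
        "field": "storageName",
        "value": "{{file.filename}}"
      }
    }
  ]
}
```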

Also, you probably need to remove the existing index before starting again otherwise the index template won't be applied.

rizwan-ahmad-ms commented 4 years ago

Many thanks for correcting my issues. My files are now successfully indexed; I moved my index mapping into the _settings.json file and used a set processor in the ingest pipeline instead of rename.

Another issue I'm now facing is that FSCrawler won't re-index the existing files after I delete the index or re-create it with no mapping (e.g. PUT /indexname).

Secondly, is there any REST API to reset or restart FSCrawler, rather than doing it from the CLI?

FSCrawlerHelp - V2.zip

Kind Regards, Rizwan

dadoonet commented 4 years ago

I think you can manually remove the status file if you don't want to run FSCrawler with the --restart option.
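A sketch of that manual reset, assuming the default job directory layout of `~/.fscrawler/<job_name>/_status.json` (the job name `mydocs` and the `FSCRAWLER_JOBS` override are placeholders for illustration):

```shell
# Stop FSCrawler first, then remove the job's status file so the next run
# re-scans everything from scratch (same effect as: fscrawler mydocs --restart).
JOB_DIR="${FSCRAWLER_JOBS:-$HOME/.fscrawler}/mydocs"
rm -f "$JOB_DIR/_status.json"
```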