dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Exclude of Windows Directory is not working #862

Open wjsbgsnwss opened 4 years ago

wjsbgsnwss commented 4 years ago

Describe the bug

Windows system folders such as C:\$Recycle.Bin cannot be excluded, even when defined as any one of the patterns below:

exclude:

To Reproduce

Steps as above; the debug log is as follows:

20:46:05,432 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\, C:\$Recycle.Bin\S-1-5-21-38895556-2084396700-2036031536-69525) = $Recycle.Bin/S-1-5-21-38895556-2084396700-2036031536-69525
20:46:05,433 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing gomypc_folder/4433e382ff9f9129db01c5629977c8c?pipeline=null
20:46:05,436 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [C:\$Recycle.Bin\S-1-5-21-38895556-2084396700-2036031536-69525] content
20:46:05,437 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from C:\$Recycle.Bin\S-1-5-21-38895556-2084396700-2036031536-69525

Expected behavior

The excluded directories should not appear in the log (debug mode).

Versions:

Screenshots

NA

dadoonet commented 4 years ago

What exactly is your FSCrawler settings file, please?

wjsbgsnwss commented 4 years ago
---
name: "xxx"
fs:
  url: "C:\\"
  excludes: 
  - "C:\\Program\ Files"
  - "C:\\Program\ Files\ (x86)"
  - "C:\\python\-3.6.5"
  - "C:\\Python27"
  - "C:\\tobedeleted"
  - "C:\\Windows"
  - "C:\\Windows10Upgrade"
  - "C:\\winnt"
  - "C:\\elastic741"
  - "C:\\DONOTDELETE"
  - "C:\\$Recycle.Bin"  
  update_rate: "15m"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  follow_symlink: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://xxx:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
wjsbgsnwss commented 4 years ago

Tried this as well, did not work either.

---
name: "xxx"
fs:
  url: "C:\\"
  excludes: 
  - "C:\\Program Files\\*"
  - "C:\\Program Files (x86)\\*"
  - "C:\\python\-3.6.5\\*"
  - "C:\\Python27\\*"
  - "C:\\tobedeleted\\*"
  - "C:\\Windows\\*"
  - "C:\\Windows10Upgrade\\*"
  - "C:\\winnt\\*"
  - "C:\\elastic741\\*"
  - "C:\\DONOTDELETE\\*"
  - "C:\\$Recycle.Bin\\*"  
  update_rate: "15m"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  follow_symlink: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://xxx:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
dadoonet commented 4 years ago

Could you run it with the --trace option?

wjsbgsnwss commented 4 years ago

Here is the yaml:


---
name: "xxxx"
fs:
  url: "C:\\"
  excludes: 
  - "C:\\DONOTDELETE"
  - "C:\\$Recycle.Bin"  
  update_rate: "15m"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  follow_symlink: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://xxxx:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

And here is the excerpt of the output with trace option on:


07:23:37,118 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='sidebar.js', file=true, directory=false, lastModifiedDate=2019-11-12T11:50:00.836873, creationDate=2019-11-14T18:44:14.173026, accessDate=2019-11-14T18:44:14.173026, path='C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components', owner='', group='null', permissions=-1, extension='js', fullpath='C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js', size=0}
07:23:37,119 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\, C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js) = DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js
07:23:37,119 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js], includes = [null], excludes = [[C:\DONOTDELETE, C:\$Recycle.Bin]]
07:23:37,119 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js], excludes = [[C:\DONOTDELETE, C:\$Recycle.Bin]]
07:23:37,119 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [c:\donotdelete]
07:23:37,119 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [c:\$recycle.bin]
07:23:37,119 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
07:23:37,120 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js], includes = [null]
07:23:37,120 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
07:23:37,120 DEBUG [f.p.e.c.f.FsParserAbstract] [DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js] can be indexed: [true]
07:23:37,120 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js
07:23:37,120 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components],[sidebar.js]
07:23:37,121 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\, C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js) = DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js
07:23:37,121 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js]
07:23:37,121 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
07:23:37,121 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js]  -> InputStream must have > 0 bytes
07:23:37,122 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js]
org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122) ~[tika-core-1.22.jar:1.22]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:138) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:93) [fscrawler-tika-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:474) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:267) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
07:23:37,123 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
07:23:37,123 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Null or empty content always matches.
07:23:37,124 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing gomypc/1e81981d16db5e98df7aa13f77dd3f9e?pipeline=null
07:23:37,124 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "meta" : { },
  "file" : {
    "extension" : "js",
    "content_type" : "application/javascript",
    "created" : "2019-11-14T07:44:14.173+0000",

I have not been able to capture the log for the C:\$Recycle.Bin directory yet.

dadoonet commented 4 years ago

Could you try with:

---
name: "xxxx"
fs:
  url: "C:\\"
  excludes: 
  - "donotdelete"
  - "\\$recycle\\.bin"  
  update_rate: "15m"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  follow_symlink: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://xxxx:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

Or also with \$recycle\.bin

wjsbgsnwss commented 4 years ago

hi David,

Thanks for the advice; here is the output of the trace log. I am not able to test $Recycle.Bin yet because the log is huge, and so far the 'donotdelete' exclude does not seem to work:

Two lines are copied here:

regex is [donnotdelete]
[DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi] can be indexed: [true]

Is the directory name matching case sensitive?


09:50:04,442 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi], includes = [null]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
09:50:04,442 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi], excludes = [[donnotdelete, \$recycle\.bin, elastic741, elk, Intel, iSkysoft Video Converter Ultimate, MDTBuild, msys64, PerfLogs, Program Files, Program Files (x86), ProgramData, Python27, python-3.6.5, Recovery, System Volume Information, temp, Windows, Windows10Upgrade, winnt]]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [donnotdelete]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [\$recycle\.bin]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [elastic741]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [elk]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [intel]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [iskysoft video converter ultimate]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [mdtbuild]
09:50:04,443 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [msys64]
09:50:04,443 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [perflogs]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [program files]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [program files (x86)]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [programdata]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [python27]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [python-3.6.5]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [recovery]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [system volume information]
09:50:04,450 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [temp]
09:50:04,450 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [windows]
09:50:04,450 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [windows10upgrade]
09:50:04,450 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [winnt]
09:50:04,450 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
09:50:04,450 DEBUG [f.p.e.c.f.FsParserAbstract]###  [DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi] can be indexed: [true]
09:50:04,451 DEBUG [f.p.e.c.f.FsParserAbstract]   - folder: elapi
09:50:04,451 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\, C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\docs\elapi) = DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi
09:50:04,451 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing gomypc_folder/efeddc6fc4a0cb13774482b9accfcea8?pipeline=null
09:50:04,452 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "root" : "8e14762484c92df63a54988a7375a8b",
  "virtual" : "DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi",
  "real" : "C:\\DONOTDELETE\\apache-tomcat-9.0.27\\webapps\\docs\\elapi"
}
09:50:04,452 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\docs\elapi] content
09:50:04,452 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\docs\elapi
09:50:04,455 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\docs\elapi\index.html] on [windows 10]
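The trace above hints at the mechanism: each exclude appears to be treated as a regex and searched, case-insensitively, against the virtual path (forward slashes, no drive letter). A minimal Python sketch of that assumed behavior — the helper `matches_exclude` and its matching details are inferred from the log output, not taken from FSCrawler's source:

```python
import re

def matches_exclude(virtual_path: str, excludes: list[str]) -> bool:
    """Approximation inferred from the trace: each exclude is a regex,
    searched case-insensitively against the virtual path."""
    for pattern in excludes:
        if re.search(pattern, virtual_path, re.IGNORECASE):
            return True
    return False

# A literal Windows path never matches: the virtual path has no "C:\" prefix,
# and the unescaped "$" and "." are regex metacharacters.
print(matches_exclude("$Recycle.Bin/S-1-5-21", [r"C:\\\$Recycle\.Bin"]))  # False

# Escaping the metacharacters and dropping the drive prefix works:
print(matches_exclude("$Recycle.Bin/S-1-5-21", [r"\$recycle\.bin"]))  # True

# Matching is case-insensitive, but the pattern must be spelled correctly:
virtual = "DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi"
print(matches_exclude(virtual, ["donotdelete"]))   # True
print(matches_exclude(virtual, ["donnotdelete"]))  # False (extra "n")
```

Under that assumption, case is not the problem; the literal `C:\...` patterns and the unescaped metacharacters are.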
wjsbgsnwss commented 4 years ago

Also, may I suggest adding a feature that allows 'url' to take multiple directories? On Linux this is not a big deal, but on Windows it is always troublesome.

On Windows, I tried creating shortcuts to gather multiple directories under a single entry point, but that did not seem to work.

dadoonet commented 4 years ago

I looked at it.

The exclude regex you are using is donnotdelete, but the directory you are comparing it to is DONOTDELETE. The first has two "n"s where the second has one. That probably explains it.

Could you check and use donotdelete as the exclude content?
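Assuming, as the trace suggests, that excludes are regexes matched case-insensitively against the virtual path, a corrected excludes block for this configuration might look like the sketch below (pattern spellings are assumptions based on the discussion above, not output confirmed in this thread):

```yaml
fs:
  url: "C:\\"
  excludes:
  - "donotdelete"        # one "n"; matches C:\DONOTDELETE case-insensitively
  - "\\$recycle\\.bin"   # $ and . are regex metacharacters and must be escaped
```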