Open wjsbgsnwss opened 4 years ago
What exactly does your FSCrawler settings file look like, please?
---
name: "xxx"
fs:
  url: "C:\\"
  excludes:
  - "C:\\Program\ Files"
  - "C:\\Program\ Files\ (x86)"
  - "C:\\python\-3.6.5"
  - "C:\\Python27"
  - "C:\\tobedeleted"
  - "C:\\Windows"
  - "C:\\Windows10Upgrade"
  - "C:\\winnt"
  - "C:\\elastic741"
  - "C:\\DONOTDELETE"
  - "C:\\$Recycle.Bin"
  update_rate: "15m"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  follow_symlink: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://xxx:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
I tried this as well; it did not work either.
---
name: "xxx"
fs:
  url: "C:\\"
  excludes:
  - "C:\\Program Files\\*"
  - "C:\\Program Files (x86)\\*"
  - "C:\\python\-3.6.5\\*"
  - "C:\\Python27\\*"
  - "C:\\tobedeleted\\*"
  - "C:\\Windows\\*"
  - "C:\\Windows10Upgrade\\*"
  - "C:\\winnt\\*"
  - "C:\\elastic741\\*"
  - "C:\\DONOTDELETE\\*"
  - "C:\\$Recycle.Bin\\*"
  update_rate: "15m"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  follow_symlink: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://xxx:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
Could you run it with the --trace option?
Here is the YAML:
---
name: "xxxx"
fs:
  url: "C:\\"
  excludes:
  - "C:\\DONOTDELETE"
  - "C:\\$Recycle.Bin"
  update_rate: "15m"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  follow_symlink: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://xxxx:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
And here is an excerpt of the output with the --trace option on:
07:23:37,118 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='sidebar.js', file=true, directory=false, lastModifiedDate=2019-11-12T11:50:00.836873, creationDate=2019-11-14T18:44:14.173026, accessDate=2019-11-14T18:44:14.173026, path='C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components', owner='', group='null', permissions=-1, extension='js', fullpath='C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js', size=0}
07:23:37,119 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\, C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js) = DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js
07:23:37,119 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js], includes = [null], excludes = [[C:\DONOTDELETE, C:\$Recycle.Bin]]
07:23:37,119 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js], excludes = [[C:\DONOTDELETE, C:\$Recycle.Bin]]
07:23:37,119 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [c:\donotdelete]
07:23:37,119 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [c:\$recycle.bin]
07:23:37,119 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
07:23:37,120 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js], includes = [null]
07:23:37,120 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
07:23:37,120 DEBUG [f.p.e.c.f.FsParserAbstract] [DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js] can be indexed: [true]
07:23:37,120 DEBUG [f.p.e.c.f.FsParserAbstract] - file: DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js
07:23:37,120 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components],[sidebar.js]
07:23:37,121 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\, C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js) = DONOTDELETE/apache-tomcat-9.0.27/webapps/Chapter07/components/sidebar.js
07:23:37,121 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js]
07:23:37,121 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
07:23:37,121 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js] -> InputStream must have > 0 bytes
07:23:37,122 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js]
org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122) ~[tika-core-1.22.jar:1.22]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:138) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:93) [fscrawler-tika-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:474) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:267) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:291) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
07:23:37,123 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
07:23:37,123 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Null or empty content always matches.
07:23:37,124 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing gomypc/1e81981d16db5e98df7aa13f77dd3f9e?pipeline=null
07:23:37,124 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
"meta" : { },
"file" : {
"extension" : "js",
"content_type" : "application/javascript",
"created" : "2019-11-14T07:44:14.173+0000",
I did not get a chance to capture the log for the C:\$Recycle.Bin directory.
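The trace above hints at why the absolute-path excludes never fire: each pattern is lowercased and compared against the virtual path (relative to fs.url, with forward slashes), not against the Windows path. Below is a minimal Python sketch of that comparison. It assumes the pattern is treated as a regular expression that must match the whole virtual path; the function names virtual_path and is_excluded are hypothetical, and this is an illustration of the log output, not FSCrawler's actual code.

```python
import re
from pathlib import PureWindowsPath


def virtual_path(root, full_path):
    """Path relative to the crawl root, with forward slashes (as printed in the trace)."""
    return PureWindowsPath(full_path).relative_to(PureWindowsPath(root)).as_posix()


def is_excluded(vpath, patterns):
    """Assumed rule: lowercase both sides; a pattern must match the whole virtual path."""
    return any(re.fullmatch(p.lower(), vpath.lower()) for p in patterns)


vpath = virtual_path(
    "C:\\",
    r"C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\Chapter07\components\sidebar.js",
)
print(vpath)

# Patterns as they look after YAML decoding: C:\DONOTDELETE and C:\$Recycle.Bin.
# Lowercased, "c:\donotdelete" even contains "\d", which a regex engine reads as "any digit".
print(is_excluded(vpath, ["C:\\DONOTDELETE", "C:\\$Recycle.Bin"]))

# A pattern written against the virtual path matches under the assumed rule.
print(is_excluded(vpath, ["donotdelete/.*"]))
```

Under these assumptions the absolute patterns can never match, because the string being tested no longer contains the drive letter or any backslashes. Whether FSCrawler accepts the exact form of the last pattern should be checked against its documentation.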
Could you try with:
---
name: "xxxx"
fs:
  url: "C:\\"
  excludes:
  - "donotdelete"
  - "\\$recycle\\.bin"
  update_rate: "15m"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  follow_symlink: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://xxxx:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
Or also with \$recycle\.bin
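The backslashes in this suggestion matter if the exclude is evaluated as a regular expression, which the "regex is [...]" trace lines suggest. Unescaped, "$" asserts the end of the input and "." matches any character. A short Python illustration follows; using Python's re module as a stand-in for whatever regex engine FSCrawler uses is an assumption made for this sketch.

```python
import re

name = "$recycle.bin"

# Unescaped: "$" asserts end-of-input at position 0, so the match fails.
unescaped = re.fullmatch(r"$recycle.bin", name)
print(unescaped)

# Escaped: "\$" and "\." stand for the literal characters.
escaped = re.fullmatch(r"\$recycle\.bin", name)
print(escaped is not None)

# re.escape builds the escaped form of an arbitrary literal name.
print(re.escape(name))
```

The same escaping concern applies to any folder name that contains regex metacharacters, such as the parentheses in "Program Files (x86)".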
Hi David,
Thanks for the advice. Here is the output of the trace log. I have not been able to test $Recycle.Bin yet because the log is huge, and so far the 'donotdelete' exclude does not seem to work.
Two lines are copied here:
regex is [donnotdelete]
[DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi] can be indexed: [true]
Is the directory name matching case sensitive?
09:50:04,442 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi], includes = [null]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
09:50:04,442 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi], excludes = [[donnotdelete, \$recycle\.bin, elastic741, elk, Intel, iSkysoft Video Converter Ultimate, MDTBuild, msys64, PerfLogs, Program Files, Program Files (x86), ProgramData, Python27, python-3.6.5, Recovery, System Volume Information, temp, Windows, Windows10Upgrade, winnt]]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [donnotdelete]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [\$recycle\.bin]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [elastic741]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [elk]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [intel]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [iskysoft video converter ultimate]
09:50:04,442 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [mdtbuild]
09:50:04,443 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [msys64]
09:50:04,443 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [perflogs]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [program files]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [program files (x86)]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [programdata]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [python27]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [python-3.6.5]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [recovery]
09:50:04,449 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [system volume information]
09:50:04,450 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [temp]
09:50:04,450 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [windows]
09:50:04,450 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [windows10upgrade]
09:50:04,450 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [winnt]
09:50:04,450 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
09:50:04,450 DEBUG [f.p.e.c.f.FsParserAbstract] [DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi] can be indexed: [true]
09:50:04,451 DEBUG [f.p.e.c.f.FsParserAbstract] - folder: elapi
09:50:04,451 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\, C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\docs\elapi) = DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi
09:50:04,451 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing gomypc_folder/efeddc6fc4a0cb13774482b9accfcea8?pipeline=null
09:50:04,452 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
"root" : "8e14762484c92df63a54988a7375a8b",
"virtual" : "DONOTDELETE/apache-tomcat-9.0.27/webapps/docs/elapi",
"real" : "C:\\DONOTDELETE\\apache-tomcat-9.0.27\\webapps\\docs\\elapi"
}
09:50:04,452 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\docs\elapi] content
09:50:04,452 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\docs\elapi
09:50:04,455 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [C:\DONOTDELETE\apache-tomcat-9.0.27\webapps\docs\elapi\index.html] on [windows 10]
Also, may I suggest adding a feature that allows 'url' to take multiple directories? On Linux it is not a big deal, but on Windows it is always troublesome.
On Windows, I tried to create shortcuts to put multiple directories together under a single entry point, but it did seem to work.
I looked at it.
The exclude regex you are using is donnotdelete, but the directory you are comparing it to is DONOTDELETE.
The first has two "n"s where the second has one. That probably explains it.
Could you check and use donotdelete as the exclude content?
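The mismatch is easy to miss by eye. Here is a small Python check of the two spellings. It also shows that lowercasing both sides makes case irrelevant, which is what the trace suggests the crawler does (the "Intel" entry appears as "regex is [intel]"). The helper name first_difference is hypothetical.

```python
def first_difference(a, b):
    """Return the index of the first differing character, or None if the strings are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


configured = "donnotdelete"  # pattern as printed in the trace
directory = "DONOTDELETE"    # first segment of the virtual path

# The typo, not the case, is what breaks the comparison.
print(configured == directory.lower())
print(first_difference(configured, directory.lower()))

# The corrected spelling equals the lowercased directory name.
print("donotdelete" == directory.lower())
```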
Describe the bug
Windows system folders such as C:\$Recycle.bin cannot be excluded, even though they are defined in any one of the forms below:
exclude:
To Reproduce
As above and the error log is as below:
Expected behavior
Not shown in the log (debug mode)
Versions:
Screenshots
NA