dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

"File owner not determined" warnings on a network drive - FSCrawler indexing SharePoint files #932

Closed Lisahtwy closed 3 years ago

Lisahtwy commented 4 years ago

Hi David,

I have mounted my SharePoint VM to a network drive and ran FSCrawler. It is displaying the warnings below; here are the debug logs:

   C:\Program Files\fscrawler-es7-2.7-SNAPSHOT>.\bin\fscrawler index_sharepoint --debug
    11:09:22,977 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [10.3mb/247.5mb=4.19%], RAM [126.6mb/1023.6mb=12.37%], Swap [1.7gb/3.5gb=49.46%].
    11:09:23,009 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
    11:09:23,009 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
    11:09:23,009 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
    11:09:23,009 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
    11:09:23,009 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [index_sharepoint]...
    11:09:23,493 DEBUG [f.p.e.c.f.c.ElasticsearchClientUtil] Trying to find a client version 7
    11:09:24,946 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.5.1
    11:09:25,071 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
    11:09:25,087 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
    11:09:25,118 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] FS crawler connected to an elasticsearch [7.5.1] node.
    11:09:25,118 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [index_sharepoint]
    11:09:25,993 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [index_sharepoint]
    11:09:26,055 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [index_sharepoint_folder]
    11:09:26,118 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [index_sharepoint_folder]
    11:09:26,165 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [index_sharepoint] for [W:\fsSharepointFiles] every [15m]
    11:09:26,180 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [index_sharepoint] for [W:\fsSharepointFiles] every [15m]
    11:09:26,180 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [index_sharepoint] is now running. Run #1...
    11:09:26,321 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles) = /
    11:09:26,321 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing index_sharepoint_folder/e6e39586a01b119482edbc6549b99d21?pipeline=null
    11:09:26,337 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [W:\fsSharepointFiles] content
    11:09:26,337 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from W:\fsSharepointFiles
    11:09:28,790 WARN  [f.p.e.c.f.f.FsCrawlerUtil] Failed to determine 'owner' of W:\fsSharepointFiles\fsSharepointfile2.txt: W:\fsSharepointFiles\fsSharepointfile2.txt: Incorrect function.
    11:09:28,837 WARN  [f.p.e.c.f.f.FsCrawlerUtil] Failed to determine 'owner' of W:\fsSharepointFiles\fsSharepointfile1.txt: W:\fsSharepointFiles\fsSharepointfile1.txt: Incorrect function.
    11:09:28,868 WARN  [f.p.e.c.f.f.FsCrawlerUtil] Failed to determine 'owner' of W:\fsSharepointFiles\fsSharepointfile4.txt: W:\fsSharepointFiles\fsSharepointfile4.txt: Incorrect function.
    11:09:28,915 WARN  [f.p.e.c.f.f.FsCrawlerUtil] Failed to determine 'owner' of W:\fsSharepointFiles\fsSharepointfile3.txt: W:\fsSharepointFiles\fsSharepointfile3.txt: Incorrect function.
    11:09:28,930 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 4 local files found
    11:09:28,930 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile2.txt) = /fsSharepointfile2.txt
    11:09:28,930 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/fsSharepointfile2.txt], includes = [null], excludes = [[*/~*]]
    11:09:28,930 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile2.txt], excludes = [[*/~*]]
    11:09:28,930 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile2.txt], includes = [null]
    11:09:28,930 DEBUG [f.p.e.c.f.FsParserAbstract] [/fsSharepointfile2.txt] can be indexed: [true]
    11:09:28,930 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /fsSharepointfile2.txt
    11:09:29,024 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [W:\fsSharepointFiles],[fsSharepointfile2.txt]
    11:09:29,024 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile2.txt) = /fsSharepointfile2.txt
    11:09:29,102 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
    11:09:29,134 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
    11:09:30,430 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
    See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
    for optional dependencies.

    11:09:31,024 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
    11:09:31,024 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
    11:09:31,587 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing index_sharepoint/7a14b8708c8b2ddceb2f3f3657e6889?pipeline=null
    11:09:31,587 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile1.txt) = /fsSharepointfile1.txt
    11:09:31,587 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/fsSharepointfile1.txt], includes = [null], excludes = [[*/~*]]
    11:09:31,587 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile1.txt], excludes = [[*/~*]]
    11:09:31,587 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile1.txt], includes = [null]
    11:09:31,587 DEBUG [f.p.e.c.f.FsParserAbstract] [/fsSharepointfile1.txt] can be indexed: [true]
    11:09:31,587 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /fsSharepointfile1.txt
    11:09:31,602 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [W:\fsSharepointFiles],[fsSharepointfile1.txt]
    11:09:31,633 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile1.txt) = /fsSharepointfile1.txt
    11:09:31,680 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing index_sharepoint/cde27657a17be8db852aecb17b97ad6d?pipeline=null
    11:09:31,696 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile4.txt) = /fsSharepointfile4.txt
    11:09:31,696 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/fsSharepointfile4.txt], includes = [null], excludes = [[*/~*]]
    11:09:31,696 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile4.txt], excludes = [[*/~*]]
    11:09:31,696 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile4.txt], includes = [null]
    11:09:31,696 DEBUG [f.p.e.c.f.FsParserAbstract] [/fsSharepointfile4.txt] can be indexed: [true]
    11:09:31,696 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /fsSharepointfile4.txt
    11:09:31,696 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [W:\fsSharepointFiles],[fsSharepointfile4.txt]
    11:09:31,696 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile4.txt) = /fsSharepointfile4.txt
    11:09:31,727 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing index_sharepoint/fb1515bfb746acca7b73a1b88ac52?pipeline=null
    11:09:31,727 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile3.txt) = /fsSharepointfile3.txt
    11:09:31,743 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/fsSharepointfile3.txt], includes = [null], excludes = [[*/~*]]
    11:09:31,743 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile3.txt], excludes = [[*/~*]]
    11:09:31,743 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/fsSharepointfile3.txt], includes = [null]
    11:09:31,743 DEBUG [f.p.e.c.f.FsParserAbstract] [/fsSharepointfile3.txt] can be indexed: [true]
    11:09:31,743 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /fsSharepointfile3.txt
    11:09:31,743 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [W:\fsSharepointFiles],[fsSharepointfile3.txt]
    11:09:31,743 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(W:\fsSharepointFiles, W:\fsSharepointFiles\fsSharepointfile3.txt) = /fsSharepointfile3.txt
    11:09:31,899 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing index_sharepoint/81e0219ddfdaa3c1641fcdb73f78d8?pipeline=null
    11:09:31,899 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [W:\fsSharepointFiles]...
    11:09:31,977 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [W:\fsSharepointFiles]...
    11:09:32,258 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m 

-Lisa

dadoonet commented 4 years ago

The only guess I can make is that the user who created the file is unknown on the machine where the drive is mounted. Is there a way to access the file from the machine where FSCrawler is running, right-click on it in the file manager, and share a screen capture (of the Owner tab, if it exists)?
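For reference, here is a minimal sketch of the kind of owner lookup that produces this warning. It uses plain Java NIO and is not necessarily FSCrawler's exact code; the path is just taken from the logs above. On a WebDAV/SharePoint-backed mapped drive, the underlying file system may not support the owner attribute, and Windows reports "Incorrect function":

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.attribute.UserPrincipal;

    public class OwnerLookupSketch {
        public static void main(String[] args) {
            // Path taken from the logs above; adjust to your own mapped drive.
            Path file = Paths.get("W:\\fsSharepointFiles\\fsSharepointfile1.txt");
            try {
                // Asks the underlying file system for the file's owner.
                // A WebDAV/SharePoint-backed drive may not implement this,
                // and Windows then raises an IOException ("Incorrect function").
                UserPrincipal owner = Files.getOwner(file);
                System.out.println("owner = " + owner.getName());
            } catch (IOException e) {
                // FSCrawler logs a comparable failure as a WARN and keeps indexing the file.
                System.err.println("Failed to determine 'owner' of " + file + ": " + e.getMessage());
            }
        }
    }

If this standalone check fails the same way on the machine running FSCrawler, the warning comes from the drive mapping itself rather than from FSCrawler.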

Lisahtwy commented 4 years ago

I don't see any Owner tab in the Properties... is this what you asked for?

(screenshots of the file's Properties dialog attached)

dadoonet commented 4 years ago

There's apparently nothing about the owner here. I'm going to update the project to add a new trace and will then ask you to run another test for me.

Lisahtwy commented 4 years ago

sure!

dadoonet commented 4 years ago

Could you run this version https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7-SNAPSHOT/fscrawler-es7-2.7-20200411.092138-97.zip in --debug mode? It should print a full stack trace. Could you share it then?

Lisahtwy commented 4 years ago

Sorry, just saw your response. Will do it today!

dadoonet commented 4 years ago

Hey @Lisahtwy. Did you have time to test this?

dadoonet commented 3 years ago

No further news, so I'm closing. Feel free to reopen the issue with more details. Thanks!