dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Extract ACL information #464

Open Denis4545 opened 6 years ago

Denis4545 commented 6 years ago

Hello. We are crawling FTP servers, and I've noticed that the basic metadata fields (meta.created, meta.metadata_date, meta.author) aren't extracted from txt, log, csv, html, and htm files. Could you please explain why these fields aren't extracted for those file types? See the example for text files below (image).

dadoonet commented 6 years ago

Because a txt file has no metadata associated with it in its file.

konovalcev commented 6 years ago

@dadoonet Hello David. +1. Could you explain in more detail, please? I thought there were some base attributes for any file in any file system, such as "date created", "date modified", and "size", that fscrawler should extract and load into Elasticsearch. We are building a global search for our company and chose your crawler (thanks for this project!) to create indexes for all our FTP servers. We have crawled all the FTP servers and are now working on the front end of the search system. It should offer users filters (like Google's) by file extension, date created, date modified, and so on. But we ran into a problem: there is no "date created" information for at least .txt, .log, and .html files, meaning the crawler doesn't extract this field from these files. That leaves our front-end filter incomplete, since it will not work for all files. Do you know why this happens with the "date created" field and how we can fix it? Thanks.

dadoonet commented 6 years ago

Reopening to think about it.

konovalcev commented 6 years ago

@dadoonet Hello David. Thanks for your time. I will wait for your thoughts, because this is important to us and we cannot move forward without understanding this issue. Thanks again.

dadoonet commented 6 years ago

Maybe I should try to extract more data coming from the filesystem and add it here:

https://github.com/dadoonet/fscrawler/blob/master/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsCrawlerImpl.java#L627-L653

And then if something is provided when the Tika extraction is done, overwrite the "FS" value. Like what we do in: https://github.com/dadoonet/fscrawler/blob/master/src/main/java/fr/pilato/elasticsearch/crawler/fs/tika/TikaDocParser.java#L108-L110

That said, I'm pretty sure that some data will depend on the FS implementation. So we might only be able to capture a few things... Or it will mean detecting the FS first and then extracting more or less data depending on it, which will take more time to fix IMO.
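For a local file system, the fallback described above (take the value from the file system, then let a value found by the parser override it) could look roughly like the following Java sketch. `FsDates`, `fsCreated`, and `resolveCreated` are hypothetical names and not FSCrawler code; only the `java.nio.file` calls are standard JDK API. Note that `creationTime()` may return the last-modified time or the epoch on file systems that do not track creation time, and a crawl over FTP can only use whatever timestamps the server reports.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Instant;

// Hypothetical helper for illustration; not part of FSCrawler.
final class FsDates {

    // Reads the creation time reported by the file system.
    // Depending on the FS implementation this can be the last-modified
    // time or the epoch when no real creation time is stored.
    static Instant fsCreated(Path file) {
        try {
            BasicFileAttributes attrs =
                    Files.readAttributes(file, BasicFileAttributes.class);
            return attrs.creationTime().toInstant();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Prefers the value found in the document itself (for example by Tika)
    // and falls back to the file system value when the parser found nothing.
    static Instant resolveCreated(Instant fromParser, Instant fromFs) {
        return fromParser != null ? fromParser : fromFs;
    }
}
```

With this shape, a txt or log file with no embedded metadata would still end up with a "created" value, while an office document keeps the date stored inside it.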

Can you list exactly which fields you need for your project as an MVP?

konovalcev commented 6 years ago

@dadoonet Hello David. Since you are asking, I would also like to ask for the ability to extract the ACL of every file. That would be a very cool feature: it would let us enforce security in the front end, so that a user only gets results for files to which they have at least read-only access in the file system. As all our FTP servers run on Windows Server, I'm talking about Windows NTFS ACLs, like this:

acl

If you add "Date Created" and the ACL to the index, it will be incredible support from your side. If adding the ACL is not possible, then the only field we need is "Date Created". All other fields are OK for us. Thanks!

dadoonet commented 6 years ago

Maybe I can use that: https://myshittycode.com/2013/09/10/reading-directoryfiles-acl-directly-from-java/

I need to give it a try when I have time.
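As a rough idea of what reading an ACL from Java involves, here is a minimal sketch based on the JDK's `AclFileAttributeView`. The class and method names (`AclReader`, `readAcl`, `describe`) are hypothetical and this is not the code from the linked post or from FSCrawler. On file systems without ACL support (a default Linux or macOS setup, for instance) the view is typically `null`, so the sketch returns an empty list there.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.AclEntry;
import java.nio.file.attribute.AclFileAttributeView;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of FSCrawler.
final class AclReader {

    // Returns the ACL entries of a file, or an empty list when the
    // underlying file system does not expose an ACL view.
    static List<AclEntry> readAcl(Path file) {
        AclFileAttributeView view =
                Files.getFileAttributeView(file, AclFileAttributeView.class);
        if (view == null) {
            return Collections.emptyList();
        }
        try {
            return view.getAcl();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One-line text form of an entry: principal, ALLOW/DENY type, permissions.
    static String describe(AclEntry entry) {
        return entry.principal().getName() + " " + entry.type() + " " + entry.permissions();
    }
}
```

Because the result depends on the platform, this would need to be tested on Windows with NTFS to see what principals and permissions actually come back.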

konovalcev commented 6 years ago

@dadoonet Hello David. That's perfect. I will wait for your response. Could you perhaps implement the "Date created" field first, since it is more important for us right now? Thanks again!

hatemjaafar commented 6 years ago

@dadoonet Happy new year! First, I want to thank you for all the work you have done. I need to get the ACL from Windows files with fscrawler, and I hope there is news on this issue. Thanks.

dadoonet commented 6 years ago

So I'm working on that. Sadly I'm a bit blind as I'm working on macOS and not on Windows :( I'll try to see if I can run tests on Windows afterwards, though.

Anyway, what could be a good representation of the data? Let's take as an example that we have:

Owner:
    BUILTIN\Administrators (Alias)

ACL:
    NT AUTHORITY\SYSTEM (Well-known group)        [APPEND_DATA, WRITE_ATTRIBUTES, DELETE, SYNCHRONIZE, READ_DATA, WRITE_ACL, WRITE_DATA, READ_ATTRIBUTES, WRITE_NAMED_ATTRS, READ_ACL, DELETE_CHILD, WRITE_OWNER, EXECUTE, READ_NAMED_ATTRS]
    BUILTIN\Administrators (Alias)        [APPEND_DATA, WRITE_ATTRIBUTES, DELETE, SYNCHRONIZE, READ_DATA, WRITE_ACL, WRITE_DATA, READ_ATTRIBUTES, WRITE_NAMED_ATTRS, READ_ACL, DELETE_CHILD, WRITE_OWNER, EXECUTE, READ_NAMED_ATTRS]
    MYDOMAIN\thundercat (User)        [APPEND_DATA, WRITE_ATTRIBUTES, DELETE, SYNCHRONIZE, READ_DATA, WRITE_ACL, WRITE_DATA, READ_ATTRIBUTES, WRITE_NAMED_ATTRS, READ_ACL, DELETE_CHILD, WRITE_OWNER, EXECUTE, READ_NAMED_ATTRS]

This is somewhat related to #567 where we generated:

  "attributes": {
     "owner": "david",
     "group": "staff",
     "permissions": 764
  },

I believe that owner in such a case should be replaced by BUILTIN\Administrators (Alias). And that I should add an acl structure like:

  "attributes": {
     "owner": "david",
     "acl": [{
        "user": "NT AUTHORITY\\SYSTEM (Well-known group)",
        "access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
       },{
        "user": "BUILTIN\\Administrators (Alias)",
        "access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
       },{
        "user": "MYDOMAIN\\thundercat (User)",
        "access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
       }
     ]
  },

With attributes.acl defined in the mapping as a nested type. But this will have consequences for the number of documents indexed in Lucene in the end, since each nested object is indexed as its own Lucene document.
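To make the proposed shape concrete, the sketch below converts JDK `AclEntry` objects into the `user` / `access` structure shown above and estimates the resulting Lucene document count under a nested mapping (one document for the file plus one per ACL entry). `AclDocument`, `toAcl`, and `indexedDocCount` are hypothetical names, not FSCrawler code. The ALLOW/DENY type of each entry is not part of the proposed structure, so it is ignored here.

```java
import java.nio.file.attribute.AclEntry;
import java.nio.file.attribute.AclEntryPermission;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of FSCrawler.
final class AclDocument {

    // Builds the "acl" array of the proposal: one object per entry with
    // the principal name and the list of permission names.
    static List<Map<String, Object>> toAcl(List<AclEntry> entries) {
        List<Map<String, Object>> acl = new ArrayList<>();
        for (AclEntry entry : entries) {
            List<String> access = new ArrayList<>();
            for (AclEntryPermission permission : entry.permissions()) {
                access.add(permission.name());
            }
            // permissions() is an unordered set; sort for stable output.
            Collections.sort(access);

            Map<String, Object> item = new LinkedHashMap<>();
            item.put("user", entry.principal().getName());
            item.put("access", access);
            // entry.type() (ALLOW or DENY) is deliberately not mapped,
            // because the proposed structure has no field for it.
            acl.add(item);
        }
        return acl;
    }

    // With a nested mapping, each nested object is stored as a separate
    // Lucene document next to the parent document.
    static int indexedDocCount(List<AclEntry> entries) {
        return 1 + entries.size();
    }
}
```

For the three-entry example above, this gives four Lucene documents per file instead of one, which is the cost being weighed against the ability to query each user's access rights independently.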

Wondering what the representation should be. What do you think @konovalcev @hatemjaafar @Denis4545 ?

dadoonet commented 5 years ago

Ping @konovalcev @hatemjaafar @Denis4545. Any thoughts?

sn0opr commented 5 years ago

@dadoonet Thank you for this project, it's helping us index thousands of files for our client.

Did you add the ACL feature to FSCrawler? It would be really useful.

dadoonet commented 5 years ago

@sn0opr Nope. This issue is still open. Do you have any thoughts about my proposal here: https://github.com/dadoonet/fscrawler/issues/464#issuecomment-409589061 ?

sn0opr commented 5 years ago

Thank you @dadoonet for the quick reply. Actually, that's what we are looking for :) I think this structure lets us know the access types for each user.

"attributes": {
     "owner": "david",
     "acl": [{
        "user": "NT AUTHORITY\\SYSTEM (Well-known group)",
        "access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
       },{
        "user": "BUILTIN\\Administrators (Alias)",
        "access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
       },{
        "user": "MYDOMAIN\\thundercat (User)",
        "access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
       }
     ]
  },
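As an illustration of how a search front end might use this structure, the sketch below checks whether any of a user's principals has `READ_DATA` in a document's `acl` array. `AclFilter` and `canRead` are hypothetical names, the principal strings are assumed to match the indexed `user` values exactly, and DENY entries are not considered because the structure does not record them. In practice the same check would more likely be expressed as a query on the nested field rather than in application code.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of FSCrawler.
final class AclFilter {

    // Returns true when at least one ACL item belongs to one of the given
    // principals (the user or any of their groups) and grants READ_DATA.
    static boolean canRead(List<Map<String, Object>> acl, Set<String> principals) {
        for (Map<String, Object> entry : acl) {
            Object access = entry.get("access");
            if (principals.contains(entry.get("user"))
                    && access instanceof List
                    && ((List<?>) access).contains("READ_DATA")) {
                return true;
            }
        }
        return false;
    }
}
```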

Could you please push the branch you are working on for this feature? I would like to help with this part if possible. Thanks.

dadoonet commented 5 years ago

I pushed that here: https://github.com/dadoonet/fscrawler/tree/wip/acl

But nothing fancy yet... Just a few lines of code to see where this can go. Let me know if you want to take it from here and contribute or if I need to implement it (longer delay I'm afraid :) ).

sn0opr commented 5 years ago

@dadoonet Thank you for pushing the branch, it looks clear. We will work on it and open a pull request ASAP. Thanks again!

dadoonet commented 1 year ago

@sn0opr I'm wondering if you did anything on your side regarding this feature?

Thurdi commented 9 months ago

Howdy, I was wondering if there has been any progress on extracting and indexing Windows ACLs.

dadoonet commented 8 months ago

Not on my side sadly. Wanna work on it?

Thurdi commented 8 months ago

I have a WIP fork that I've been tinkering with here. I'll get it polished up as best as I can, but feel free to give it a look over and let me know your thoughts.

dadoonet commented 5 months ago

Hey @Thurdi

Sorry for the delay. Would you like to create a proper branch for this in your fork and then send a draft PR so we can discuss it more easily?