Open Denis4545 opened 6 years ago
Because a txt file has no metadata embedded in it.
@dadoonet Hello David. +1 Could you explain in more detail, please? I thought that there were some basic attributes for any file in any file system, like "date created", "date modified", "size", etc., that should be extracted by fscrawler and loaded into elasticsearch. The context is that we are working on a global search inside our company and chose your crawler (thanks for this project!) to create indexes for all our FTP servers. We have crawled all the FTP servers and are now working on the front-end of the search system. There should be some filters for users (like in Google) by file extension, date created, date modified, etc. to get filtered results. But we ran into the problem that there is no "date created" information for at least .txt, .log, and .html files; the crawler doesn't extract these fields from these files. This makes our front-end filter partial, since it will not work for all files. Do you know why this happens with the "date created" field and how we can fix this issue? Thanks.
Reopening to think about it.
@dadoonet Hello David. Thanks for your time. I will wait for your thoughts, because this is important to us and we cannot move forward without understanding this issue. Thanks again.
Maybe I should try to extract more data from the filesystem and add it here:
And then if something is provided when the Tika extraction is done, overwrite the "FS" value. Like what we do in: https://github.com/dadoonet/fscrawler/blob/master/src/main/java/fr/pilato/elasticsearch/crawler/fs/tika/TikaDocParser.java#L108-L110
That said, I'm pretty sure that some data will depend on the FS implementation. So it might be only a few things that we can capture... Or it will mean detecting the FS first, then extracting more or less data depending on it, which will take more time to implement IMO.
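A minimal sketch of what "extract more data coming from the filesystem" could look like, using the standard java.nio API (the class and method names here are hypothetical, not fscrawler's actual code; the idea would be to fill these values first and let Tika-extracted values overwrite them when available, as in the linked code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;

public class FsMetadataSketch {
    // Filesystem-level creation time exists for any file type, including
    // .txt/.log/.html files that carry no embedded metadata for Tika.
    // Note: on filesystems without a creation time (e.g. some Linux FS),
    // creationTime() may simply mirror lastModifiedTime().
    static FileTime creationTime(Path path) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
        return attrs.creationTime();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("fscrawler-demo", ".txt");
        try {
            System.out.println("created:  " + creationTime(tmp));
            System.out.println("modified: " + Files.getLastModifiedTime(tmp));
            System.out.println("size:     " + Files.size(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```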
Can you list exactly which fields you need for your project as an MVP?
@dadoonet Hello David. Since you are asking, I would like to request the ability to extract the ACL list from every file too. That would be a very cool feature, allowing us to enforce a security level in the front-end so that a user only gets results for files to which he has at least read-only access in the file system. As all our FTP servers run on Windows Server, I'm talking about the Windows NTFS ACL list, like this:
If you add "Date Created" and the ACL list to the index, that would be incredible support from your side. If adding the ACL list is not possible, then the only field we need is "Date Created". All other fields are OK for us. Thanks!
Maybe I can use this: https://myshittycode.com/2013/09/10/reading-directoryfiles-acl-directly-from-java/
I need to give it a try when I have time.
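For what it's worth, the JDK already exposes this via AclFileAttributeView, so reading the ACL does not need any Windows-specific library. A minimal sketch (the class name is hypothetical; on filesystems that do not expose an ACL view, such as most Linux/macOS mounts, the view is simply null):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.AclEntry;
import java.nio.file.attribute.AclFileAttributeView;
import java.util.Collections;
import java.util.List;

public class AclSketch {
    // Returns the ACL entries for a file, or an empty list when the
    // filesystem does not expose an ACL view (e.g. non-NTFS mounts).
    static List<AclEntry> readAcl(Path path) throws IOException {
        AclFileAttributeView view =
                Files.getFileAttributeView(path, AclFileAttributeView.class);
        return view == null ? Collections.emptyList() : view.getAcl();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("acl-demo", ".txt");
        try {
            // On Windows/NTFS each entry carries a principal (user/group)
            // and a set of AclEntryPermission values like READ_DATA.
            for (AclEntry entry : readAcl(tmp)) {
                System.out.println(entry.principal() + " " + entry.permissions());
            }
        } finally {
            Files.delete(tmp);
        }
    }
}
```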
@dadoonet Hello David. That's perfect. I will wait for your response. And could you implement the "Date created" field first, since it is more important for us right now? :) Thanks again!
@dadoonet Happy new year! First, I want to thank you for all the work you have done. I need to get the ACL from Windows files with fscrawler; I hope there is news on this issue. Thanks.
So I'm working on that. Sadly I'm a bit blind as I'm working on macOS and not on Windows :( I'll try to see if I can run tests on Windows afterwards though.
Anyway, what could be a good representation of the data? Let's take as an example that we have:
Owner:
BUILTIN\Administrators (Alias)
ACL:
NT AUTHORITY\SYSTEM (Well-known group) [APPEND_DATA, WRITE_ATTRIBUTES, DELETE, SYNCHRONIZE, READ_DATA, WRITE_ACL, WRITE_DATA, READ_ATTRIBUTES, WRITE_NAMED_ATTRS, READ_ACL, DELETE_CHILD, WRITE_OWNER, EXECUTE, READ_NAMED_ATTRS]
BUILTIN\Administrators (Alias) [APPEND_DATA, WRITE_ATTRIBUTES, DELETE, SYNCHRONIZE, READ_DATA, WRITE_ACL, WRITE_DATA, READ_ATTRIBUTES, WRITE_NAMED_ATTRS, READ_ACL, DELETE_CHILD, WRITE_OWNER, EXECUTE, READ_NAMED_ATTRS]
MYDOMAIN\thundercat (User) [APPEND_DATA, WRITE_ATTRIBUTES, DELETE, SYNCHRONIZE, READ_DATA, WRITE_ACL, WRITE_DATA, READ_ATTRIBUTES, WRITE_NAMED_ATTRS, READ_ACL, DELETE_CHILD, WRITE_OWNER, EXECUTE, READ_NAMED_ATTRS]
This is somewhat related to #567 where we generated:
"attributes": {
"owner": "david",
"group": "staff",
"permissions": 764
},
I believe that owner in such a case should be replaced by BUILTIN\Administrators (Alias).
And that I should add an acl structure like:
"attributes": {
"owner": "david",
"acl": [{
"user": "NT AUTHORITY\\SYSTEM (Well-known group)",
"access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
},{
"user": "BUILTIN\\Administrators (Alias)",
"access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
},{
"user": "MYDOMAIN\\thundercat (User)",
"access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
}
]
},
With attributes.acl defined in the mapping as a nested type.
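A minimal sketch of what that mapping could look like (field names taken from the structure above; the exact syntax depends on the Elasticsearch version, this assumes a typeless mapping):

```json
{
  "mappings": {
    "properties": {
      "attributes": {
        "properties": {
          "owner": { "type": "keyword" },
          "acl": {
            "type": "nested",
            "properties": {
              "user": { "type": "keyword" },
              "access": { "type": "keyword" }
            }
          }
        }
      }
    }
  }
}
```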
But this will have consequences in the number of documents indexed in Lucene at the end.
Wondering what would be the representation. What do you think @konovalcev @hatemjaafar @Denis4545 ?
Ping @konovalcev @hatemjaafar @Denis4545. Any thoughts?
@dadoonet Thank you for this project, it's helping us index thousands of files for our client.
Did you add the ACL feature to FSCrawler? It would be really useful.
@sn0opr Nope. This issue is still open. Do you have any thoughts on my proposals here: https://github.com/dadoonet/fscrawler/issues/464#issuecomment-409589061 ?
Thank you @dadoonet for the quick reply. Actually that's exactly what we are looking for :) I think this structure lets us know the access types for each user.
"attributes": {
"owner": "david",
"acl": [{
"user": "NT AUTHORITY\\SYSTEM (Well-known group)",
"access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
},{
"user": "BUILTIN\\Administrators (Alias)",
"access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
},{
"user": "MYDOMAIN\\thundercat (User)",
"access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
}
]
},
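As a side note, with attributes.acl mapped as nested, restricting search results to files a given user can read could look like this (hypothetical index name; each nested entry must match both the user and the permission):

```json
POST /fscrawler_index/_search
{
  "query": {
    "nested": {
      "path": "attributes.acl",
      "query": {
        "bool": {
          "must": [
            { "term": { "attributes.acl.user": "MYDOMAIN\\thundercat (User)" } },
            { "term": { "attributes.acl.access": "READ_DATA" } }
          ]
        }
      }
    }
  }
}
```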
Could you please push the branch you are working on for this feature? I would like to help with this part if possible. Thanks!
I pushed that here: https://github.com/dadoonet/fscrawler/tree/wip/acl
But nothing fancy yet... Just a few lines of code to see where this can go. Let me know if you want to take it from here and contribute, or if I need to implement it (longer delay, I'm afraid :) ).
@dadoonet thank you for pushing the branch, it looks clear. We will work on it and open a pull request ASAP. Thanks again!
@sn0opr I'm wondering if you did anything on your side regarding this feature?
Howdy, I was wondering if there has been any progress on extracting and indexing Windows ACLs.
Not on my side sadly. Wanna work on it?
I have a WIP fork that I've been tinkering with here. I'll get it polished up as best as I can, but feel free to give it a look over and let me know your thoughts.
Hey @Thurdi
Sorry for the delay. Would you like to create a proper branch for this in your fork and then send a draft PR so we can more easily discuss on that?
Hello. We are crawling FTP servers and I've noticed that basic metadata objects (meta.created, meta.metadata_date, meta.author) aren't extracted from txt, log, csv, html, and htm files. Could you please explain why these basic objects aren't extracted from the mentioned file types? Please see the example of text files below: