elastic / connectors

Source code for all Elastic connectors, developed by the Search team at Elastic, and home of our Python connector development framework
https://www.elastic.co/guide/en/enterprise-search/master/index.html

Access control sync to main connector index #2303

Closed: sjors101 closed this issue 6 months ago

sjors101 commented 6 months ago

Problem Description

Access control fields of objects change a lot, which requires us to crawl frequently to keep up with the latest changes. However, full content syncs are quite costly.

When we run an Access control sync, we would like the option to also update the access control fields of the documents in the main index search-<INDEX-NAME>. Currently the framework is designed so that an Access control sync only ingests the DLS profile into the ACL index .search-acl-filter-<INDEX-NAME>.

Proposed Solution

It would be nice to have an option in the get_access_control function in the BaseDataSource to also ingest into the main index (search-<INDEX-NAME>) of a connector.
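
To illustrate what such an option might emit, here is a minimal sketch that builds Elasticsearch bulk partial-update actions touching only the access control field of existing documents. It is not part of the connectors framework: the helper names `build_acl_update_actions` and `to_ndjson`, the field name `_allow_access_control`, and the example index name are assumptions made for illustration only.

```python
import json

# Assumed name of the access control field on documents in the main index.
ACL_FIELD = "_allow_access_control"


def build_acl_update_actions(index_name, acl_by_doc_id):
    """Build bulk-API lines that partially update only the ACL field.

    acl_by_doc_id maps a document id to the identities allowed to read it.
    Each document produces two lines: an "update" action and a "doc" payload.
    """
    lines = []
    for doc_id, identities in sorted(acl_by_doc_id.items()):
        lines.append({"update": {"_index": index_name, "_id": doc_id}})
        # Deduplicate and sort so repeated runs produce identical payloads.
        lines.append({"doc": {ACL_FIELD: sorted(set(identities))}})
    return lines


def to_ndjson(lines):
    """Serialize bulk lines as newline-delimited JSON with a trailing newline."""
    return "\n".join(json.dumps(line) for line in lines) + "\n"


if __name__ == "__main__":
    actions = build_acl_update_actions(
        "search-sharepoint",  # hypothetical index name
        {"doc-1": ["user:bob", "user:alice", "user:alice"]},
    )
    print(to_ndjson(actions), end="")
```

Because the payload is a partial update, the document body already in the index would be left untouched; only the permissions field would be overwritten.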

Alternatives

Add a variable to switch between two paths in the Full content sync: one for the full content, and one for just the access control fields.
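
A rough sketch of this alternative, assuming a hypothetical `acl_only` flag and a hypothetical `fetch_body` callable standing in for the expensive content download. None of these names come from the framework, and `_allow_access_control` is an assumed field name.

```python
def sync_documents(items, fetch_body, acl_only=False):
    """Yield one document per item, optionally skipping the content fetch.

    items      -- iterable of dicts with "id" and "allowed" keys (hypothetical shape)
    fetch_body -- callable returning the content for an item id (the costly step)
    acl_only   -- when True, yield only the id and access control field
    """
    for item in items:
        doc = {
            "_id": item["id"],
            "_allow_access_control": sorted(item["allowed"]),
        }
        if not acl_only:
            # The expensive path: only taken for a regular full content sync.
            doc["body"] = fetch_body(item["id"])
        yield doc


if __name__ == "__main__":
    items = [{"id": "1", "allowed": ["user:bob", "user:alice"]}]
    for doc in sync_documents(items, lambda i: "content of " + i, acl_only=True):
        print(doc)
```

Any saving in this sketch comes from never calling `fetch_body` on the access-control path; the connector would still have to enumerate every object.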

danajuratoni commented 6 months ago

@sjors101 thanks for filing the enhancement!

Which data source are you looking at? You might want to consider running an incremental sync on your content, which would update only the changes that occurred since the last content sync.

sjors101 commented 6 months ago

Mainly the SharePoint Server data source. I wrote some code that ingests the allow_access_control fields (I am considering pushing it back to the community once it is mature enough). I had a look at incremental sync; however, SharePoint on-prem does not seem to have a timestamp or other indicator for when the access control of an object changes.

seanstory commented 6 months ago

I'd definitely be curious to see the changes you're making, even if they're still in-progress. My initial reaction is that this wouldn't be possible to do in most cases, because Full syncs and Access Control syncs are often fetching two disparate sets of data from the 3rd party. In order to make Access Control syncs update all the documents impacted by permissions, you'd be significantly increasing the scope of your Access Control sync, and then you're back in the place you started, where full content syncs are quite costly.

I'm inclined to agree with Dana that the "right" solution to this problem is Incremental syncs, which, as you say, are not in a perfect spot with SharePoint Server today. But that's where I'd direct investment to solve this problem.

However, if you've spotted an approach I'm not thinking of, I'd love to understand better. Don't hesitate to put up a Draft PR. :)

sjors101 commented 6 months ago

My project leans heavily on SharePoint Server as a data source, and I made quite a few changes to make sure the connector fits our needs. I will try to push some stuff your way soon :)

The main bottleneck is SharePoint not tracking a modified timestamp when permissions are changed. It seems this can be enabled in an audit log, but that's not something I would like to use. I guess I will see if I can build something like a lightweight full sync to get just the permissions synced up. Thanks Sean and Dana for your responses.
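
One way such a lightweight permissions-only pass could look, sketched under assumptions: since the source offers no change timestamp, it compares the permissions read from SharePoint with those already stored in the index and keeps only the documents that differ. The function name `changed_acl` and the plain-dict inputs are hypothetical and not taken from the connector.

```python
def changed_acl(source_acl, indexed_acl):
    """Return {doc_id: identities} for documents whose permissions changed.

    source_acl  -- permissions currently reported by the data source
    indexed_acl -- permissions currently stored in the main index
    Documents missing from the index are skipped; a content sync would add them.
    """
    changed = {}
    for doc_id, identities in source_acl.items():
        if doc_id not in indexed_acl:
            continue
        # Compare as sets so ordering and duplicates do not trigger updates.
        if set(identities) != set(indexed_acl[doc_id]):
            changed[doc_id] = sorted(set(identities))
    return changed


if __name__ == "__main__":
    source = {"a": ["user:1", "user:2"], "b": ["user:3"], "c": ["user:4"]}
    indexed = {"a": ["user:2", "user:1"], "b": ["user:9"]}
    print(changed_acl(source, indexed))
```

This still reads the permissions of every object from the source, so it only reduces write volume and content downloads, not the enumeration cost raised earlier in the thread.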