elastic / connectors

Source code for all Elastic connectors, developed by the Search team at Elastic, and home of our Python connector development framework
https://www.elastic.co/guide/en/enterprise-search/master/index.html

[Sharepoint] non-path-based site collections break connector #2112

Closed seanstory closed 1 month ago

seanstory commented 7 months ago

Describe the bug

In Sharepoint (Online or Server), sites can be logically grouped in "Site Collections," where there is a root site and then a number of child sites. Thus far, we've only seen setups where there is a single Site Collection at <tenant-name>.sharepoint.com/sites/. This seems to have led us to assume that the tenant-name is tightly coupled to the hostname for all sites/site collections. However, from https://learn.microsoft.com/en-us/sharepoint/sites/sites-and-site-collections-overview:

Microsoft recommends that customers use path-based site collections as they're easier to manage through PowerShell and Central Administration. However, if customers need to host multiple site collections, with each site collection having its own DNS name, they can opt to deploy host-named site collections. Host-named site collections require additional administration to ensure each site collection is correctly registered with DNS names and Service Principal Names (SPNs).

While it may not be recommended, it does seem that it's valid to have non-path-based site collections, which can mean that the hostname for a given site collection is NOT prefixed with the tenant-name. That makes checks like these behave incorrectly.

https://github.com/elastic/connectors/blob/9fcdf5e308c9657e092116a6a6568002c32d6a47/connectors/sources/sharepoint_online.py#L1020-L1025

Further, it can mean that we attempt to sync (and fail to sync) some sites on site collections we didn't intend to.

For example, let's say a user has one tenant, acmecorp. They have two site collections, hosted at acmecorp.sharepoint.com and acmecorpb.sharepoint.com.

In our connector, if you configured tenant_name: acmecorp and sites: foo and enumerate_all_sites: false, we would successfully fetch acmecorp.sharepoint.com/sites/foo, but then we'd also try to fetch acmecorpb.sharepoint.com/sites/foo, decide that we were trying to fetch something from another tenant, then fail the sync. And there would be no way for us to successfully fetch only acmecorpb.sharepoint.com/sites/foo because its hostname will never align with its tenant name.
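
To make the failure mode concrete, here is a minimal sketch of the scenario above. It is not the connector's actual code: `select_sites` and `hostname_matches_tenant` are hypothetical helpers, and the assumption that the check compares the hostname against `<tenant_name>.sharepoint.com` is a simplification of the linked lines.

```python
from urllib.parse import urlparse


def select_sites(all_site_urls, wanted_names):
    """Hypothetical selection step: match sites by their last path segment only,
    ignoring which hostname (site collection) serves them."""
    selected = []
    for url in all_site_urls:
        # "/sites/foo" -> "foo"
        name = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
        if name in wanted_names:
            selected.append(url)
    return selected


def hostname_matches_tenant(site_url, tenant_name):
    """Hypothetical check: assumes every site lives at <tenant_name>.sharepoint.com."""
    hostname = urlparse(site_url).hostname or ""
    return hostname.lower() == f"{tenant_name.lower()}.sharepoint.com"


# Two site collections in the same tenant, each with a site at /sites/foo.
site_urls = [
    "https://acmecorp.sharepoint.com/sites/foo",
    "https://acmecorpb.sharepoint.com/sites/foo",
]

# Configuration from the example: tenant_name: acmecorp, sites: foo
selected = select_sites(site_urls, {"foo"})
results = {url: hostname_matches_tenant(url, "acmecorp") for url in selected}

for url, ok in results.items():
    status = "ok" if ok else "rejected as belonging to another tenant"
    print(f"{url}: {status}")
```

Under these assumptions both URLs are selected because they share the relative path `foo`, and the second one then fails the hostname check even though it belongs to the same tenant.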

To Reproduce

  1. have a sharepoint online setup that uses non-path-based site collections
  2. ensure that two site collections have sites with the same relative paths
  3. try to run a sync where that relative path is specified as the site to be synced
  4. see the sync fail, with a misleading error saying that the tenant name is wrong.

Expected behavior

Environment

8.13.0-SNAPSHOT and before

Additional context

slack thread: https://elastic.slack.com/archives/C7LLL50CA/p1705426484605769

artem-shelkovnikov commented 7 months ago

Oh wow, amazing job on triaging, Sean!

danajuratoni commented 2 months ago

@moxarth-elastic It would be good to spawn off a sub-ticket that focuses only on Server when you pick this up. Thanks!

moxarth-elastic commented 1 month ago

Hi Team, the PR adding support for host-named site collections in the Sharepoint Server connector is merged. For Sharepoint Online, we researched how to create host-named site collections. According to the documentation, Sharepoint in Microsoft 365 does not support them, and there seems to be no concept of a host-named site collection in SPO. Hence, closing this ticket with the fix in the Sharepoint Server connector.