Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.44k stars, 580 forks

feat/migrate sharepoint src #3314

Open rbiseck3 opened 5 days ago

rbiseck3 commented 5 days ago

Description

Migrate the SharePoint connector to v2 and, in the process, refactor the majority of the connector. It now pulls in much more content from the SDK at index time, including permissions data if the parameters are passed in. HTML content generated from the SitePage is limited to the HTML in the CanvasContent1 and LayoutWebpartsContent fields returned by the SDK.
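As a rough illustration of the SitePage handling described above, here is a minimal sketch. It assumes the page's properties arrive as a plain mapping of field name to string; the function name `site_page_html` and that mapping shape are hypothetical and not taken from the connector. Only the two field names come from the description.

```python
from typing import Mapping, Optional

# Field names come from the PR description; everything else is hypothetical.
HTML_FIELDS = ("CanvasContent1", "LayoutWebpartsContent")


def site_page_html(properties: Mapping[str, Optional[str]]) -> str:
    """Join the HTML fragments found in the two site page fields."""
    fragments = []
    for field in HTML_FIELDS:
        value = properties.get(field)
        if value:  # skip fields that are missing, None, or empty
            fragments.append(value)
    return "\n".join(fragments)
```

Any other property on the page (a title, for example) is ignored, which mirrors the "limited to" wording in the description.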

Some TODOs were left in for future iterations. Currently only document and site page content is pulled in from SharePoint, but SharePoint has other types of content as well, such as lists. A note was left in the code about supporting those other types.

potter-potter commented 5 days ago

@rbiseck3 What is going on with the expected-structured-output? Lots of documents deleted and some name changes? But no new documents to replace them?

ahmetmeleq commented 4 hours ago

I couldn't see any permission data:

            "id": permission.id,
            "roles": list(permission.roles),
            "share_id": permission.share_id,
            "has_password": permission.has_password,
            "link": permission.link.to_json(),
            "granted_to_identities": permission.granted_to_identities.to_json(),
            "granted_to": permission.granted_to.to_json(),
            "granted_to_v2": permission.granted_to_v2.to_json(),
            "granted_to_identities_v2": permission.granted_to_identities_v2.to_json(),
            "invitation": permission.invitation.to_json(),

within the working directory for sharepoint-with-permissions.sh when I removed metadata.data_source.permissions_data from --metadata-exclude.
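One way to check for this without reading the output by eye is sketched below. It assumes the output has already been loaded into a list of element dictionaries, each with an optional nested `metadata.data_source.permissions_data` entry as named above; the helper name `has_permissions_data` is hypothetical.

```python
from typing import Any, Iterable, Mapping


def has_permissions_data(elements: Iterable[Mapping[str, Any]]) -> bool:
    """Return True if any element carries non-empty permissions data."""
    for element in elements:
        # Tolerate elements where metadata or data_source is missing or None.
        data_source = (element.get("metadata") or {}).get("data_source") or {}
        if data_source.get("permissions_data"):
            return True
    return False
```

Running this over each output file (loaded with `json.load`) would tell apart "the field was excluded from the output" from "the field was never populated".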

Is this expected? I was able to see the following line in my logs: 2024-07-03 12:45:05,385 MainProcess DEBUG Enriching permissions on files

Edit 1: While debugging, I see that I cannot obtain the permissions client.

Edit 2: Narrowed it down to permissions_config=None, so it might be a CLI / config generation problem.
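To make the suspected failure mode concrete, here is a minimal sketch of how an optional config that defaults to None can leave permissions enrichment with no client to work with. The class names `PermissionsConfig` and `AccessConfig`, their fields, and the function `describe_permissions_state` are all hypothetical stand-ins, not the connector's actual classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PermissionsConfig:  # hypothetical stand-in
    client_id: str
    client_credential: str
    tenant: str


@dataclass
class AccessConfig:  # hypothetical stand-in
    # If the CLI never populates this field, it silently stays None.
    permissions_config: Optional[PermissionsConfig] = None


def describe_permissions_state(config: AccessConfig) -> str:
    """Report whether a permissions client could be built from this config."""
    if config.permissions_config is None:
        return "permissions_config is None: no permissions client can be built"
    tenant = config.permissions_config.tenant
    return f"permissions client can be built for tenant {tenant}"
```

Under these assumptions, logging the result of such a check at startup would surface a missing config before the enrichment step runs, instead of leaving only the "Enriching permissions" debug line in the logs.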