elastic / connectors

Source code for all Elastic connectors, developed by the Search team at Elastic, and home of our Python connector development framework
https://www.elastic.co/guide/en/enterprise-search/master/index.html

Confluence connector: allow ingesting each space separately #2578

Open Gleb-Gadyatskiy opened 1 month ago

Gleb-Gadyatskiy commented 1 month ago

Problem Description

RE: https://github.com/elastic/connectors/issues/2574 Our Confluence has 1K spaces, and some of them contain invalid legacy documents. As a result, ingestion of all spaces fails and it is impossible to tell what was ingested and what was not. Ingesting each space into its own index one by one is not feasible either (1K spaces).

Proposed Solution

Allow ingesting space by space automatically (retrieving the list of spaces in a separate REST API call) and ignore/report which spaces cannot be ingested, but do not stop the process if ingestion of a space fails. As a result, the rest of the spaces will be ingested and all problematic ones will be reported, so I can exclude them later.

Alternatives

Allow the sync to continue ingesting other documents instead of stopping when ingestion of some documents fails.

seanstory commented 1 month ago

some of them have invalid legacy documents

@Gleb-Gadyatskiy can you share what type of errors you're hitting? Is it that the Confluence API calls we're making fail against some spaces, or that Elasticsearch ingestion fails because fields are of the wrong type, or something else?

seanstory commented 1 month ago

Also, have you looked at using Advanced Sync Rules to filter out the spaces you don't want to ingest? This should provide a more efficient sync than just crawling everything but ignoring errors, as it doesn't bother to retrieve or process any of the "bad" spaces.
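For context, a hedged illustration of what such a rule could look like, assuming the connector's advanced sync rules are expressed as a list of CQL queries (check the Confluence connector documentation for the exact schema):

```python
# Hedged illustration only: an advanced sync rule payload that excludes
# known-bad spaces via CQL. The {"query": ...} list shape and the exact
# CQL syntax should be verified against the Confluence connector docs.
EXCLUDE_BROKEN_SPACES_RULES = [
    {"query": "space not in (SPACEA, SPACEB) and type = page"}
]
```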

Gleb-Gadyatskiy commented 1 month ago

Error response from Confluence:

{"statusCode":500,"message":"","reason":"Internal Server Error"}

Error in Confluence logs:

NotFoundException: Cannot find content <page: 2022- Product Specific Implementations v.10>. Outdated version/old_draft/trashed? Please provide valid ContentId.: [SimpleMessage{key='null', args=[], translation='Cannot find content <page: 2022- Product Specific Implementations v.10>. Outdated version/old_draft/trashed? Please provide valid ContentId.'}]

I would rather not pick spaces individually because we have 1,000+ of them :) It would take me days to run ingestion for each one individually to find which should be excluded.

Gleb-Gadyatskiy commented 1 month ago

Another nice solution would be the ability to ingest documents one by one and skip the broken ones.

Gleb-Gadyatskiy commented 1 month ago

For now, I wrote a custom Python script that scans all spaces, reads all documents in each, and lists the spaces where a 500 error occurred. The script found 17 spaces with the error (out of 1K), which I excluded from ingestion.
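For reference, a minimal sketch of that kind of scan, assuming a Confluence Server/Data Center instance with the standard /rest/api/space and /rest/api/content endpoints and basic auth; the host, credentials, and the depth of the per-space check (this version only lists top-level content rather than reading every page body) are placeholders:

```python
# Sketch: scan every Confluence space and report spaces whose content
# listing returns HTTP 500, so they can be excluded from ingestion.
import requests

BASE_URL = "https://confluence.example.com"  # hypothetical host
AUTH = ("svc-user", "secret")                # replace with real credentials


def iter_spaces(limit=50):
    """Yield space keys, paging through /rest/api/space."""
    start = 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/rest/api/space",
            params={"start": start, "limit": limit},
            auth=AUTH,
        )
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:
            return
        for space in results:
            yield space["key"]
        start += limit


def space_is_broken(space_key):
    """Return True if listing content for the space fails with a 500."""
    resp = requests.get(
        f"{BASE_URL}/rest/api/content",
        params={"spaceKey": space_key, "limit": 25},
        auth=AUTH,
    )
    return resp.status_code == 500


if __name__ == "__main__":
    broken = [key for key in iter_spaces() if space_is_broken(key)]
    print("Spaces returning 500:", broken)
```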

seanstory commented 1 month ago

Gotcha. Yes, for page errors, it makes sense that we should be able to just move past single page problems.

If it's not just pages, but the space records themselves that return this error, I'm less inclined to just move past those, at least by default. I think a new config option may be called for, defaulting to "false"/"off"/"disabled", something like Skip outdated/trashed spaces?, which could catch errors with this type of messaging and skip the whole space.
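As a rough sketch of that idea (not a committed design), such an option could surface as a boolean entry in the connector's configuration; the key name, label, and field layout below are assumptions:

```python
# Hypothetical sketch of a connector configuration entry for the proposed
# option. Key name, label, and ordering are illustrative only.
def skip_broken_spaces_config_entry():
    return {
        "skip_broken_spaces": {
            "display": "toggle",
            "label": "Skip outdated/trashed spaces?",
            "order": 15,       # placement among existing options is arbitrary
            "type": "bool",
            "value": False,    # disabled by default, as proposed above
        }
    }
```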

praveen-elastic commented 1 week ago

Hi @seanstory, currently the logic is designed so that Confluence spaces are indexed in a parallel task alongside the tasks for pages, blogs, and attachments. In order to ingest documents space by space, we'll need to update the logic so that the connector fetches all spaces first and then starts parallel tasks per space. This might impact performance a bit. Let us know your thoughts.
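To make the trade-off concrete, here is a simplified sketch of the space-by-space approach with per-space error handling, so one failing space is reported but does not abort the sync; fetch_spaces and fetch_space_documents are hypothetical stand-ins for the connector's real Confluence client calls:

```python
# Sketch: fetch the space list first, then run per-space tasks and keep
# going when a single space fails. The fetch functions are placeholders.
import asyncio
import logging

logger = logging.getLogger(__name__)


async def fetch_spaces():
    # Placeholder: would call GET /rest/api/space with pagination.
    return ["ENG", "HR", "LEGACY"]


async def fetch_space_documents(space_key):
    # Placeholder: would yield pages, blogs, and attachments for the space.
    return [{"space": space_key, "id": "123"}]


async def sync_all_spaces(concurrency=4):
    spaces = await fetch_spaces()
    semaphore = asyncio.Semaphore(concurrency)
    failed = []

    async def sync_one(space_key):
        async with semaphore:
            try:
                docs = await fetch_space_documents(space_key)
                # Hand docs to the indexing pipeline here.
                logger.info("Ingested %d docs from %s", len(docs), space_key)
            except Exception:
                # Report and continue instead of failing the whole sync.
                logger.exception("Skipping space %s after error", space_key)
                failed.append(space_key)

    await asyncio.gather(*(sync_one(key) for key in spaces))
    return failed


if __name__ == "__main__":
    print("Failed spaces:", asyncio.run(sync_all_spaces()))
```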

seanstory commented 1 week ago

Good point. I wouldn't expect it to impact performance that much, but it would be a decent bit of refactoring, I'd expect. Do you have an alternate suggestion on how to identify and move past these sorts of errors in an efficient way?

praveen-elastic commented 1 week ago

Not as of now; I'll look into it and get back to you.