Gleb-Gadyatskiy opened 1 month ago
> some of them have invalid legacy documents
@Gleb-Gadyatskiy can you share what type of errors you're hitting? Is it that the Confluence API calls we're making fail against some spaces, or that Elasticsearch ingestion fails because fields are of the wrong type, or something else?
Also, have you looked at using Advanced Sync Rules to filter out the spaces you don't want to ingest? This should provide a more efficient sync than just crawling everything but ignoring errors, as it doesn't bother to retrieve or process any of the "bad" spaces.
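For reference, advanced sync rules for the Confluence connector are expressed as CQL queries; the exact schema is described in the Elastic connectors documentation and may differ by version, but a rough, hypothetical rule set restricting the sync to a few known-good spaces might look like:

```json
[
  {
    "query": "space in (DEV, SALES, MARKETING)"
  }
]
```

With 1,000+ spaces this only helps if the bad spaces are known up front, which is the crux of the issue below.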
Error response from Confluence:

```json
{"statusCode": 500, "message": "", "reason": "Internal Server Error"}
```

Error in Confluence logs:

```
NotFoundException: Cannot find content <page: 2022- Product Specific Implementations v.10>. Outdated version/old_draft/trashed? Please provide valid ContentId.: [SimpleMessage{key='null', args=[], translation='Cannot find content <page: 2022- Product Specific Implementations v.10>. Outdated version/old_draft/trashed? Please provide valid ContentId.'}]
```
I'd rather not pick Spaces individually because we have 1,000+ of them :) It takes days to run ingestion for each one individually just to find which ones should be excluded.
Another nice solution would be the ability to ingest documents one by one and skip the broken ones.
For now, I wrote a custom Python script that scans all spaces, reads all the documents in each, and lists where a 500 error occurred. The script found 17 spaces with the error (out of ~1,000), which I then excluded from ingestion.
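The core of such a scanning script can be sketched as below. The host and credentials are hypothetical; `/rest/api/content?spaceKey=...` is the standard Confluence REST endpoint for listing a space's content (the space keys themselves would come from the paginated `/rest/api/space` endpoint, omitted here). The check itself is injected as a callable so the classification logic is testable without a live server.

```python
import base64
from urllib.request import Request, urlopen
from urllib.error import HTTPError

BASE = "https://confluence.example.com"               # hypothetical base URL
TOKEN = base64.b64encode(b"user:api-token").decode()  # assumption: basic auth

def content_status(space_key):
    """Return the HTTP status of a content listing for one space."""
    url = f"{BASE}/rest/api/content?spaceKey={space_key}&limit=100"
    req = Request(url, headers={"Authorization": f"Basic {TOKEN}"})
    try:
        with urlopen(req) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # e.g. 500 for the broken legacy spaces

def find_broken_spaces(space_keys, status_of=content_status):
    """Return the keys whose content listing fails with HTTP 500."""
    return [key for key in space_keys if status_of(key) == 500]
```

The resulting list is exactly what gets pasted into the connector's space-exclusion configuration.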
Gotcha. Yes, for page errors, it makes sense that we should be able to just move past single page problems.
If it's not just pages, but the space records themselves that return this error, I'm less inclined to just move past those, at least by default. I think a new config may be called for, something like "Skip outdated/trashed Spaces", defaulting to "false"/"off"/"disabled", which could catch errors with this type of message and skip the whole space.
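A minimal sketch of how such a flag might gate the skip, assuming the connector surfaces the Confluence error text. The flag name and function are hypothetical, not existing connector config; the pattern matches the `NotFoundException` message quoted earlier in this thread.

```python
import re

# Matches the message Confluence logs for stale/trashed content.
BROKEN_CONTENT_RE = re.compile(
    r"Cannot find content .*Outdated version/old_draft/trashed"
)

def should_skip_space(error_message: str, skip_enabled: bool) -> bool:
    """Skip the space only when the (hypothetical) flag is on
    and the error matches the known stale-content signature."""
    return skip_enabled and bool(BROKEN_CONTENT_RE.search(error_message))
```

Keeping the flag off by default preserves today's fail-fast behavior for anyone who wants syncs to halt on unexpected errors.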
Hi @seanstory, currently the logic is designed so that Confluence spaces are indexed in a parallel task alongside the tasks for pages, blogs, and attachments. In order to ingest documents space by space, we'll need to update the logic so that the connector fetches all spaces first and then starts parallel tasks per space. This might impact performance a bit. Let us know your thoughts.
Good point. I wouldn't expect it to impact performance that much, but it would be a decent bit of refactoring, I'd expect. Do you have an alternative suggestion for how to identify and move past these sorts of errors in an efficient way?
Not as of now, I'll look into it and get back to you
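The fetch-spaces-first approach discussed above can be sketched with `asyncio` (the connectors framework is async Python; the function names here are stand-ins, not the real connector API). The key point is `return_exceptions=True`, which lets one space's failure be collected rather than aborting the whole sync:

```python
import asyncio

async def fetch_spaces():
    """Stand-in for the connector's space-listing call (hypothetical)."""
    return ["DEV", "SALES", "LEGACY"]

async def ingest_space(key):
    """Stand-in for ingesting one space's pages, blogs, and attachments."""
    if key == "LEGACY":
        # Simulate the HTTP 500 raised for a space with stale content.
        raise RuntimeError(f"HTTP 500 while reading space {key}")
    return key

async def ingest_all():
    """Fetch all spaces first, then ingest them concurrently,
    collecting per-space failures instead of failing the sync."""
    spaces = await fetch_spaces()
    results = await asyncio.gather(
        *(ingest_space(k) for k in spaces), return_exceptions=True
    )
    failed = [k for k, r in zip(spaces, results) if isinstance(r, Exception)]
    ok = [r for r in results if not isinstance(r, Exception)]
    return ok, failed

ok, failed = asyncio.run(ingest_all())
```

At the end of the sync, `failed` is exactly the report the issue asks for: the spaces to investigate or exclude.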
Problem Description
RE: https://github.com/elastic/connectors/issues/2574 Our Confluence has 1,000+ spaces, and some of them have invalid legacy documents. As a result, ingestion of all Spaces fails, and it is impossible to tell what was ingested and what was not. Ingesting each Space into its own index one by one is not feasible either (1,000+ Spaces).
Proposed Solution
Allow ingesting Space by space automatically (retrieve the list of spaces in a separate REST API call) and report which spaces cannot be ingested, but do not stop the process if a Space's ingestion fails. As a result, all the good Spaces will be ingested and all the problematic ones reported, so I can exclude them later.
Alternatives
Allow ingestion to continue with the remaining documents instead of stopping when ingestion of some documents fails.