gcasar opened this issue 6 months ago
Hi @gcasar, thanks for filing.
> 4.) [Something] eats the propagated exception. Step 4 is the problem in this connector's case.
Agreed, this looks like the problem. We try to balance between the ideologies of:
While we shoot for a reasonable balance, it looks like you've definitely identified a few instances here where we've missed the mark. Unfortunately for us, and as you've already identified, fixing this needs to be done on a case-by-case (and connector-by-connector) basis. Looks like you've identified at least two instances here:
On top of those, looks like the underlying bugs that cause this to be a problem are:
If the above sounds right to you, we can make those 4 into discrete subtasks, and work to prioritize them.
Thank you for the fast response; the balancing makes sense in this context and seems like a smart approach 😄
I see how splitting them up would help; in that case, I'd add one more:
I'll follow up on this tomorrow morning. If you'd like me to split the issue up myself - I'm happy to do that!
> see how splitting them up would help
Excellent, I'll do that now.
> I'd add one more ... if you'd like me to split the issue up myself - I'm happy to do that!
I've got enough info to create the first 4 subtasks I mentioned. If you'd create a separate issue for the Confluence rate limiting (definitely something we should handle) and provide relevant logs/traces/links, that would be excellent. Thanks!
I was not able to replicate the Confluence 429 that we encountered previously, so I can't get a proper bug report back to you yet. I don't want to distract from the others in that list until I can do a more diligent report on my end and reproduce it. Currently I can highlight the discrepancy between Jira and Confluence: Jira handles rate limiting nicely, while Confluence retries a few times without waiting. Sadly, there are no docs from Atlassian around this, but it might be fair to assume that both systems behave in a similar way.
This is the code that I'm referring to: https://github.com/elastic/connectors/blame/8c8d4a3f7775c6edbb598a81a2b67348a7244f3a/connectors/sources/jira.py#L168-L182
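For context, the discrepancy boils down to whether a 429 response's Retry-After header is honored before retrying. A minimal aiohttp sketch of that pattern — illustrative only, not the actual connector code; `get_with_rate_limit` and `DEFAULT_RETRY_SECONDS` are made-up names:

```python
import asyncio

import aiohttp

DEFAULT_RETRY_SECONDS = 5  # hypothetical fallback when no Retry-After header is sent


async def get_with_rate_limit(session: aiohttp.ClientSession, url: str, retries: int = 3):
    """On 429, sleep for the server-advertised interval before retrying."""
    for _ in range(retries):
        async with session.get(url) as response:
            if response.status == 429:
                # Honoring Retry-After is what distinguishes graceful handling
                # from retrying a few times without waiting.
                delay = int(response.headers.get("Retry-After", DEFAULT_RETRY_SECONDS))
                await asyncio.sleep(delay)
                continue
            response.raise_for_status()
            return await response.json()
    raise RuntimeError(f"Still rate limited after {retries} attempts: {url}")
```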
Ok, no worries. I've just made a pretty minimal issue in https://github.com/elastic/connectors/issues/2403. Please add any details/stack traces/logs etc. to that if you hit it again. Otherwise, we'll work with what we have.
Bug Description
After an internal connector error prematurely ends a Full sync, the Sync job proceeds, reconciles the index by deleting documents previously in the system, and is marked as Sync complete.
To Reproduce
Steps to reproduce the behavior:
This behavior is hard to reproduce reliably because it is 1) connector specific and 2) reliant on a third-party error. There are at least two bugs that surface this problem.
Please note that the example I'm giving below is not the only place this happens.
Expected behavior
When an exception occurs that forces a Full sync to end prematurely, the Sync job should be marked as Failed and the existing documents in the index should not get deleted.
As a user of a connector, I expect this problem to be handled in a shared way, so that uncaught exceptions in connectors don't get silently discarded and give a false sense of completed syncs.
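To make "handled in a shared way" concrete, here is a rough sketch of what such a shared guard could look like — every name in it (`run_full_sync`, `mark_sync_failed`, `delete_docs_not_in`, and so on) is hypothetical, not the framework's actual API:

```python
class SyncJobFailed(Exception):
    """Raised when document extraction aborts before completing."""


async def run_full_sync(connector, index):
    """Hypothetical shared wrapper: reconciliation runs only after a complete crawl."""
    seen_ids = set()
    try:
        async for doc in connector.get_docs():
            await index.upsert(doc)
            seen_ids.add(doc["_id"])
    except Exception as exc:
        # Any uncaught connector error marks the job Failed and skips deletion,
        # so a partial crawl can never wipe previously synced documents.
        await index.mark_sync_failed(reason=str(exc))
        raise SyncJobFailed(str(exc)) from exc
    # Only a fully successful crawl may delete documents missing from it.
    await index.delete_docs_not_in(seen_ids)
    await index.mark_sync_complete()
```

The point of this shape is that the decision to delete lives above the individual connectors, so no single connector can reach the reconcile step with a swallowed error.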
Screenshots
Please note that this is an older image, but we are still able to reproduce this in 8.13.2, which is where the errors reported above are from.
Environment
Additional context
The second example is related and is for GitHub:
While I'm by no means an expert in your codebase, for Confluence at least, the following code paths could explain the behavior:
1.) 3rd party (Confluence) closes the session (this should be somewhat expected and recoverable)
2.) The session is closed in a bad way, without nulling it out; it should likely use the internal close_session method to unset the private variable
3.) The call is retried a few times, but the session is not recreated because it was not unset (see the sketch after this list): https://github.com/elastic/connectors/blob/8c8d4a3f7775c6edbb598a81a2b67348a7244f3a/connectors/sources/confluence.py#L158-L170
4.) The paginated API call eats the propagated exception: https://github.com/elastic/connectors/blob/8c8d4a3f7775c6edbb598a81a2b67348a7244f3a/connectors/sources/confluence.py#L194-L198
5.) The wrapper code continues and reconciles the Index state (by updating and deleting)
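For steps 2 and 3, the fix pattern amounts to unsetting the cached session whenever it is closed, so the next retry builds a fresh one. A hedged sketch of that idea — the class, attribute, and method names here are illustrative, not the actual Confluence source:

```python
import aiohttp


class AtlassianClient:
    """Illustrative client that caches a single aiohttp session."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self._session = None

    def _get_session(self) -> aiohttp.ClientSession:
        # Recreate the session if it was never built, or if a failure closed it
        # without unsetting the private variable (the bug described in step 2).
        if self._session is None or self._session.closed:
            self._session = aiohttp.ClientSession()
        return self._session

    async def close_session(self):
        if self._session is not None:
            await self._session.close()
            self._session = None  # unset, so the next retry gets a live session
```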
Step 4 is the problem in this connector's case.
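Reduced to a sketch, the step-4 anti-pattern looks like the first generator below (`fetch_page` and the field names are hypothetical): catching every exception inside the pagination loop makes an aborted crawl indistinguishable from a finished one, which is exactly what lets the wrapper proceed to delete documents.

```python
async def paginated_call_buggy(fetch_page, url):
    while url:
        try:
            page = await fetch_page(url)
        except Exception:
            return  # swallows the error; the caller sees a "complete" iteration
        for item in page["results"]:
            yield item
        url = page.get("next")


async def paginated_call_fixed(fetch_page, url):
    while url:
        page = await fetch_page(url)  # let failures propagate and fail the sync
        for item in page["results"]:
            yield item
        url = page.get("next")
```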
The above links are pinned to a commit from 8.13.2, but I am also able to find a similar smell for Jira in the latest main branch (main branch corresponding problem), though the error handling is more robust there (example), while Confluence does not handle rate limits at all.