elastic / connectors

Source code for all Elastic connectors, developed by the Search team at Elastic, and home of our Python connector development framework
https://www.elastic.co/guide/en/enterprise-search/master/index.html

Sync job gets marked as Succeeded even if there was an error that ended the job prematurely #2393

Open gcasar opened 6 months ago

gcasar commented 6 months ago

Bug Description

After an internal connector error prematurely ends a Full sync, the Sync job proceeds, reconciles the index by deleting documents that were previously in the system, and is marked as Sync complete.

To Reproduce

Steps to reproduce the behavior:

This behavior is hard to reproduce reliably because it is 1) connector specific and 2) relies on a third-party error. There are at least two bugs that surface this problem.

Please note that the example I'm giving below is not the only place this happens.

06:59:51.107
enterprise_search.connectors
[enterprise_search.connectors][warning] Skipping data for type content from https://stackoverflow.atlassian.net/wiki/rest/api/content/search?next=true&cursor=_f_NTA%3D_sa_WyJcdDM2NjM5OTMwIFI2KF8wZUpoJWhYdDA8L2w8a2NcIiBjcCJd&expand=children.attachment,history.lastUpdated,body.storage,space,space.permissions,restrictions.read.restrictions.user,restrictions.read.restrictions.group&limit=50&start=285750&cql=space+in+('LTDD','BATD','MADCOW')+AND+type=page. Exception: Response payload is not completed.
06:59:53.811
enterprise_search.connectors
[enterprise_search.connectors][info] Sync ended with status completed -- created: 0 | updated: 285753 | deleted: 23748 (took 14338 seconds)

Expected behavior

When an exception occurs that forces a Full sync to end prematurely, the Sync job should be marked as Failed and the existing documents in the index should not be deleted.

As a user of the connectors I expect this problem to be handled in a shared way, so that uncaught exceptions in connectors don't get silently discarded and give a false sense of completed syncs.
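
To make that expectation concrete, here is a minimal sketch of what such shared handling could look like in a generic sync loop. All names (`run_full_sync`, `data_source`, `index`, `SyncJobFailed`) are hypothetical and not part of the framework; the point is only that an error raised while iterating documents should abort the job before the deletion/reconciliation phase runs:

```python
# Hypothetical sketch, not the framework's actual code: if the document
# stream raises, mark the job failed and never run the deletion phase.
class SyncJobFailed(Exception):
    """Raised when the source stops yielding documents because of an error."""


async def run_full_sync(data_source, index):
    seen_ids = set()
    try:
        async for doc in data_source.get_docs():  # may raise mid-stream
            seen_ids.add(doc["_id"])
            await index.upsert(doc)
    except Exception as e:
        # An incomplete stream does not mean the missing documents were
        # deleted at the source, so we must not reconcile.
        raise SyncJobFailed("source iteration ended prematurely") from e

    # Only delete documents that vanished at the source once the stream
    # has completed without errors.
    await index.delete_missing(keep_ids=seen_ids)
```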

Screenshots

image

Please note that this is an older image, but we are still able to reproduce this in 8.13.2, which is the version the errors reported above are from.

Environment

Additional context

The second example is related and comes from the GitHub connector:

06:39:29.252
enterprise_search.connectors
[enterprise_search.connectors][warning] Something went wrong while fetching the issues. Exception: 403, message='Forbidden', url=URL('https://api.github.com/graphql')
  File "/usr/share/enterprise-search/lib/python3.10/site-packages/connectors/sources/github.py", line 1542, in _fetch_issues
    async for response in self.github_client.paginated_api_call(
/... several lines omitted .../
06:39:29.253
enterprise_search.connectors
[enterprise_search.connectors][info] Fetching files from repo: 'elastic/elasticsearch' (branch: 'main')
06:39:40.101
enterprise_search.connectors
[enterprise_search.connectors][info] Sync ended with status completed -- created: 2543 | updated: 34290 | deleted: 34693 (took 13176 seconds)

While I'm by no means an expert in your codebase, for Confluence at least the following code paths could explain the behavior:

1.) The third party (Confluence) closes the session (this should be somewhat expected and recoverable).

2.) The session is closed in a bad way, without nulling it out; it should likely use the internal close_session method to unset the private variable.

3.) The call is retried a few times, but the session is not recreated because it was not unset: https://github.com/elastic/connectors/blob/8c8d4a3f7775c6edbb598a81a2b67348a7244f3a/connectors/sources/confluence.py#L158-L170

4.) The paginated API call eats the propagated exception: https://github.com/elastic/connectors/blob/8c8d4a3f7775c6edbb598a81a2b67348a7244f3a/connectors/sources/confluence.py#L194-L198

5.) The wrapper code continues and reconciles the index state (by updating and deleting).

Step 4 is the problem in this connector's case.
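
To illustrate steps 3-4, here is a simplified sketch of the swallowing pattern (with hypothetical names, not the actual confluence.py code): the paginator catches the exception raised by the failing API call and simply returns, so the caller sees an apparently complete document stream and proceeds to reconcile the index.

```python
# Simplified sketch of the pattern described in steps 3-4; names are
# illustrative, not the real confluence.py implementation.
async def paginated_api_call(self, url):
    while url:
        try:
            # The call is retried internally, but because the closed session
            # was never unset, every retry reuses the dead session and fails.
            response = await self.api_call(url)
        except Exception as exception:
            self._logger.warning(f"Skipping data from {url}. Exception: {exception}.")
            return  # swallows the error: the caller thinks pagination finished
        yield response
        url = response.get("_links", {}).get("next")
```

Re-raising here (or raising a dedicated fatal-error type that the sync orchestrator understands) would let the job be marked as failed instead of reconciling against partial data.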

The above links are pinned to a commit from 8.13.2, but I am also able to find a similar smell in the latest main branch for Jira (main branch corresponding problem), although the error handling is more robust there (example), while Confluence does not handle rate limits at all.

### Subtasks
- [ ] https://github.com/elastic/connectors/issues/2394
- [ ] https://github.com/elastic/connectors/issues/2395
- [ ] https://github.com/elastic/connectors/issues/2396
- [ ] https://github.com/elastic/connectors/issues/2397
- [ ] https://github.com/elastic/connectors/issues/2403
seanstory commented 6 months ago

Hi @gcasar , thanks for filing.

> 4.) [Something] eats the propagated exception
>
> Step 4 is the problem in this connector's case.

Agreed, this looks like the problem. We try to balance between the ideologies of:

While we shoot for a reasonable balance, it looks like you've definitely identified a few instances here where we've missed the mark. Unfortunately for us, and as you've already identified, fixing this needs to be done on a case-by-case (and connector-by-connector) basis. Looks like you've identified at least two instances here:

On top of those, looks like the underlying bugs that cause this to be a problem are:

If the above sounds right to you, we can make those 4 into discrete subtasks, and work to prioritize them.

gcasar commented 6 months ago

Thank you for the fast response; the balancing makes sense in this context and seems like a smart approach 😄

I see how splitting them up would help. In that case I'd add one more:

I'll follow up on this tomorrow morning if you'd like me to split the issue up myself - I'm happy to do that!

seanstory commented 6 months ago

> I see how splitting them up would help

Excellent, I'll do that now.

> I'd add one more ... if you'd like me to split the issue up myself - I'm happy to do that!

I've got enough info to create the first 4 subtasks I mentioned. If you'd create a separate issue for the confluence rate limiting (definitely something we should handle) and provide relevant logs/traces/links, that would be excellent. Thanks!

gcasar commented 6 months ago

I was not able to replicate the Confluence 429 that we encountered previously, so I can't get a proper bug report back to you yet. I don't want to detract from the others in that list until I can do a more diligent report on my end and reproduce it. Currently I can highlight the discrepancy between Jira and Confluence: Jira handles it nicely, while Confluence retries a few times without waiting. Sadly, there are no docs from Atlassian around this, but it might be fair to assume that both systems behave in a similar way.

This is the code that I'm referring to: https://github.com/elastic/connectors/blame/8c8d4a3f7775c6edbb598a81a2b67348a7244f3a/connectors/sources/jira.py#L168-L182
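
For comparison, here is a hedged sketch of the kind of 429 handling the Jira code linked above appears to do: wait for the server-provided `Retry-After` before retrying, instead of retrying immediately a fixed number of times. The function name, constants, and structure below are illustrative, not the actual implementation:

```python
import asyncio

import aiohttp

DEFAULT_RETRY_AFTER = 60  # seconds, used when the header is missing (assumption)
MAX_RETRIES = 3


async def get_with_rate_limit(session: aiohttp.ClientSession, url: str):
    """Illustrative only: honor Retry-After on 429 instead of hammering the API."""
    for _ in range(MAX_RETRIES):
        async with session.get(url) as response:
            if response.status == 429:
                retry_after = int(response.headers.get("Retry-After", DEFAULT_RETRY_AFTER))
                await asyncio.sleep(retry_after)
                continue
            response.raise_for_status()
            return await response.json()
    raise RuntimeError(f"Still rate limited after {MAX_RETRIES} attempts: {url}")
```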

seanstory commented 6 months ago

Ok, no worries. I've just made a pretty minimal issue in https://github.com/elastic/connectors/issues/2403. Please add any details/stack traces/logs/etc. to that if you hit it again. Otherwise, we'll work with what we have.