NCATSTranslator / Relay

Autonomous relay system for NCATS Biomedical Data Translator
MIT License

Result counts returned to the user are inconsistent depending on whether the annotator is returning annotations #681

Open sstemann opened 3 months ago

sstemann commented 3 months ago

ARS in Test has a recurring but intermittent issue that has two symptoms:

  1. UI only displays the first/fastest ARA (usually Improve)
  2. UI displays no results

But when I look at these PKs in the ARAX GUI, I'm seeing the majority of ARAs respond.

So if I then run the same query again, sometimes I get all expected results in the UI and sometimes I get whichever case (1 or 2) I didn't start with. It's very frustrating for testing.

I also think we cannot go to Prod with this behavior, as it's very inconsistent for users.

ARS Error 444.xlsx

ShervinAbd92 commented 3 months ago

Hi @sstemann, I investigated all the PKs you shared in this file. For the partially shown UI results, what happens is that the first ARA returns and the node annotator behaves normally, but as soon as the second ARA sends results to the annotator we get errors of the following form:
Connection broken: IncompleteRead(??? bytes read, ??? more expected)', IncompleteRead(??? bytes read, ??? more expected))
For some cases we are getting this error from the get-go, so I believe the annotator may need to optimize its config settings to handle the different incoming payload sizes. What do you think @newgene?

Screen Shot 2024-08-09 at 12 02 59 PM
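On the ARS side, a truncated annotator response like the one above surfaces from requests/urllib3 as a "Connection broken: IncompleteRead" error. As a rough illustration of how the calling code could tolerate it, here is a minimal sketch that retries the POST when the body is truncated; the endpoint URL, payload shape, and timeouts are placeholders, not the actual ARS code.

```python
import time

import requests
from requests.exceptions import ChunkedEncodingError, ConnectionError as RequestsConnectionError

# Hypothetical endpoint and payload shape; the real values live in the ARS config/code.
ANNOTATOR_URL = "https://annotator.example.org/annotate"


def annotate_nodes(node_ids, retries=3, backoff=2.0):
    """POST a batch of node CURIEs and retry when the response body is truncated.

    urllib3 reports a truncated body as IncompleteRead, which requests surfaces
    as ChunkedEncodingError or ConnectionError ("Connection broken: IncompleteRead...").
    """
    payload = {"ids": node_ids}
    for attempt in range(1, retries + 1):
        try:
            resp = requests.post(ANNOTATOR_URL, json=payload, timeout=(10, 300))
            resp.raise_for_status()
            return resp.json()
        except (ChunkedEncodingError, RequestsConnectionError) as err:
            # Truncated/interrupted response: log and retry with backoff instead of
            # silently dropping this ARA's annotations.
            print(f"attempt {attempt}: annotator response truncated: {err}")
            time.sleep(backoff * attempt)
    raise RuntimeError("annotator never returned a complete response")
```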
sierra-moxon commented 3 months ago

from TAQA: this could be an annotator stability issue? Right now the UI is not disambiguating between fatal and non-fatal error codes, and it won't display results if the annotator is not returning a 200.
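For concreteness, a minimal sketch of the distinction described above: treat a non-200 from the annotator as a degraded but non-fatal outcome and still surface the ARA's results, rather than hiding them. The status names and message shape below are illustrative only, not the actual UI/ARS schema.

```python
# Illustrative only: status names and message fields are not the actual UI/ARS schema.
FATAL_STATUSES = {"Error"}  # the ARA's query itself failed; nothing usable to show


def displayable_results(ara_messages):
    """Collect results from every ARA whose failure mode is non-fatal.

    A missing annotation (annotator returned non-200) downgrades to a warning
    instead of suppressing the ARA's results entirely.
    """
    results = []
    for msg in ara_messages:
        if msg.get("status") in FATAL_STATUSES:
            continue
        if not msg.get("annotated", False):
            msg["warning"] = "annotation unavailable (annotator returned a non-200 response)"
        results.extend(msg.get("results", []))
    return results
```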

MarkDWilliams commented 3 months ago

It's not quite clear to me yet whether the connection is failing to finish because of an issue on the ARS side or because the Annotator is failing to finish transmitting the data. In the past, when we've seen similar errors on the ARS side, they came down to data-size/timeout limitations in the configuration of the deployment. However, the payloads the Annotator is dealing with aren't much bigger than the ARA returns coming back, and those seem to be getting received and written to the database fine.

maximusunc commented 3 months ago

Based on the automated tests from this past weekend (8/11), this issue is also observed in CI.

ShervinAbd92 commented 3 months ago

Update: ARS is going to implement Brotli compression on nodes before sending them to the annotator. This is expected to help with the large data sizes that the annotator currently seems to have trouble processing.
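A minimal sketch of that idea, assuming the ARS posts the node payload as JSON over HTTP and the annotator (or a proxy in front of it) can decode a Brotli-encoded request body; the URL and payload shape are placeholders, not the actual ARS/annotator interface.

```python
import json

import brotli  # pip install Brotli
import requests

# Hypothetical endpoint; the real annotator URL and payload shape come from the ARS config.
ANNOTATOR_URL = "https://annotator.example.org/annotate"


def post_compressed_nodes(nodes: dict) -> requests.Response:
    """Brotli-compress the node payload before sending it to the annotator.

    The receiving side must be configured to decode a 'Content-Encoding: br'
    request body for this to work.
    """
    raw = json.dumps(nodes).encode("utf-8")
    compressed = brotli.compress(raw, quality=5)  # lower quality trades ratio for CPU
    headers = {
        "Content-Type": "application/json",
        "Content-Encoding": "br",
    }
    return requests.post(ANNOTATOR_URL, data=compressed, headers=headers, timeout=300)
```

Note that compression trades network transfer size for CPU on both ends, which is relevant to the compression-middleware bottleneck mentioned later in this thread.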

sstemann commented 2 months ago

We're not moving forward with the new Annotator Service in Fugu.

@ShervinAbd92 has implemented a "local" annotator in CI, and we may deploy ARS out of cycle. I believe this is not the permanent solution.

sierra-moxon commented 2 months ago

from TAQA: this is intermittent and hard to debug; it involves ITRB and Annotator interaction/debugging on a dev instance. The shim is a local annotator service inside the ARS in CI (to avoid ITRB issues), and we are moving that forward to TEST. On a parallel track, the annotator team is working to stabilize the standalone service; we are waiting on the Guppy deployment to TEST to test that. Plan: push the ARS-contained annotator service to PROD off cycle, then go back to working on the standalone service.

ctrl-schaff commented 2 months ago

Some notes on the debugging effort for the annotator: I've disabled the compression middleware we were applying, due to the CPU bottleneck it seemed to create when handling the responses. I've taken a batch of queries from our annotation logs and am now trying to systematically recreate the issue by querying a local instance of the server versus the CI environment to highlight any differences in responses. As I find more information from testing I'll keep you informed.
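A rough sketch of that replay-and-compare approach, assuming the logged queries are stored as a JSON list of request payloads; the endpoint URLs and log format are assumptions, not the actual deployment details.

```python
import json

import requests

# Assumed locations; the real endpoints and log format will differ.
LOCAL_URL = "http://localhost:9000/annotate"
CI_URL = "https://annotator.ci.example.org/annotate"


def replay_and_compare(logged_queries_path: str):
    """Replay logged annotation queries against both deployments and report differences."""
    with open(logged_queries_path) as fh:
        queries = json.load(fh)  # assume a list of request payloads

    for i, payload in enumerate(queries):
        local = requests.post(LOCAL_URL, json=payload, timeout=300)
        ci = requests.post(CI_URL, json=payload, timeout=300)
        if local.status_code != ci.status_code:
            print(f"query {i}: local={local.status_code} ci={ci.status_code}")
        elif local.content != ci.content:
            print(f"query {i}: same status ({local.status_code}) but different bodies "
                  f"({len(local.content)} vs {len(ci.content)} bytes)")
```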

ShervinAbd92 commented 3 days ago

Updates from Johnathan, 11/10/2024: He still hasn't found a solid reason why the annotator returns HTTP 500 under load, but he has released a version of the "biothings_client" package that provides an async client, and he plans to integrate it into the annotator server, which in turn should reduce blocking. They are also planning to update their deployment with a reverse proxy (Caddy) to handle compression outside of the web server.
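The exact async API in the released biothings_client is not shown here; purely as an illustration of the non-blocking pattern being described, a sketch using asyncio with httpx (an assumption, not the annotator's actual stack) might look like this:

```python
import asyncio

import httpx

# Illustration only: the released async biothings_client API may look different.
ANNOTATOR_URL = "https://annotator.example.org/annotate"


async def annotate_batches(batches):
    """Send annotation batches concurrently so one slow request doesn't block the rest."""
    async with httpx.AsyncClient(timeout=300.0) as client:
        tasks = [client.post(ANNOTATOR_URL, json={"ids": batch}) for batch in batches]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
    return responses


# Example usage:
# asyncio.run(annotate_batches([["NCBIGene:1017"], ["CHEBI:15365"]]))
```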