Closed: amnonkhen closed this issue 3 years ago
After some investigation we found an issue with the state tracker in the logs:
```
2021-01-29 17:41:18.846 ERROR 6 --- [ main] o.h.i.state.persistence.RedisPersister : Failed to retrieve state machine with id: 7210fb46-add2-418f-905a-13f6d4f52753
com.esotericsoftware.kryo.KryoException: Unable to find class: org.humancellatlas.ingest.state.monitor.util.BundleTracker
at com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:156) ~[kryo-shaded-3.0.3.jar!/:na]
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133) ~[kryo-shaded-3.0.3.jar!/:na]
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) ~[kryo-shaded-3.0.3.jar!/:na]
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) ~[kryo-shaded-3.0.3.jar!/:na]
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:161) ~[kryo-shaded-3.0.3.jar!/:na]
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:39) ~[kryo-shaded-3.0.3.jar!/:na]
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790) ~[kryo-shaded-3.0.3.jar!/:na]
...
```
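The `KryoException` above is a class-resolution failure: Kryo records the fully-qualified class name in the serialized payload (that is what `DefaultClassResolver.readName` reads back) and looks the class up reflectively at deserialization time, so old Redis entries break once the recorded class (here `BundleTracker`) is renamed or removed from the codebase. A standard-library-only sketch of that resolution step, not Kryo itself:

```java
// Sketch of why Kryo's readName fails: serialized payloads embed the
// fully-qualified class name, which is resolved reflectively at read time.
public class KryoNameResolutionSketch {
    // Mimics the resolution step: true if the named class still exists on
    // the classpath, false otherwise (Kryo wraps this failure in a
    // KryoException: "Unable to find class: ...").
    public static boolean canResolve(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A class that still exists resolves fine.
        System.out.println(canResolve("java.util.HashMap")); // true
        // The class name recorded in the old Redis payload no longer exists,
        // so restoring the persisted state machine fails.
        System.out.println(canResolve(
                "org.humancellatlas.ingest.state.monitor.util.BundleTracker")); // false
    }
}
```

This is also why the error only appears when old payloads are re-read after a deployment, not when new state machines are persisted.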
It seems that my initial thoughts in this PR were not the cause of the problem. That said, it is odd that we have both findByUuid and findByUuidUuid endpoints for submissions when other metadata types do not; we should fix that, but it is not urgent.
Findings for today:
There were errors when the state tracker was deployed 16 days ago, on 29th January. The errors were not raised; they were logged silently, and since the logs weren't checked, they were not discovered at the time. See logs.
There were 2 error types:
```
2021-01-29 17:41:20.720 ERROR 6 --- [ main] o.h.ingest.state.persistence.AutoLoader : Failed to autoload statemachine for envelope with UUID: e984c0a8-7a8d-4a23-ac2a-596b3ab2b128
2021-02-15 16:38:05.230 WARN 6 --- [ main] o.h.ingest.state.persistence.AutoLoader : HTTP error: Failed to retrieve envelope reference for envelope with UUID: d2eb03b7-4736-4c11-b01d-ab4d852133a2
```
TLDR:
There were issues restoring the state machines in the state tracker, and the state tracker database is out of sync with Ingest. We need to investigate further why the data is out of sync, and then resync it.
There is a workaround for valid submissions: set a metadata document in the submission back to Draft and wait for the submission to go back to Valid before exporting (clicking Submit).
```
curl -X PUT -H "Authorization: Bearer $TOKEN" <insert-biomaterial-url>/draftEvent
```
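The same request can also be built from the JVM side with the JDK's java.net.http client; a hedged sketch of that, where the biomaterial URL and token are placeholders and none of this is an actual Ingest client API:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class DraftEventRequest {
    // Builds the PUT <biomaterial-url>/draftEvent request that the curl
    // workaround sends. Both arguments are placeholders supplied by the caller.
    public static HttpRequest build(String biomaterialUrl, String token) {
        return HttpRequest.newBuilder(URI.create(biomaterialUrl + "/draftEvent"))
                .header("Authorization", "Bearer " + token)
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical biomaterial URL, for illustration only.
        HttpRequest req = build("https://example.org/biomaterials/1234", "TOKEN");
        System.out.println(req.method() + " " + req.uri());
        // To actually send it:
        // HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
    }
}
```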
I was able to set up the state tracker, core, and validator (plus broker and UI for easier testing) locally, and was able to debug and replicate the issue.
Here are some of the things I've found/confirmed:
~~1. There's a circular dependency between the state tracker and core.~~ ~~Core uses HTTP to send messages to the state tracker, so it needs the IP/URL of the service.~~ ~~The state tracker queries core for submission envelope data by UUID to load its state during initialisation.~~ ~~There is no issue in deploying the state tracker after core at the moment. The state tracker service IP does not seem to change within the cluster, but I think this is not guaranteed.~~ ~~We probably want to change how core sends messages to the state tracker so that it doesn't need the state tracker service endpoint. This used to go via a RabbitMQ topic exchange, but because message order wasn't guaranteed it was quickly changed to HTTP, and I believe that is when the circular dependency was introduced. We should be able to use a RabbitMQ direct exchange if we want message order guaranteed; this would also make the sending of events between our internal components consistent.~~ (2021/02/15 update) I need to check further why I thought there was an issue deploying core before the state tracker.
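On the message-ordering point above: routing all of one submission's state events through a single queue gives FIFO delivery to a single consumer, which is the guarantee the switch to HTTP was meant to provide. A minimal sketch of that property using a standard-library queue as a stand-in (this is not the RabbitMQ client API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class OrderedEventQueueSketch {
    // One queue per submission preserves publish order for a single
    // consumer -- the per-queue FIFO guarantee a RabbitMQ direct exchange
    // routing on the submission UUID would give the state tracker.
    public static List<String> deliverInOrder(List<String> events) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(Math.max(1, events.size()));
        for (String e : events) {
            queue.offer(e); // publisher side: core publishes a state event
        }
        List<String> received = new ArrayList<>();
        String e;
        while ((e = queue.poll()) != null) {
            received.add(e); // consumer side: state tracker consumes in order
        }
        return received;
    }

    public static void main(String[] args) {
        // prints [Draft, Validating, Valid, Submitted] -- same order out as in
        System.out.println(deliverInOrder(List.of("Draft", "Validating", "Valid", "Submitted")));
    }
}
```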
The state tracker's state machines are persisted every minute. If there are errors loading state machines during initialisation/deployment, the broken state still gets persisted. This is most likely the reason the state tracker data is corrupted/out of sync with the Ingest DB.
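One mitigation would be to have the periodic persist skip any machine whose restore failed, so a broken load never overwrites the last good copy in Redis. A hypothetical sketch (these are stand-in names, not the real RedisPersister API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GuardedPersistSketch {
    private final Map<String, String> machines = new HashMap<>();  // id -> state machine context
    private final Set<String> failedRestores = new HashSet<>();    // ids whose restore errored

    public void restored(String id, String context) {
        machines.put(id, context);
    }

    public void restoreFailed(String id, String emptyDefault) {
        machines.put(id, emptyDefault); // a fresh machine may still be created in memory...
        failedRestores.add(id);         // ...but we remember that its restore failed
    }

    // The every-minute persist job: only snapshot machines that loaded
    // cleanly, so corrupt or half-initialised state is never written back
    // over the previously stored good state.
    public Map<String, String> snapshotForPersist() {
        Map<String, String> snapshot = new HashMap<>(machines);
        snapshot.keySet().removeAll(failedRestores);
        return snapshot;
    }
}
```

The design choice here is simply to make persistence conditional on a clean load, rather than unconditionally snapshotting whatever is in memory.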
The data we store in Redis is the state machine context object and is not in a human-readable format. If we're going to correct the data, we could probably do it in code during the state tracker's loading/initialisation.
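If we do correct the data in code at load time, one shape for it is: try the persisted context first, and on a deserialization failure discard the blob and rebuild from the envelope state held in Ingest. A hypothetical sketch (restoreFromRedis and stateFromIngest are stand-ins, not real state tracker methods):

```java
import java.util.Optional;
import java.util.function.Function;

public class AutoLoadRepairSketch {
    // Tries the persisted context first; on any deserialization failure,
    // falls back to rebuilding from the envelope state held by Ingest.
    public static String loadState(String envelopeUuid,
                                   Function<String, Optional<String>> restoreFromRedis,
                                   Function<String, String> stateFromIngest) {
        try {
            return restoreFromRedis.apply(envelopeUuid)
                    .orElseGet(() -> stateFromIngest.apply(envelopeUuid));
        } catch (RuntimeException deserializationError) { // e.g. a KryoException
            return stateFromIngest.apply(envelopeUuid);
        }
    }

    public static void main(String[] args) {
        // Simulated corrupt payload: restore throws, so we rebuild from Ingest.
        String state = loadState("e984c0a8-7a8d-4a23-ac2a-596b3ab2b128",
                uuid -> { throw new RuntimeException("Unable to find class: BundleTracker"); },
                uuid -> "SUBMITTED");
        System.out.println(state); // SUBMITTED
    }
}
```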
Actions:
The AutoLoader process in the state tracker ran successfully in dev:
https://app.zenhub.com/files/299258791/357914b6-69a7-40da-992d-656f2997cb86/download
The integration test is passing in dev: https://gitlab.ebi.ac.uk/hca/ingest-integration-tests/-/jobs/371351
PR is merged: https://github.com/ebi-ait/ingest-state-tracking/pull/4 and deployed in staging. The integration test is passing in staging: https://gitlab.ebi.ac.uk/hca/ingest-integration-tests/-/pipelines/136211
This was noticed on 11/2 by @rays22 (see #175). It can be traced to a deployment of the exporter on 29/1.