ebi-ait / dcp-ingest-central

Central point of access for the Ingestion Service of the HCA DCP
Apache License 2.0

project status does not change to "Exporting" #178

Closed amnonkhen closed 3 years ago

amnonkhen commented 3 years ago

This was noticed on 11/2 by @rays22, see #175. It can be traced to a deployment of the exporter on 29/1.

jacobwindsor commented 3 years ago

After some investigation we found an issue with the state tracker in the logs:

2021-01-29 17:41:18.846 ERROR 6 --- [           main] o.h.i.state.persistence.RedisPersister   : Failed to retrieve state machine with id: 7210fb46-add2-418f-905a-13f6d4f52753
com.esotericsoftware.kryo.KryoException: Unable to find class: org.humancellatlas.ingest.state.monitor.util.BundleTracker
        at com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:156) ~[kryo-shaded-3.0.3.jar!/:na]
        at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133) ~[kryo-shaded-3.0.3.jar!/:na]
        at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) ~[kryo-shaded-3.0.3.jar!/:na]
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) ~[kryo-shaded-3.0.3.jar!/:na]
        at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:161) ~[kryo-shaded-3.0.3.jar!/:na]
        at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:39) ~[kryo-shaded-3.0.3.jar!/:na]
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790) ~[kryo-shaded-3.0.3.jar!/:na]

...

It seems that my initial thoughts in this PR were not the reason for the problem. That said, it is odd that we have findByUuid and findByUuidUuid endpoints for submissions when this is not the case for other metadata types. We should fix that, but it is not urgent.

aaclan-ebi commented 3 years ago

Findings for today:

There were errors when the state tracker was deployed 16 days ago, on 29th January. The errors were logged but not raised, and since the logs weren't checked, they were not discovered at the time. See logs.

There were 2 error types:

  1. There were some connection errors in retrieving the submission envelope from core. The errors happened for Ray's and Marion's submissions (cb156730-90b0-4b77-944c-bfc263204c61, feebef33-476f-42e4-9bfa-6da47fb0d0f9)
    2021-01-29 17:41:20.720 ERROR 6 --- [           main] o.h.ingest.state.persistence.AutoLoader  : Failed to autoload statemachine for envelope with UUID: e984c0a8-7a8d-4a23-ac2a-596b3ab2b128
  2. There were also 2 errors about a missing BundleTracker class.
    • We didn't investigate this, as it does not affect the submission envelopes we want to export

TLDR:

  1. There were issues in restoring the state machines in the state tracker, and the state tracker database is out of sync with Ingest. We need to investigate further why the data is out of sync, and we need to sync the data.

  2. There's a workaround for valid submissions. We could set one metadata document in that submission (for example a biomaterial) back to Draft and wait for the submission to go to Valid again before exporting (clicking submit).

curl -X PUT -H "Authorization: Bearer $TOKEN" <insert-biomaterial-url>/draftEvent

aaclan-ebi commented 3 years ago

I was able to set up the state tracker, core and validator (plus broker and UI for easier testing) locally, and was able to replicate and debug the issue.

Here are some of the things I've found/confirmed:

~1. There's a circular dependency between state tracker and core.~ ~Core is using http to send messages to the state tracker and would need the ip/url of the service.~ ~State tracker queries core with some info from the submission envelope using uuid to load the data during its initialisation~ ~There is no issue in deploying the state tracker after core at the moment. The state tracker service ip seems not to change within the cluster, but I think this is not guaranteed.~ ~We probably would want to change core -> state tracker's sending of messages so that it wouldn't need the state tracker service/endpoint. It used to be via RabbitMQ topic exchange, but because the order of messages wasn't guaranteed, it was quickly changed to http, and I believe that's when the circular dependency was introduced. I believe we should be able to use RabbitMQ direct exchange if we want to make the order of messages guaranteed. This would also make the sending of events between our internal components consistent.~ (2021/02/15 update) I need to check further why I thought there was an issue if we deploy core before the state tracker.

  1. The state tracker's state machines are persisted every minute. If there are errors in loading the state machines during initialisation/deployment, the resulting faulty state gets persisted too. This is most likely the reason why the state tracker data is corrupted/out of sync with the ingest db.

  2. The data we store in Redis is the state machine context object, which is not in a human-readable format. If we're going to correct the data, we could probably do it in code during the state tracker's loading/initialisation.
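
The failure mode in finding 1 can be sketched in a few lines. This is a simplified model, not the state tracker's code: a dict stands in for Redis, plain strings stand in for serialized state machine contexts, and the names `load_state_machines`, `persist` and the "Pending" fallback state are all hypothetical. It assumes a failed load leaves a default machine in memory, which the periodic persistence step then writes over the good stored copy.

```python
# Hypothetical model: `store` stands in for Redis, strings for state machine
# contexts. None of these names come from the state tracker code base.

def load_state_machines(store, envelope_ids, fetch_envelope):
    """Restore machines at start-up; failed loads are recorded, not raised."""
    machines, failed = {}, []
    for envelope_id in envelope_ids:
        try:
            fetch_envelope(envelope_id)  # stands in for the call to core
            machines[envelope_id] = store[envelope_id]
        except ConnectionError:
            failed.append(envelope_id)
            # Assumption: a failed load leaves a default machine in memory.
            machines[envelope_id] = "Pending"
    return machines, failed


def persist(store, machines, skip=()):
    """Periodic step: write in-memory machines back, except those in `skip`."""
    for envelope_id, state in machines.items():
        if envelope_id not in skip:
            store[envelope_id] = state


def flaky_fetch(envelope_id):
    """Simulate a connection error for one envelope only."""
    if envelope_id == "b":
        raise ConnectionError("core unreachable")


store = {"a": "Valid", "b": "Valid"}
machines, failed = load_state_machines(store, ["a", "b"], flaky_fetch)
persist(store, machines)  # unguarded: the stored state of "b" is overwritten
```

Passing `skip=failed` to `persist` would leave the stored copy of "b" untouched, which is one possible guard under these assumptions.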

Actions:

  1. Update the state tracker code to sync/correct the data during initialisation. This is urgent to proceed with exporting data.
  2. Critical errors like errors in loading state machines should be raised during state-tracker initialisation.
  3. Remove the cyclic dependency.
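
Actions 1 and 2 might look roughly like the sketch below. It is an illustration under assumptions, not the change that was made: `initialise`, `resync`, `StateMachineLoadError` and the `load_one` callable are hypothetical names, states are plain strings, and core is assumed to be the source of truth when the two disagree.

```python
class StateMachineLoadError(RuntimeError):
    """Raised when one or more state machines cannot be restored at start-up."""


def initialise(envelope_ids, load_one):
    """Load every state machine, then fail loudly if any load failed.

    `load_one` is a hypothetical callable that returns the restored state for
    an envelope UUID or raises when the load fails.
    """
    machines, errors = {}, {}
    for envelope_id in envelope_ids:
        try:
            machines[envelope_id] = load_one(envelope_id)
        except Exception as exc:  # collect all failures, report them once
            errors[envelope_id] = exc
    if errors:
        raise StateMachineLoadError(
            f"failed to load {len(errors)} state machine(s): {sorted(errors)}"
        )
    return machines


def resync(tracker_states, core_states):
    """Overwrite tracker states that disagree with core; return corrected UUIDs.

    Assumption: core holds the authoritative submission state.
    """
    corrected = []
    for envelope_id, core_state in core_states.items():
        if tracker_states.get(envelope_id) != core_state:
            tracker_states[envelope_id] = core_state
            corrected.append(envelope_id)
    return corrected
```

Raising after the loop rather than on the first failure means a single start-up attempt reports every envelope that could not be restored.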

aaclan-ebi commented 3 years ago

The AutoLoader process in the state tracker ran successfully in dev:

https://app.zenhub.com/files/299258791/357914b6-69a7-40da-992d-656f2997cb86/download

The integration test is passing in dev: https://gitlab.ebi.ac.uk/hca/ingest-integration-tests/-/jobs/371351

aaclan-ebi commented 3 years ago

PR is merged: https://github.com/ebi-ait/ingest-state-tracking/pull/4 and deployed in staging. The integration test is passing in staging: https://gitlab.ebi.ac.uk/hca/ingest-integration-tests/-/pipelines/136211

aaclan-ebi commented 3 years ago

Released to prod today: https://github.com/ebi-ait/ingest-kube-deployment/blob/master/production/changelog.md#29-january-2021

https://embl-ebi-ait.slack.com/archives/CSDN3M2P5/p1615375886015000