chanzuckerberg / single-cell-data-portal

The data portal supporting the submission, exploration, and management of projects and datasets to cellxgene.
MIT License
63 stars, 12 forks

As a developer, I want WMG to have more logging, monitoring, and alerts so that the system is safe to operate and debug in the production environment #4691

Closed prathapsridharan closed 1 year ago

prathapsridharan commented 1 year ago

The tasks for this story are roughly captured in the Observability, Operations and Safety portion of the design doc; further detail can be found in:

1) Metrics and Alerts
2) Validation

Acceptance Criteria: System checks that the dataset to read exists before deployment; system can roll back; alerts fire for alertable events.
Parallelizable: Yes.
Implementation Work: CI hook to check that the dataset exists; alerts setup; testing rollback and alerting; logging and monitoring.
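A CI hook for the "dataset exists" check could look roughly like the sketch below. It is not the project's actual hook. It assumes the dataset is a snapshot stored under a layout of the form `snapshots/<schema version>/<snapshot id>/`, and it uses a local directory as a stand-in for the S3 bucket. The function names `snapshot_exists` and `pre_deploy_check` are hypothetical.

```python
import sys
from pathlib import Path


def snapshot_exists(root: Path, schema_version: str, snapshot_id: str) -> bool:
    """Return True if the snapshot directory exists and contains at least one entry."""
    # Assumed layout: <root>/snapshots/<schema_version>/<snapshot_id>/
    snapshot_dir = root / "snapshots" / schema_version / snapshot_id
    return snapshot_dir.is_dir() and any(snapshot_dir.iterdir())


def pre_deploy_check(root: Path, schema_version: str, snapshot_id: str) -> int:
    """Return a process exit code: 0 if the snapshot is present, 1 otherwise."""
    if snapshot_exists(root, schema_version, snapshot_id):
        return 0
    print(
        f"snapshot {schema_version}/{snapshot_id} not found under {root}",
        file=sys.stderr,
    )
    return 1
```

A CI step could pass the return value of `pre_deploy_check` to `sys.exit` so that a missing snapshot fails the build before deployment. Against a real bucket, the directory test would be replaced by an object listing on the same prefix.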

prathapsridharan commented 1 year ago

UPDATE

The most important things to do for this ticket are the following:

  1. Test the rollback/rollforward capabilities of the WMG API by passing load_snapshot() the data schema version and snapshot id to roll back or roll forward to.
  2. Add logging to the code that reads and writes the snapshot so that we can debug production issues by looking at the logs. Because the pipeline takes several hours, having logs that help us identify a problem is crucial.

Metrics, Alerts, and Validation, as specified in the design doc, can be punted.
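As an illustration of the two items above, the sketch below wraps a snapshot read in logging and takes the schema version and snapshot id explicitly, which is what a rollback or rollforward test needs. It is only a sketch: the real signature of load_snapshot() is not shown in this thread, so the wrapper `load_snapshot_logged`, its `reader` parameter, and the logger name `wmg.snapshot` are all hypothetical.

```python
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("wmg.snapshot")  # hypothetical logger name


def load_snapshot_logged(
    reader: Callable[[str, str], Any],
    schema_version: str,
    snapshot_id: str,
) -> Any:
    """Call reader(schema_version, snapshot_id), logging start, duration, and failures."""
    logger.info(
        "Loading snapshot: schema_version=%s snapshot_id=%s",
        schema_version, snapshot_id,
    )
    started = time.monotonic()
    try:
        snapshot = reader(schema_version, snapshot_id)
    except Exception:
        # logger.exception records the traceback so the failure can be diagnosed from logs.
        logger.exception(
            "Failed to load snapshot: schema_version=%s snapshot_id=%s",
            schema_version, snapshot_id,
        )
        raise
    logger.info(
        "Loaded snapshot: schema_version=%s snapshot_id=%s elapsed_s=%.2f",
        schema_version, snapshot_id, time.monotonic() - started,
    )
    return snapshot
```

Logging the schema version and snapshot id on every message makes it possible to tell from the logs alone which snapshot a given API instance or pipeline run was working with.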

joyceyan commented 1 year ago

Steps for testing backwards compatibility in rdev:

  1. Create a new branch, joyce/wmg-version-test-rdev. Add a superficial commit and create an rdev stack from that commit with
    happy create wmg-version-test --tag sha-5a8867a --create-tag=false --skip-check-tag
  2. Copy the snapshots directory from dev into rdev
    aws s3 sync s3://cellxgene-wmg-dev/snapshots/ s3://env-rdev-wmg/wmg-version-test/snapshots/ --profile single-cell-dev
  3. Copy an old snapshot from dev into rdev
    aws s3 sync s3://cellxgene-wmg-dev/1688676649/ s3://env-rdev-wmg/wmg-version-test/snapshots/v1/old-snapshot-id/ --profile single-cell-dev
  4. Go to https://wmg-version-test-frontend.rdev.single-cell.czi.technology and check Chrome DevTools to verify that it's reading snapshot_id: 1689721444
  5. Push a new commit that sets WMG_API_FORCE_LOAD_SNAPSHOT_ID = old-snapshot-id. Update the rdev stack with
    happy update wmg-version-test --tag sha-82b9a40 --create-tag=false --skip-check-tag
  6. Go to https://wmg-version-test-frontend.rdev.single-cell.czi.technology and check Chrome DevTools to verify that it's reading snapshot_id: old-snapshot-id
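The behavior exercised by these steps can be summarized in a small sketch. It assumes that a forced snapshot id, when set, takes precedence over the latest snapshot, and that the API response body carries a top-level `snapshot_id` field. Both are inferred from the steps above rather than from the actual code, and the function names are hypothetical.

```python
from typing import Mapping, Optional


def resolve_snapshot_id(
    latest_snapshot_id: str,
    forced_snapshot_id: Optional[str] = None,
) -> str:
    """Prefer the forced snapshot id when one is set; otherwise use the latest."""
    return forced_snapshot_id or latest_snapshot_id


def reads_expected_snapshot(
    response_body: Mapping[str, object],
    expected_snapshot_id: str,
) -> bool:
    """Check the snapshot_id field of a response body, as done by hand in steps 4 and 6."""
    # str() allows the field to be either a number or a string.
    return str(response_body.get("snapshot_id")) == expected_snapshot_id
```

With no override, `resolve_snapshot_id("1689721444")` selects the latest snapshot, matching step 4. With `forced_snapshot_id="old-snapshot-id"`, it selects the old snapshot, matching step 6.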