DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0
7 stars 2 forks source link

Automate detection of GitLab migration failures #5308

Open dsotirho-ucsc opened 1 year ago

dsotirho-ucsc commented 1 year ago

Based on Abraham's idea of using the CloudWatch logs to identify GitLab migration errors.

hannes-ucsc commented 1 year ago

Spike to find message indicative of a successful migration. The most recent failed migration occurred before we started exporting GitLab logs to CloudWatch so we won't find an example of that. What we need here is supporting evidence for someone to craft a filter. We either need to be able to predict what a message for a failed migration would look like, or use a combination of filters. For example, if we knew the message for a successful migration was Migration status: ok then the filter would look for messages where the string Migration status: is not followed by the string ok.

achave11-ucsc commented 1 year ago

This might help, from a recent update to GitLab prod.

filter @logStream = '/mnt/gitlab/logs/gitlab-rails/migrations.log'
| sort time desc
| limit 100
| fields time, message, class, current_iteration, severity
[
    {
        "time": "2023-06-21T21:28:40.710Z",
        "message": "Migration finished",
        "class": "AddTextLimitOnOrganizationName",
        "current_iteration": "1",
        "severity": "INFO"
    },
    {
        "time": "2023-06-21T21:28:40.694Z",
        "message": "Lock timeout is set",
        "class": "AddTextLimitOnOrganizationName",
        "current_iteration": "1",
        "severity": "INFO"
    },
    {
        "time": "2023-06-21T21:28:40.127Z",
        "message": "Migration finished",
        "class": "RemoveCiTriggersRefColumn",
        "current_iteration": "1",
        "severity": "INFO"
    },
    {
        "time": "2023-06-21T21:28:40.115Z",
        "message": "Lock timeout is set",
        "class": "RemoveCiTriggersRefColumn",
        "current_iteration": "1",
        "severity": "INFO"
    },
    {
        "time": "2023-06-21T21:28:39.984Z",
        "message": "Migration finished",
        "class": "DropUnusedSequenceByRecreatingVsaTable",
        "current_iteration": "1",
        "severity": "INFO"
    },
    {
        "time": "2023-06-21T21:28:39.963Z",
        "message": "Lock timeout is set",
        "class": "DropUnusedSequenceByRecreatingVsaTable",
        "current_iteration": "1",
        "severity": "INFO"
    }
]

… there were 70 records matching this Insights query.

dsotirho-ucsc commented 1 year ago

@hannes-ucsc: "We might be able to find the exact message that indicates failure by looking at what other people have posted on the Interwebs about the failed migrations on GitLab. I'll do that."

hannes-ucsc commented 1 year ago

Could not find any examples of log entries for failed migrations in migrations.log. We'll have to wait for a migration to actually fail to see what we could search for. For now, we'll stick with the current manual process of looking for failed migrations.

dsotirho-ucsc commented 1 year ago

@hannes-ucsc: "I'll be continuing to look for failed migrations as part of the GitLab update PRs. When I find a failure I'll will put this issue back to Triage for further investigation."