PRX / cms.prx.org

CMS API for PRX
https://cms.prx.org
GNU Affero General Public License v3.0

Failed jobs #605

Open cavis opened 2 years ago

cavis commented 2 years ago

Saw some failing jobs yesterday that caused a bunch of "SQS messages too old" alarms. Because the workers were dying, the messages were hidden for ~1 hour, 20 times in a row.
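For a rough sense of why the age alarm fired, here is a back-of-the-envelope sketch. It assumes the "hidden" period behaves like a per-receive visibility timeout and that each of the 20 attempts used the full hour; the constant and method names are made up for illustration.

```ruby
# Figures taken from the report: roughly one hour hidden per failed attempt,
# repeated 20 times. Both are approximations, not measured values.
HIDDEN_SECONDS_PER_ATTEMPT = 60 * 60
ATTEMPTS = 20

# Hypothetical helper: approximate age (in hours) of a message that has sat
# through every hidden period without being processed successfully.
def approximate_message_age_hours(hidden_seconds, attempts)
  (hidden_seconds * attempts) / 3600.0
end

puts approximate_message_age_hours(HIDDEN_SECONDS_PER_ATTEMPT, ATTEMPTS)
```

Under those assumptions a single poisoned message lingers for most of a day, which fits a "messages too old" alarm going off repeatedly.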

I think these were for a story that got deleted before the image callback and search indexer + deindexer jobs came in. I thought we were handling those GID failures everywhere, but maybe not. Worth looking into, because those alarms are noisy!

From the image-callback queue:

{
    "Time": "2022-04-13T18:05:26.848Z",
    "Timestamp": 1649873126.848,
    "JobResult":
    {
        "Job":
        {
            "Id": "gid://prx/StoryImage/784584"
        },
        "Execution":
        {
            "Id": "arn:aws:states:us-east-1:561178107736:execution:StateMachine-xeT5hO7gtTy9:2d85a4fd-8a5b-4301-ba5d-a3b38dd9e1c3"
        },
        "FailedTasks":
        [],
        "State": "DONE",
        "TaskResults":
        [
            {
                "Task": "Copy",
                "Mode": "AWS/S3",
                "BucketName": "production.mediajoint.prx.org",
                "ObjectKey": "public/piece_images/784584/LWoS_season02_Cover_v4_031522.png",
                "Time": "2022-04-13T18:05:23.726Z",
                "Timestamp": 1649873123.726
            },
            {
                "Task": "Inspect",
                "Inspection":
                {
                    "Size": 12330765,
                    "Extension": "png",
                    "MIME": "image/png",
                    "Image":
                    {
                        "Width": 2000,
                        "Height": 2000,
                        "Format": "png"
                    }
                }
            },
            {
                "Task": "Image",
                "BucketName": "production.mediajoint.prx.org",
                "ObjectKey": "public/piece_images/784584/LWoS_season02_Cover_v4_031522_square.png",
                "Time": "2022-04-13T18:05:25.885Z",
                "Timestamp": 1649873125.885
            },
            {
                "Task": "Image",
                "BucketName": "production.mediajoint.prx.org",
                "ObjectKey": "public/piece_images/784584/LWoS_season02_Cover_v4_031522_small.png",
                "Time": "2022-04-13T18:05:25.967Z",
                "Timestamp": 1649873125.967
            },
            {
                "Task": "Image",
                "BucketName": "production.mediajoint.prx.org",
                "ObjectKey": "public/piece_images/784584/LWoS_season02_Cover_v4_031522_medium.png",
                "Time": "2022-04-13T18:05:26.392Z",
                "Timestamp": 1649873126.392
            }
        ]
    }
}
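One way a callback like this could be made tolerant of a deleted record is sketched below in plain Ruby. This is not the CMS's actual worker: `parse_gid`, `handle_image_callback`, the `records` hash standing in for the database, and the `status` values are all hypothetical.

```ruby
require "json"

# Matches GIDs shaped like the one in the message: gid://app/Model/id
GID_PATTERN = %r{\Agid://(?<app>[^/]+)/(?<model>[^/]+)/(?<id>\d+)\z}

# Hypothetical helper: split a GID string into its parts, or return nil.
def parse_gid(gid)
  match = GID_PATTERN.match(gid.to_s)
  return nil unless match

  { app: match[:app], model: match[:model], id: match[:id].to_i }
end

# Hypothetical handler. `records` is a nested hash ({ "Model" => { id => row } })
# used here in place of a real database lookup.
def handle_image_callback(raw_json, records)
  payload = JSON.parse(raw_json)
  gid = parse_gid(payload.dig("JobResult", "Job", "Id"))
  return :invalid if gid.nil?

  record = records.dig(gid[:model], gid[:id])
  # The record was deleted before the callback arrived: acknowledge the
  # message instead of raising, so it is not retried.
  return :skipped if record.nil?

  done = payload.dig("JobResult", "State") == "DONE"
  record[:status] = done ? "complete" : "failed"
  :updated
end
```

With an empty `records` hash, the message above would return `:skipped` rather than crash the worker, so the message could be deleted from the queue on the first attempt.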

And 2 in the search indexer queue:

{
    "job_class": "SearchIndexerJob",
    "job_id": "c104191c-8439-42b2-8d91-e8fe639c56e6",
    "queue_name": "dc51b3fd_prod_cms_search_indexer",
    "arguments":
    [
        {
            "_aj_globalid": "gid://prx/Story/416310"
        }
    ],
    "locale": "en"
}
{
    "job_class": "SearchDeindexerJob",
    "job_id": "1016a8dc-5337-4baf-82c0-bdc8cd79a478",
    "queue_name": "dc51b3fd_prod_cms_search_indexer",
    "arguments":
    [
        "Story",
        416310
    ],
    "locale": "en"
}
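The two payloads differ in one way that matters here: the indexer job carries a `_aj_globalid` argument, which has to be resolved to a record before the job can run, while the deindexer job carries a plain class name and id. In Rails, a GlobalID argument that can no longer be found raises `ActiveJob::DeserializationError`, and a job class can opt out of retries with `discard_on ActiveJob::DeserializationError`. The plain-Ruby sketch below imitates that behavior without Rails; `MissingRecordError`, `locate`, `deserialize_arguments`, `run_job`, and the `records` hash are hypothetical stand-ins, not the CMS's code.

```ruby
# Stand-in for the error raised when a GlobalID no longer resolves.
class MissingRecordError < StandardError; end

GID_FORMAT = %r{\Agid://[^/]+/(?<model>[^/]+)/(?<id>\d+)\z}

# Hypothetical lookup against a nested hash ({ "Model" => { id => row } }).
def locate(gid, records)
  match = GID_FORMAT.match(gid)
  record = match && records.dig(match[:model], match[:id].to_i)
  raise MissingRecordError, "could not find #{gid}" if record.nil?

  record
end

# Resolve GlobalID-style arguments; pass everything else through unchanged.
def deserialize_arguments(arguments, records)
  arguments.map do |arg|
    if arg.is_a?(Hash) && arg.key?("_aj_globalid")
      locate(arg["_aj_globalid"], records)
    else
      arg
    end
  end
end

# Run the job body (the block) with resolved arguments. A missing record
# discards the job instead of letting the error kill the worker.
def run_job(payload, records)
  args = deserialize_arguments(payload["arguments"], records)
  yield(*args)
  :performed
rescue MissingRecordError
  :discarded
end
```

In this sketch the indexer payload for a deleted story comes back as `:discarded`, while the deindexer payload reaches its body because `"Story"` and `416310` need no lookup. Whether the real deindexer then fails inside its own body is a separate question this sketch does not answer.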