NuGet / Insights

Gather insights about public NuGet.org package data
Apache License 2.0
24 stars 7 forks source link

Timed reprocess system sometimes gets stuck #108

Open joelverhagen opened 3 weeks ago

joelverhagen commented 3 weeks ago

The "timed reprocess" system sometimes gets stuck. For this reason, it's not ready to be enabled by default.

joelverhagen commented 3 weeks ago

2024-08-18

The LoadSymbolPackageArchive driver was stuck in the Working state. image

It is blocked waiting for enqueue table scan to complete. https://github.com/NuGet/Insights/blob/56ee2eeccfcc93b93fc1ee8b6282b6aed625bef6/src/Worker.Logic/CatalogScan/CatalogIndexScanMessageProcessor.cs#L456

All of the leaf scans are complete, but 2 of the enqueue task states are left over. This should not be possible. This indicates that the table scan enqueue succeeded (because all of the leaf scans are done) but somehow the enqueue task states got recreated or failed to be deleted somehow.

The parameters for the leftover task state record are here:

{
    "n": "tps.pq",
    "v": 1,
    "d": {
        "sf": 1,
        "ss": 1,
        "d": 59,
        "p": "08584776361820073217-gaxyddyi2cqubp5af23sl7m4ei-r100-B894-",
        "m": "08584776361820073217-gaxyddyi2cqubp5af23sl7m4ei-r100-B894-czlovek.files",
        "u": "08584776361820073217-gaxyddyi2cqubp5af23sl7m4ei-r100-B894-\uFFFF"
    }
}
{
    "n": "tps.pkq",
    "v": 1,
    "d": {
        "sf": 1,
        "ss": 1,
        "d": 59,
        "p": "08584776361820073217-gaxyddyi2cqubp5af23sl7m4ei-r100-B894-czlovek.files",
        "r": "1.0.9"
    }
}

Task state entities (2 left dangling): image

No leaf scans left: image

The task state entities are added before their corresponding message (which is desired) by perhaps a transient table or queue error, or slowness cause the problem. https://github.com/NuGet/Insights/blob/56ee2eeccfcc93b93fc1ee8b6282b6aed625bef6/src/Worker.Logic/TableScan/TableScanMessageProcessor.cs#L233-L238

The only hint I could from the existing telemetry is that some table batch operations failed with a server timeout. The storage no-op system kicked in, but seemed to be related to other buckets.

Next step: enhance telemetry around message enqueue and table entity changes so we can better understand what happened.

joelverhagen commented 1 week ago

2024-09-04

The LoadSymbolPackageArchive driver is stuck again in the Working state. image

Similar to last time, there are no leaf scans left but some table scan task records. Table name: taskstatedqcwugivcvqu5pbqns44odgk3m image

Index scan started: 2024-09-04T10:39:01.5008431Z Index scan timestamp (last updated): 2024-09-04T10:59:33.276884Z First moved to the Working state: 2024-09-04T10:59:33.2711682Z


PK: 08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-EnqueueCatalogLeafScans RK: step-gkxlxqhuij4mqsfgmkz55lro4qwnvqkuknoro4lb5jgsm6lhcxbq Timestamp: 2024-09-04T11:00:21.9243814Z Client request ID: 894af93f-7a44-4ae1-83d7-9db87ea4b382

{
    "n": "tps.pq",
    "v": 1,
    "d": {
        "sf": 1,
        "ss": 1,
        "d": 59,
        "p": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B661-",
        "m": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B661-definitelytyped.net",
        "u": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B661-\uFFFF"
    }
}

bucketedpackages table:

image

symbolpackagearchives table: image

This page was processed, because: definux.emeraude.presentation 1.0.10-beta~00 in the symbolpackagearchives table was updated at 2024-09-04T11:13:47.5050676Z (after the catalog index scan began)

Similarly definux.emeraude.resources 1.2.0.13~00 was updated at 2024-09-04T11:13:47.6952307Z


PK: 08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-EnqueueCatalogLeafScans RK: step-xvnazptqsva5fiv2cxm6oiq6vfqs2u6w5umjyuksclh5fes2epiq Timestamp: 2024-09-04T11:00:22.2452016Z Client request ID: fba1936c-ed58-4aa9-9605-f2419e07b753

{
    "n": "tps.pq",
    "v": 1,
    "d": {
        "sf": 1,
        "ss": 1,
        "d": 59,
        "p": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B658-",
        "m": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B658-devrating.sqliteclient",
        "u": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B658-\uFFFF"
    }
}

bucketedpackages table:

image

symbolpackagearchives table: image

devteam.testtool.contracts 1.0.108-beta~00 was updated at 2024-09-04T11:13:53.5514588Z

dexih.connections.postgressql 1.1.957~00 was updated at 2024-09-04T11:13:53.7346526Z

Exceptions

Only two records after the catalog index scan began.

Azure.Data.Tables.TableTransactionFailedException while cleaning up leaf items

image

Timestamp: 2024-09-04T11:05:18.304516Z Invocation ID: daf4c0d0-db3b-4f30-ba43-4d986e901e17 Formatted message: [transient] An exception was thrown while processing a batch of 30 messages with schema cls. Operation ID: 71e7c38c95cc718539c41908d2bfb3c1

This happened inside here (while cleaning up completed leaf scan records): https://github.com/NuGet/Insights/blob/ccd4da140bff99e24bccf5d105374bfb37c8a635/src/Worker.Logic/CatalogScan/CatalogLeafScanMessageProcessor.cs#L333

Azure.Data.Tables.TableTransactionFailedException somewhere in CatalogScanStorageService

The call stack is truncated... not sure why. image

Timestamp: 2024-09-04T11:05:18.3031638Z Invocation ID: daf4c0d0-db3b-4f30-ba43-4d986e901e17 Operation ID: 71e7c38c95cc718539c41908d2bfb3c1 FormattedMessage: [transient] Batch failed due to HTTP 500, with storage suffix 'dqcwugivcvqu5pbqns44odgk3m', first partition key '08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B636-boost_date_time-vc140', first row key '1.59.0'.

This is a duplicate of the previous.