joelverhagen opened this issue 3 weeks ago
The `LoadSymbolPackageArchive` driver was stuck in the `Working` state.
It is blocked waiting for the enqueue table scan to complete. https://github.com/NuGet/Insights/blob/56ee2eeccfcc93b93fc1ee8b6282b6aed625bef6/src/Worker.Logic/CatalogScan/CatalogIndexScanMessageProcessor.cs#L456
All of the leaf scans are complete, but 2 of the enqueue task states are left over. This should not be possible. It indicates that the table scan enqueue succeeded (because all of the leaf scans are done), but the enqueue task states were somehow recreated or never deleted.
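For context, the check in the linked code keeps the index scan in the `Working` state until no task state entities remain for the scan. A minimal sketch of that kind of completion check, assuming illustrative names rather than the actual NuGet.Insights types:

```csharp
using System.Threading.Tasks;
using Azure.Data.Tables;

public static class TableScanCompletion
{
    // Sketch: the index scan stays in the Working state as long as any task
    // state entity remains under the scan's partition key. Dangling entities
    // therefore block the driver forever.
    public static async Task<bool> AreTaskStatesCompleteAsync(
        TableClient taskStateTable,
        string partitionKey)
    {
        await foreach (TableEntity entity in taskStateTable.QueryAsync<TableEntity>(
            e => e.PartitionKey == partitionKey,
            maxPerPage: 1))
        {
            return false; // at least one task state is left, keep waiting
        }

        return true; // no task states remain, the scan can advance
    }
}
```

With a check shaped like this, the two leftover entities below are enough to pin the scan in `Working` indefinitely.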
The parameters for the two leftover task state records are here:
    {
      "n": "tps.pq",
      "v": 1,
      "d": {
        "sf": 1,
        "ss": 1,
        "d": 59,
        "p": "08584776361820073217-gaxyddyi2cqubp5af23sl7m4ei-r100-B894-",
        "m": "08584776361820073217-gaxyddyi2cqubp5af23sl7m4ei-r100-B894-czlovek.files",
        "u": "08584776361820073217-gaxyddyi2cqubp5af23sl7m4ei-r100-B894-\uFFFF"
      }
    }
    {
      "n": "tps.pkq",
      "v": 1,
      "d": {
        "sf": 1,
        "ss": 1,
        "d": 59,
        "p": "08584776361820073217-gaxyddyi2cqubp5af23sl7m4ei-r100-B894-czlovek.files",
        "r": "1.0.9"
      }
    }
Task state entities (2 left dangling):
No leaf scans left:
The task state entities are added before their corresponding messages (which is desired), so perhaps a transient table or queue error, or general slowness, caused the problem. https://github.com/NuGet/Insights/blob/56ee2eeccfcc93b93fc1ee8b6282b6aed625bef6/src/Worker.Logic/TableScan/TableScanMessageProcessor.cs#L233-L238
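As a rough illustration of the window involved (the type and method names here are hypothetical, not the actual `TableScanMessageProcessor` code): the task state entity is written first, and only then is the corresponding message enqueued, so a transient failure or a partial retry between the two steps can leave an entity that no message will ever delete.

```csharp
using System.Threading.Tasks;
using Azure.Data.Tables;
using Azure.Storage.Queues;

public class TaskStateEnqueuer
{
    private readonly TableClient _taskStateTable;
    private readonly QueueClient _queue;

    public TaskStateEnqueuer(TableClient taskStateTable, QueueClient queue)
    {
        _taskStateTable = taskStateTable;
        _queue = queue;
    }

    public async Task AddAndEnqueueAsync(string partitionKey, string rowKey, string messageBody)
    {
        // Step 1: persist the task state so the index scan sees pending work.
        await _taskStateTable.UpsertEntityAsync(new TableEntity(partitionKey, rowKey));

        // Step 2: enqueue the message whose processing will eventually delete
        // the task state. If this step fails transiently and the retry path
        // re-adds task states without re-sending every message (or vice
        // versa), the entity from step 1 is orphaned and the scan never
        // completes.
        await _queue.SendMessageAsync(messageBody);
    }
}
```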
The only hint I could find in the existing telemetry is that some table batch operations failed with a server timeout. The storage no-op system kicked in, but it seemed to be related to other buckets.
Next step: enhance telemetry around message enqueue and table entity changes so we can better understand what happened.
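One possible shape for that, as a sketch (the helper and logging here are hypothetical, not existing NuGet.Insights code): record the client request ID returned by each table mutation, so that a dangling entity's `Timestamp` and `Client request ID` can later be correlated with the operation that wrote it.

```csharp
using System.Threading.Tasks;
using Azure;
using Azure.Data.Tables;
using Microsoft.Extensions.Logging;

public static class TaskStateTelemetry
{
    // Upsert a task state entity and log the client request ID from the
    // response, so a leftover entity can be traced back to the exact request.
    public static async Task<Response> UpsertWithTelemetryAsync(
        TableClient table,
        TableEntity entity,
        ILogger logger)
    {
        Response response = await table.UpsertEntityAsync(entity);
        logger.LogInformation(
            "Upserted task state {PartitionKey}/{RowKey} (client request ID: {ClientRequestId})",
            entity.PartitionKey,
            entity.RowKey,
            response.ClientRequestId);
        return response;
    }
}
```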
The `LoadSymbolPackageArchive` driver is stuck again in the `Working` state.
Similar to last time, there are no leaf scans left, but some table scan task state records remain.
Table name: taskstatedqcwugivcvqu5pbqns44odgk3m
Index scan started: 2024-09-04T10:39:01.5008431Z
Index scan timestamp (last updated): 2024-09-04T10:59:33.276884Z
First moved to the `Working` state: 2024-09-04T10:59:33.2711682Z
PK: 08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-EnqueueCatalogLeafScans
RK: step-gkxlxqhuij4mqsfgmkz55lro4qwnvqkuknoro4lb5jgsm6lhcxbq
Timestamp: 2024-09-04T11:00:21.9243814Z
Client request ID: 894af93f-7a44-4ae1-83d7-9db87ea4b382
    {
      "n": "tps.pq",
      "v": 1,
      "d": {
        "sf": 1,
        "ss": 1,
        "d": 59,
        "p": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B661-",
        "m": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B661-definitelytyped.net",
        "u": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B661-\uFFFF"
      }
    }
`bucketedpackages` table:
`symbolpackagearchives` table:
This page was processed because `definux.emeraude.presentation 1.0.10-beta~00` in the `symbolpackagearchives` table was updated at 2024-09-04T11:13:47.5050676Z (after the catalog index scan began).
Similarly, `definux.emeraude.resources 1.2.0.13~00` was updated at 2024-09-04T11:13:47.6952307Z.
PK: 08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-EnqueueCatalogLeafScans
RK: step-xvnazptqsva5fiv2cxm6oiq6vfqs2u6w5umjyuksclh5fes2epiq
Timestamp: 2024-09-04T11:00:22.2452016Z
Client request ID: fba1936c-ed58-4aa9-9605-f2419e07b753
    {
      "n": "tps.pq",
      "v": 1,
      "d": {
        "sf": 1,
        "ss": 1,
        "d": 59,
        "p": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B658-",
        "m": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B658-devrating.sqliteclient",
        "u": "08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B658-\uFFFF"
      }
    }
`bucketedpackages` table:
`symbolpackagearchives` table:
`devteam.testtool.contracts 1.0.108-beta~00` was updated at 2024-09-04T11:13:53.5514588Z.
`dexih.connections.postgressql 1.1.957~00` was updated at 2024-09-04T11:13:53.7346526Z.
Only two records were updated after the catalog index scan began.
`Azure.Data.Tables.TableTransactionFailedException` while cleaning up leaf items
Timestamp: 2024-09-04T11:05:18.304516Z
Invocation ID: daf4c0d0-db3b-4f30-ba43-4d986e901e17
Formatted message: [transient] An exception was thrown while processing a batch of 30 messages with schema cls.
Operation ID: 71e7c38c95cc718539c41908d2bfb3c1
This happened here, while cleaning up completed leaf scan records: https://github.com/NuGet/Insights/blob/ccd4da140bff99e24bccf5d105374bfb37c8a635/src/Worker.Logic/CatalogScan/CatalogLeafScanMessageProcessor.cs#L333
`Azure.Data.Tables.TableTransactionFailedException` somewhere in `CatalogScanStorageService`. The call stack is truncated... not sure why.
Timestamp: 2024-09-04T11:05:18.3031638Z
Invocation ID: daf4c0d0-db3b-4f30-ba43-4d986e901e17
Operation ID: 71e7c38c95cc718539c41908d2bfb3c1
Formatted message: [transient] Batch failed due to HTTP 500, with storage suffix 'dqcwugivcvqu5pbqns44odgk3m', first partition key '08584761605445667694-dqcwugivcvqu5pbqns44odgk3m-r100-B636-boost_date_time-vc140', first row key '1.59.0'.
This is a duplicate of the previous.
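For reference, the cleanup in question deletes completed leaf scan entities with a table transaction, so a single HTTP 500 fails the whole batch atomically. A hedged sketch of that pattern (illustrative names, not the exact `CatalogLeafScanMessageProcessor` code):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Data.Tables;
using Microsoft.Extensions.Logging;

public static class LeafScanCleanup
{
    // Delete a batch of completed leaf scan entities in one table transaction.
    // All actions in a transaction must share the same partition key, and the
    // transaction is atomic: an HTTP 500 means none of the deletes applied.
    public static async Task DeleteCompletedAsync(
        TableClient table,
        IReadOnlyList<TableEntity> completed,
        ILogger logger)
    {
        List<TableTransactionAction> actions = completed
            .Select(e => new TableTransactionAction(TableTransactionActionType.Delete, e, e.ETag))
            .ToList();

        try
        {
            await table.SubmitTransactionAsync(actions);
        }
        catch (TableTransactionFailedException ex) when (ex.Status >= 500)
        {
            // Transient server failure, like the HTTP 500 above. The queue
            // message will be retried and must attempt all deletes again.
            logger.LogWarning(
                ex,
                "Transient batch failure at action index {Index}; the batch will be retried.",
                ex.FailedTransactionActionIndex);
            throw;
        }
    }
}
```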
The "timed reprocess" system sometimes gets stuck. For this reason, it's not ready to be enabled by default.