NuGet / Insights

Gather insights about public NuGet.org package data
Apache License 2.0
24 stars 7 forks source link

Mitigate table scan race condition when only queue messages are left #104

Closed joelverhagen closed 9 months ago

joelverhagen commented 9 months ago

Currently, the table scan service has a race condition.

If all of the exploration steps of the table scan are complete but the driver's work is not complete, the caller who started the table scan may incorrectly think that the table scan is done when it's really not.

This can happen when ITableScanDriver<T>.ProcessEntitySegmentAsync(...) has been called for all rows in the table (meaning all of the task state entities used to track the table scan work) are deleted but there are still a bunch of queue messages in the work or expand queue to complete.

The sample TableCopyDriver implementation has this problem because the driver just enqueues "copy" messages and the caller of the table scan service has no idea when all of these "copy" messages are done. To fix this, I added a check in the TableCopyDriverIntegrationTest to see if there are any messages in either the work or expand queue with a peek. This eliminated the test flakiness. But this doesn't work in production because there may be other operations running in parallel that keep the queue crowded with messages preventing the caller from seeing that all the messages are done.

The ProcessBucketRangeDriver implementation has this problem because it both creates catalog leaf scan items and enqueues messages for them. The caller checks that all of the catalog leaf scans are gone but it could potentially make this check after all of the table scan task state entities have been deleted but before any of the catalog leaf scan entities have been created yet, when at a moment when the number of catalog leaf scan entities reaches zero for a moment but there are still queue messages to process.

I don't think the EnqueueCatalogLeafScanDriver has this problem because all of the catalog leaf scans are created before the table scan begins and the caller checks both that the table scan is done and all catalog leaf scans are gone (which will never be true until all of the catalog leaf scans are actually done).

This could be fixed in a couple of ways that I can think of:

  1. Create a dedicated queue for table scan processing. The caller can easily check if that queue is empty. If TableScanMessage<T> messages are put on this dedicated queue, then we can probably do away with the task state entity interaction.
  2. Check the main work and expand queues for "very few" messages per x-ms-approximate-messages-count, or a peek, or even a test dequeue. This one feels hard because the call of the table scan service itself might be a queue message (e.g. a catalog index scan message). The queue will never have x-ms-approximate-messages-count = 0 if the caller is itself a queue message.
  3. Somehow check that there are no other workers processing messages. If no messages are returned by a queue peek and no workers are processing messages from a table scan, then you know the table scan is done.
joelverhagen commented 9 months ago

The problem still remains for table scan service in general, but I resolved it for ProcessBucketRangeDriver by making it into CopyBucketRangeDriver and creating all catalog leaf scans before doing an enqueueing. This is the same approach as the find latest flow.