Investigate: Zillions of new items added by sweep indexer to opinions corpus

mlissner commented 7 months ago

This is a bit surprising. The sweep indexer is running right now and is currently processing opinions (it already did clusters). It added millions of new items:

I'm curious what this is about. Maybe...

... @flooie is adding millions of items to the DB right now and it's a coincidence?
...we're not adding the last bulk data content to Elastic as we add the items to the DB?
...we missed a lot of content while indexing opinions the first time?

I'm not sure, but I thought I better flag it. I'm also quite surprised by the number of items in the index. The index has every opinion and every cluster, right? Is the sum of those two things 24M items?? Seems off?

mlissner commented 7 months ago

I guess I'll also note that this is sort of the sweep indexer working as designed — It's finding stuff! — but that ideally we wouldn't miss all this stuff in the first place (assuming we intend for this content to be in Elastic and it's not some weird fluke).

mlissner commented 7 months ago

Another 10 hours and we're up to 28M items in the opinion index. Hmmm.

mlissner commented 7 months ago

The sweep indexer moved on to dockets. Here's the final chart:

I think this should be opinion count + cluster count, right?

flooie commented 7 months ago

yeah but we aren't doing the bulk import at the moment

mlissner commented 7 months ago

One option off the table!

albertisfu commented 7 months ago

Yes, the numbers definitely seem off.

I've done some testing locally and haven't encountered this problem when checking counts using the API. However, I don't have a Kibana instance set up.

Could you please run the following queries?

GET /opinion_index/_count

{
   "query":{
      "match_all":{

      }
   }
}

GET /opinion_index/_count
{
   "query":{
      "match":{
         "cluster_child":"opinion"
      }
   }
}

GET /opinion_index/_count
{
   "query":{
      "match":{
         "cluster_child":"opinion_cluster"
      }
   }
}

This way, we can confirm the actual document number, and from there, we can discard the indexing process if the numbers are accurate.

mlissner commented 7 months ago

{
  "count": 27944828,
  "_shards": {
    "total": 30,
    "successful": 30,
    "skipped": 0,
    "failed": 0
  }
}

{
  "count": 18921691,
  "_shards": {
    "total": 30,
    "successful": 30,
    "skipped": 0,
    "failed": 0
  }
}

{
  "count": 9023137,
  "_shards": {
    "total": 30,
    "successful": 30,
    "skipped": 0,
    "failed": 0
  }
}

mlissner commented 7 months ago

The opinion count is currently:

In [2]: Opinion.objects.count()
Out[2]: 9720563

mlissner commented 7 months ago

Cluster count is less in the DB too, but that seems like a cleanup issue (maybe the sweep indexer should also look for things to delete?):

In [4]: OpinionCluster.objects.count()
Out[4]: 9019718

albertisfu commented 7 months ago

The origin of this issue was a bug in assigning the document routing_id during the bulk indexing of Opinions by the sweep indexer. Since the routing_id was incorrect, the document was looked up in the wrong shard, and since it wasn't found there, it was created there, resulting in a duplicate instead of updating the document with the correct routing_id.

This is not impacting search since the duplicates are orphaned child documents that can't be displayed on the frontend without a parent, but we need to clean them up. I tried to use a search script to compare the _routing_id and the cluster_id in an Opinion, so if they differ, it should be removed. However, it's not possible to access the_routing_id within an Elasticsearch script.

So, the alternative is to query documents with a timestamp from the last 72 hours and remove them. Then, we'd need to re-index Opinion Documents that don't exist in the index.

albertisfu commented 7 months ago

Today, we applied the command to remove duplicated opinions using the delete-by-query API.

The first attempt was run around 3:30 PM UTC, but the task failed due to a ConflictError, removing only 3,346,999 documents. The second attempt also failed, this time due to an es_rejected_execution_exception, as the cluster's task queue was full. It removed an additional 1,392,000 documents.

On the monitoring panel, we observed that the deletion of 4,738,999 documents finished around 4 PM UTC. The document count decreased according to this number; however, it took a bit more time for the storage to show an actual decrease.

An hour later, we noticed that search performance had significantly slowed. Even Kibana was unresponsive, and the CPU and memory usage of all the ES nodes reached 100%.

It seems the issue was related to the subsequent process that Elasticsearch performs after document removal, like merging shards to actually free up space.

According to the documentation: Smaller segments are periodically merged into larger ones to keep the index size manageable and to expunge deleted documents. The merge process uses auto-throttling to balance hardware resource usage between merging and other activities, like search.

There is this setting that controls the number of threads per shard used for merging segments: index.merge.scheduler.max_thread_count which defaults to: Math.max(1, Math.min(4, <<node.processors, node.processors>> / 2))

Which documentation refers to this as a good number for SSD disks, which is our case.

The problem might have been caused by running the sweep indexer at the same time, which had an indexing rate of 200-250 docs/sec.

After stopping the sweep_indexer, node resources began to normalize, and search functionality speed up a few minutes later. Therefore, the combination of workloads appeared to be the issue.

Things that we can consider now:

Before restarting the sweep_indexer, we should ensure the nodes are in good shape and the maintenance process has finished.
Before continuing with document removal, we need to tweak the request to add throttling. We can slow down the process to complete it over a week or so.
Once we have Prometheus monitoring the ES cluster, I believe we'll better understand the cluster's limits and use this information to make informed decisions before running maintenance or indexing tasks that could potentially impact the cluster's overall availability.

mlissner commented 7 months ago

@albertisfu or @blancoramiro, can one of you provide an update here? IIRC, we have done the deletion, but it was overinclusive, so we need to add another script to add the data back in, right?

albertisfu commented 7 months ago

Sure, based on the rate at which we're deleting items right now, I estimate it will be finished by Friday. Then, we can run the next script to index missing data. I think @blancoramiro could provide us a better update on the progress of the deletion task.

mlissner commented 7 months ago

Oh, right, silly me. I forgot we throttled this to take a week. I'll chill out!

blancoramiro commented 7 months ago

Hey guys. That is correct the task's still running, ~55% was done "total": 4873278 "deleted": 2716969


{  "completed": false,
  "task": {
    "node": "nmLgdM-aSDCkZNmyVoOodg",
    "id": 666370987,
    "type": "transport",
    "action": "indices:data/write/delete/byquery",
    "status": {
      "total": 4873278,
      "updated": 0,
      "created": 0,
      "deleted": 2716969,
      "batches": 2717,
      "version_conflicts": 31,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 339499996,
      "requests_per_second": 8,
      "throttled_until_millis": 71972
    },
    "description": "delete-by-query [opinion_index]",
    "start_time_in_millis": 1711148456329,
    "running_time_in_nanos": 341812895244561,
    "cancellable": true,
    "cancelled": false,
    "headers": {}
  }
}

mlissner commented 7 months ago

The latest document count:

Looks about like I'd expect!

albertisfu commented 6 months ago

Regarding the possibility of scheduling segment merges in elasticsearch to avoid disrupting the cluster stability I read more about the topic and here are my findings.

The documentation about Merges in ES only mentions the index.merge.scheduler.max_thread_count setting: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-merge.html

Which can be used to set a max number of threads to be used for merges and defaults to the formula: Math.max(1, Math.min(4, <<node.processors, node.processors>> / 2))

Which means for each shard that:

At least 1 thread is a used.
No more than 4 threads are used.
If the node has fewer than 8 processors, the number of threads is equal to half of the processors.

This is a setting at the index level, so the throttling does not prevents using more threads than max_thread_count if merging is happening in two or more indices at the same time.

However there are additional settings that can be tweaked as can be seen directly in the MergePolicyConfig code:

index.merge.policy.expunge_deletes_allowed

When forceMergeDeletes is called, we only merge away a segment if its delete percentage is over this threshold. Default is 10.

index.merge.policy.floor_segment

Segments smaller than this are "rounded up" to this size, i.e. treated as equal (floor) size for merge selection. This is to prevent frequent flushing of tiny segments, thus preventing a long tail in the index. Default is 2mb.

index.merge.policy.max_merge_at_once

Maximum number of segments to be merged at a time during "normal" merging. Default is 10.

index.merge.policy.max_merged_segment

Maximum sized segment to produce during normal merging (not explicit force merge). This setting is approximate: the estimate of the merged segment size is made by summing sizes of to-be-merged segments (compensating for percent deleted docs). Default is 5gb.

index.merge.policy.segments_per_tier

Sets the allowed number of segments per tier. Smaller values mean more merging but fewer segments. Default is 10. Note, this value needs to be >= than the max_merge_at_once otherwise you'll force too many merges to occur.

index.merge.policy.deletes_pct_allowed

Controls the maximum percentage of deleted documents that is tolerated in the index. Lower values make the index more space efficient at the expense of increased CPU and I/O activity. Values must be between 5 and 50. Default value is 20

When merges occur?

Here is a pretty good article that illustrates how merging segments works: https://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

It actually describes how merging segments looks using TieredMergePolicy which is the policy used by Elasticsearch nowadays.

In brief, a segment is the basic data structure that allows documents to be searchable and they're inmutable. When indexing a new document it’s not immediately written into a segment. Instead a buffer of documents are hold into memory until the buffer is full or when a refresh occurs (whichever it happens first). At that point, the set of documents is written to disk as a new segment. The issue with segments is that they fragment the index into several small parts, negatively affecting search performance since every segment in the index must be queried to match documents. This is where the process of merging segments comes into play to solve the fragmentation issue, improve search performance, and free disk space from deleted documents.

The merge segment process happens automatically when the number, size, and composition of segments meet the TieredMergePolicy and the current settings, optimizing the index segments.

There is no setting to schedule merge segments to occur at a specific time, as that could be detrimental to the index. Accumulating too many small segments would make the search process inefficient, and disk usage would grow significantly. At the same time, the accumulation of segments would make the merge process costly and slow when it finally occurs.

However the settings described above allow customization of how often merges occur, which can help achieve the right balance of merges based on the cluster and index characteristics. These settings should only be adjusted after understanding how the merging segments are currently performing in the cluster and considering our requirements regarding indexing rate, index size, number of shards, search rate, etc.

The problem doesn't seem to be with merges per se, but with how and when they occur. The issue specifically arose when running the sweep indexer and removing documents in operations happening in two different indices.

That’s why throttling the deletes didn't help because merging segments related to document removal is controlled by deletes_pct_allowed, which represents the percentage of deleted documents tolerated in the index before performing merges to free up disk space. That explain why throttling didn't help to prevent the cluster disruption since even thought deletion process was slow merge process didn't occur until it reached the threshold of 20% of accumulated documents deleted in the Opinion Index.

A contributing factor was that the merge process occurred simultaneously with regular merges in the RECAP index, where merges were frequent due to the sweep indexer rapidly adding/updating documents, leading to constant merges. Updating a document results in two operations: a deletion of the old document and an addition of the new document version. Therefore, even if the task does not explicitly delete documents, merges that actually remove documents from the index occur.

So the problem was two merge process were occurring simultaneously, each using between 1 to 4 threads per shard, leading to a doubling of the allowed threads set by max_thread_count, as this setting functions at the index level.

To avoid the issue, we could:

Avoid performing merges on two indices simultaneously, as operations won't be throttled at the node level but will use more threads than set in max_thread_count. This implies that the sweep indexer, which seems to trigger many merge operations due to a high indexing rate, should not be running. Regarding indexing signals, given their slower indexing rate, related merges are less likely to occur frequently or simultaneously with merges following a deletion. However, resource monitoring would be necessary to determine if the cluster can handle both processes simultaneously. To be entirely sure, we could also disable indexing signals while the deletion process is underway.
Consider adjusting the deletes_pct_allowed setting. If opting for a slower deletion process, reducing the deletes_pct_allowed threshold could be beneficial, as it is currently possible to set it from 5% to 50%, with the default at 20%. Lowering this setting would result in merges occurring more frequently, rather than waiting for a larger merge upon reaching 20% of documents deleted in the index. Alternatively, increasing deletes_pct_allowed to a higher value, such as 50%, can be a good option if we want to remove documents faster, enabling the deletion of 50% of documents in a single query without throttling, then waiting for merges to occur before processing the remaining 50%. The best combination for the cluster will depend on its configuration and hardware characteristics; thus, testing and monitoring may be required. We should also consider the overhead introduced by merges due to write amplification. Here we we can see more details and numbers related to different percentages can be found here: https://github.com/apache/lucene/issues/11761#issuecomment-1251264246

Refreshing:

Since refreshing controls how often the document changes buffer is written to disk, creating a new segment. The refresh interval can significantly impact how often merges occur. By default, refreshing happens every second, so once the new segments in the index meet the merge policy based on current settings, merges occur. So once we have good resource monitoring and better visibility into merge stats, we could analyze whether delaying refreshing could be beneficial. This would mean fewer, but larger, merges, potentially finding a good balance that frees resources and improves search speed.

Force merge and large segments

On a index that is not longer receiving writes is possible to use the Force merge API to optimize the index. On index dynamic index that receives writes this is not recommended since the force merge API likely is going to produce segments bigger than max_merged_segment (default 5GB) so that normal merges will ignore these segments and deleted documents on there never will be expunged unless max_merged_segment is increased but it could lead to expensive merges.

Number of shards

Reading about this topic I found this discussion: https://discuss.elastic.co/t/merge-policy-tuning-hint/10070/2

Where a user was dealing with merge issues due to overhead in their cluster resources. The user had around 30 shards per index, totaling approximately 250 shards per node. It was recommended to reduce the number of shards, which would decrease the overhead during merges. While our current shard count isn't as excessive per node, this is consideration could be useful in future strategies to enhance cluster performance.

Reducing the number of shards can be done by using the Shrink API: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-shrink-index.html#indices-shrink-index API

mlissner commented 6 months ago

So the problem was two merge process were occurring simultaneously, each using between 1 to 4 threads per shard, leading to a doubling of the allowed threads set by max_thread_count, as this setting functions at the index level.

:boom:

Got it, this is helpful analysis. So we have two options we can do separately or together:

Don't do the sweep (and possibly the signals) while trying to delete stuff.
Tweak the deletes_pct_allowed value to more aggressively merge.

I kind of feel like I'm inclined to try the deletion again with sweep disabled and signals doing their thing. Disabling signals seems painful, but the sweep is already off.

We could also tweak the variable if that's easy. I guess it'd help.

Having done so much analysis, what's your suggestion, Alberto?

albertisfu commented 6 months ago

Yes, I agree we can proceed to remove the remaining duplicated Opinions now that the sweep indexer is disabled.

According to the last count, we have 19.2M documents in the index. We still need to remove 937,368 Opinions, according to the document count when we canceled the removal task last time.

That means the remaining documents to delete constitute around 5% of the index, so even if they are removed since the current deletes_pct_allowed is 20%, it is probable that a merge to expunge removed documents won't be scheduled soon.

To confirm that @blancoramiro helped me run the _stats command:

GET opinion_index/_stats

And it showed that the current number of deleted documents in the Opinion Index is:


"primaries": {
      "docs": {
        "count": 19174205,
        "deleted": 3061774
      }

That's a percentage of 15.9%.

So, after we remove the remaining 937,368, that percentage will increase to 20.85%, which is slightly greater than the default deletes_pct_allowed of 20%. Therefore, we expect that a merge process will be triggered after reaching 20%. Otherwise, if we don't see the merge triggered, it might be required to reduce the deletes_pct_allowed a bit. We can determine if the merge process has already happened by inspecting the index disk usage (which should be reduced) and from the _stats API.

Besides this deletion process, we still have the issue that, some other day in the future, the percentage of deletions is again reached due to document updates. Considering the sweep indexer will always be running, it is possible that we can be affected by merges from two different indices happening at the same time.

So, to solve the problem in the long term, we should ensure that when a merge that expunges deletes happens alongside other normal merges in a different index, resources are not overwhelmed.

We could just tweak max_thread_count to use only half of the current threads. For instance, if our nodes have 8 or more processors, that means the current number of threads used during a merge process in an index is 4, so we can set it to 2 or 1 on the Opinion index. We could also do this better by monitoring the _stats API to predict when we're close to reaching deletes_pct_allowed and only around that time, we can reduce max_thread_count in the index so the merge that expunge deletes (which are the heaviest ones) uses fewer resources while normal merges and the ones triggered by the sweep indexer continues running normally.

So, for now, it seems that the next step is to remove the remaining Opinions:

manage.py cl_remove_content_from_es --action opinions-removal --start-date 2024-03-15 --end-date 2024-03-18 --requests-per-second 150

If we use a rate of 150 documents per second, it'll take around 2 hours to complete the task for the remaining ~937,368documents. After that, we expect the merge process to be triggered and the disk space to be freed up.

mlissner commented 6 months ago

Sounds great. If you and Ramiro want to work together to tweak the settings too, as you outline, that makes sense to me. Do we want to lower the deletes_pct_allowed to maybe 10% instead, so that when the process starts (using fewer threads), it can complete more quickly? 20% seems like a really high value, no?

albertisfu commented 6 months ago

Sure, here are the settings we need to tweak:

We should first adjust the max_thread_count so that if a merge is triggered immediately after we reduce the deletes_pct_allowed, it uses the new max_thread_count.

@blancoramiro, could we please confirm the number of processors the master-data nodes have before proceeding? If they have 8 or more processors, that means the current merge process is using 4 threads, so we can reduce them to half. If the nodes have fewer than eight processors, maybe we should reduce them to 1.

PUT /opinion_index/_settings
{
  "index": {
    "merge": {
      "scheduler": {
        "max_thread_count": 2
      }
    }
  }
}

To reduce deletes_pct_allowed to 10% we can do:

PUT /opinion_index/_settings
{
  "index": {
    "merge": {
      "policy": {
        "deletes_pct_allowed": 10.0
      }
    }
  }
}

To confirm that the settings were correctly applied, we can verify by checking:

GET /opinion_index/_settings

mlissner commented 6 months ago

Ramiro, I put this on your backlog as a "Next Task." No need to interrupt your other work, but when you're done with your current work, this will be...next. :)

blancoramiro commented 6 months ago

Hey @albertisfu cluster has 6 nodes r6g.xlarge, 4 vCPU 32 GIB. So I am guessing 1 thread max would do. We can coordinate together when to do it.

albertisfu commented 6 months ago

@blancoramiro thanks. Yeah based on the default setting Math.max(1, Math.min(4, <<node.processors, node.processors>> / 2)) that means that currently 2 threads are being used on merges.

So yeah we should set it to 1:

PUT /opinion_index/_settings
{
  "index": {
    "merge": {
      "scheduler": {
        "max_thread_count": 1
      }
    }
  }
}

We can coordinate together when to do it.

Sure, let me know when you're available.

albertisfu commented 6 months ago

Here an update of this process.

First @blancoramiro applied the settings we agreed: "max_thread_count": 1 and "deletes_pct_allowed": 10.0

After reducing deletes_pct_allowed to 10%, a merge process began in the opinion_index because the current delete percentage was around 15%.

2 (1)

1 (1)

As confirmed by the screenshots, the merge process recovered disk space, and the number of segments was reduced.

We waited until the merge process finished and the disk space and segment count stabilized.

4 (1) 3 (2)

At this point, we could confirm that the number of deleted documents decreased to 936258:

"docs": {
        "count": 19174817,
        "deleted": 936258
      }

This represents around 5% of the deletes in the index.

Then, we started the removal task to delete the remaining 937368 opinions from the index at a rate of 150 docs/sec.

However, after some minutes, we noticed the CPU spiked, and the search pages slowed down. So, we decided to cancel the removal task.

Ramiro noticed that the scheduled tasks in the cluster were piling up. To allow the cluster to process the enqueued tasks faster, we increased the refresh rate in the recap_vector index to 10 seconds.

This change helped the tasks to process faster, and the cluster normalized.

Afterward, we confirmed that no merge process was run during this period, so the issue was directly related to the number of tasks being too high for the cluster to handle. Even though the sweep indexer was not running, the combination of indexing from signals + search + tasks related to the removal of documents was too much for the cluster.

recap_vectors load during the same period:

So, after the CPU usage in the cluster normalized, we decided to rerun the removal task but at a slower rate (10 docs/sec). At this rate, the task will finish in 24 hours, and so far, the CPU and search latencies seem normal.

Additionally, we decided to set deletes_pct_allowed back to 20%, so once the percentage of removed documents hits 10%, it's not automatically triggered. Tomorrow, once the removal task finishes, we can set this setting to 10% and monitor how the merge process behaves.

mlissner commented 6 months ago

Love this write up and the teamwork. Alberto, you're becoming a true Elastic expert. You're in the guts now!

albertisfu commented 6 months ago

Thanks to @blancoramiro the removal of the duplicated Opinions is now completed.

After confirming the completion of the removal task, we checked the percentage of deletes in the index, which was 9.4%.

To trigger a merge and expunge the deletes from the index, we decreased deletes_pct_allowed to 9%.

The merge was triggered, resulting in recovered disk space and reduced segments: Screenshot at 2024-04-11 19-26-46

After that we set deletes_pct_allowed in the Opinion index back to 10%.

Now, we only need to run a re-index of Opinions to restore those documents that might not have been duplicates and were removed because they were missed in the first indexing.

We can do this using the cl_index_parent_and_child_docs command with the missing flag, so only Opinions that don't exist in the index are indexed.

However, using the missing flag will skip updates that could have been lost due to signal errors. And due to the last indexing triggering duplicates instead of updating those documents, they are outdated.

So, @mlissner, please let me know which option sounds better to you: doing only missing documents now and waiting for the sweep indexer to revisit Opinions (however, we will need to run it again without the missing flag too) or indexing everything now for Opinions.

mlissner commented 6 months ago

Let's get this behind us and be done with this problem. What's the fastest way to be done here and re-enable the sweep indexer in a way we can walk away from it all?

albertisfu commented 6 months ago

Sure, the easiest way would be to use the cl_index_parent_and_child_docs command for indexing only opinions. It would be something like this:

manage.py cl_index_parent_and_child_docs --search-type o --document-type child --pk-offset 0 --chunk-size 50 --queue

We would need to use a different queue than the sweep indexer by setting --queue because I recall the sweep indexer is currently stopped, as it has 0 workers assigned.

mlissner commented 6 months ago

Great. Ramiro, do you want to take a stab at running this on a maintenance pod and making sure it continues to run? I think we'll want to use the etl_tasks queue, and maybe a smaller chunk-size — 20, perhaps?

blancoramiro commented 6 months ago

Sure. So if I got correctly the command will be:

manage.py cl_index_parent_and_child_docs --search-type o --document-type child --pk-offset 0 --chunk-size 20 --queue etl_tasks

albertisfu commented 6 months ago

Yeah the command is correct!

blancoramiro commented 6 months ago

Command is running. Will keep you updated.

albertisfu commented 6 months ago

A status of the Indexing Task:

Initial Opinions count before starting the new indexing: 9,207,568. Initial OpinionCluster count: 9,031,284

Total documents in the index before start the indexing: 18,238,852

Last status of the task: Processed 4990920/9728669, (51%), last PK indexed: 5172455

Thus, it is expected that 9,728,669 opinions will be in the index upon completion of the task.

That accounts for approximately 521,101 new opinions indexed.

Currently with the 51% of advance we're at 18.5M: Screenshot at 2024-04-12 20-12-30

Once the task is finished, we expect to reach ~18.7M opinions indexed, matching the expected numbers.

blancoramiro commented 6 months ago

Hello Alberto. I am just letting you know that the command completed on Saturday:

Successfully indexed 9728667 items from pk 0.
Successfully indexed 9728668 items from pk 0.
Processed 9728669/9728669, (100%), last  PK indexed: 9959827,
Successfully indexed 9728669 items from pk 0.

opinion_cluster:
{ 
  "count": 9031499,
  "_shards": {
    "total": 30,
    "successful": 30,
    "skipped": 0,
    "failed": 0
  }
}
opinion:
{
  "count": 9732356,
  "_shards": {
    "total": 30,
    "successful": 30,
    "skipped": 0,
    "failed": 0
  }
}

Tot: 18763855

albertisfu commented 6 months ago

Thanks @blancoramiro ! The final count of Opinion Documents looks great.

There are approximately 3,687 more opinions than expected, but I believe those are related to #3967, so we can consider this one finished!

mlissner commented 6 months ago

I think this means we can re-enable the sweep indexer, right?

albertisfu commented 6 months ago

@mlissner yeah exactly, we can now re-enable the sweep indexer

mlissner commented 6 months ago

Wonderful. Done, and it's off to the races again with one worker sweeping through.

albertisfu commented 6 months ago

Since the sweep indexer started indexing RECAPDocuments, we noticed that the number of newly indexed documents was growing very fast, even though the document duplication bug due to routing IDs had already been resolved.

image (7)

We performed an investigation to determine if the indexer was still duplicating documents, and after analyzing a small sample of newly indexed RECAPDocuments, we found no duplicates related to them

After re-enabling the sweep indexer to continue indexing RECAPDocuments, we noticed that the document count continued to grow rapidly.

image (6)

At this point, the sweep indexer had processed RECAPDocuments within the following range: Start ID: 21412006 Final ID: 22410942

This means that 998,936 documents were processed, and out of this number, 450K were new documents indexed, which indicated that around 50% of the documents were new, which seemed excessive to continue indexing new documents at this growing rate.

Therefore, we wanted to confirm that there was nothing else wrong that could be leading to an undetected duplication issue. Another theory we wanted to confirm is whether we missed a significant number of documents that belong to the first tens of millions of documents in the database during the initial indexing process.

Here is the process I followed to debug this issue and determine where the problem lay:

First, we obtained a random sample of 5000 RD IDs from the database from the last RD indexed: 22410942 using the following script: get_random_rds_ids.txt

Divided the IDs into 5 bins and confirmed that the sample distribution looked good enough: Screenshot 2024-04-17 at 2 36 23 p m

Then we queried each bin of IDs in the recap_vectors index, and each query looked like this: bin_1_query_rd_ids.txt bin_2_query_rd_ids.json bin_3_query_rd_ids.json bin_4_query_rd_ids.json bin_5_query_rd_ids.json

It returned the results: random_results_1.txt random_results_2.json random_results_3.json random_results_4.json random_results_5.json

From each set of results, I extracted all the matched RD IDs and subtracted them from the IDs from each query to find the ones that were missed in the index.

We can see a histogram of each bin comparing the original number of RDs from the random sample versus the missed RDs in each bin.

Bin 1: Screenshot 2024-04-17 at 2 42 27 p m

Items in the bin: 1015 Missed items: 104 Missed percentage: 10.24%

Bin 2: Screenshot 2024-04-17 at 2 42 49 p m Items in the bin: 1000 Missed items: 55 Missed percentage: 5.5%

Bin 3: Screenshot 2024-04-17 at 2 43 03 p m Items in the bin: 984 Missed items: 27 Missed percentage: 2.74%

Bin 4: Screenshot 2024-04-17 at 2 43 18 p m Items in the bin: 1016 Missed items: 24 Missed percentage: 2.36%

Bin 5: Screenshot 2024-04-17 at 2 43 30 p m Items in the bin: 984 Missed items: 30 Missed percentage: 3.04%

Total items in the sample = 4999 Total missed: 240 Percentage of missed items: 4.80%

However, there was still something strange that didn't match: Why was the current growth rate of new documents so high, around 50%, when the missing rate in the first part of the index was around 10.24%?

I took a different approach to try to answer this question.

The sample we took belongs to RECAPDocuments. However, the initial indexing process was based on Dockets. This means that if a Docket was missed, all their associated RECAPDocuments were missed too. Therefore, for each missed RECAPDocument we identified, there could be many more that belong to the same docket.

I took a sample of 5,000 random Docket IDs from the DEV DB and analyzed it.

The theory I had is that the RECAPDocuments density on dockets is not evenly spread across all the dockets in the archive.

So, I divided the 5,000 random IDs into 5 bins and obtained the RECAPDocument count for each Docket. Here are the results:

Docket Bin 1: Screenshot 2024-04-17 at 3 15 48 p m

Docket Bin 2: Screenshot 2024-04-17 at 3 16 51 p m

Docket Bin 3: Screenshot 2024-04-17 at 3 17 22 p m

Docket Bin 4: Screenshot 2024-04-17 at 3 22 03 p m

Docket Bin 5: Screenshot 2024-04-17 at 3 22 13 p m

From here, we can observe that the density of RECAPDocuments across Dockets varies significantly.

We are particularly interested in the first tens of millions of documents in the DB, which are the current ones with a high rate of new documents during re-indexing.

From Docket Bin 1, we can see there is a quite dense area of RECAPDocuments within Dockets: Screenshot 2024-04-17 at 3 27 16 p m

So the question to answer is: Is it possible that most of the missing RECAPDocuments belong to dockets in this dense region? That would mean we missed a lot the dockets from this region.

To answer this question, I ran the following script using the Dev DB:

from cl.search.models import RECAPDocument

missed_1 = [22757543, 22809778, 23289535, 23708599, 24375459, 24397298, 24516674, 24588258, 24597368, 24799067, 24922153, 25109088, 25347814, 25485395, 26019844, 26069605, 26081877, 26151743, 26355649, 26462572, 26582734, 26952237, 26983847, 27033473, 27134655, 27431757, 28332687, 30108557, 30151910, 30642871, 30975794, 33773170, 35315330, 35718229, 36842676, 37611367, 38007103, 38011927, 40418314, 41844375, 44010134, 44497150, 47228530, 47409815, 49068779, 49833146, 49889846, 50060269, 50598818, 51153196, 51240659, 51285204, 51447387, 51532793, 51994499, 52791135, 52857122, 53329034, 53729161, 53806951, 53807848, 54425751, 55242352, 55295835, 55342834, 55754094, 56221213, 56343893, 56361748, 57430565, 57634955, 58596010, 61415376, 68103072, 75519571, 76819430, 76828970, 76842118, 76856740, 76940343, 77009067, 77047274, 77080725, 77180621, 77448756, 77580149, 77646852, 77686194, 83605467, 87906933, 89587049, 91973744, 92040594, 92109850, 92115246, 92614028, 93161374, 93302686, 94246373, 94398972, 94849919, 94905976, 95157226, 95360326]
missed_2 = [97392704, 98307560, 98503745, 98957322, 100604746, 101253673, 101436562, 101734302, 103464632, 103535712, 103781778, 106501528, 108565287, 109071244, 109381885, 111753284, 111878328, 112082607, 117161000, 118955245, 122779389, 123319155, 123650324, 123958080, 127037751, 129049723, 130998790, 131392117, 131714817, 132096399, 134720714, 138144136, 138730959, 139663131, 141267851, 141318345, 144047051, 144330886, 146461939, 148121160, 148818752, 155919471, 155962089, 155964994, 156320322, 156616572, 156925354, 158526606, 159995078, 160307817, 160751872, 162523760, 162711778, 162809481, 169306523]
missed_3 = [170726551, 172643079, 178869735, 181254091, 181749568, 185256596, 187014355, 187806125, 188892166, 190998852, 191943698, 195177572, 195688976, 197794320, 202434072, 202605057, 203969117, 206281715, 206323973, 207918854, 213060575, 214843748, 214996812, 220582601, 229012110, 234915788, 243110770]
missed_4 = [256659905, 262816633, 263295723, 265471304, 267512508, 269006594, 276808149, 278495415, 279986315, 282322160, 283475119, 291312443, 295791359, 295840279, 300065711, 300415665, 302556990, 302666825, 307218961, 308939024, 313366056, 316854400, 316970814, 317876938]
missed_5 = [325402509, 326854634, 328572409, 330232958, 334517343, 335088681, 338885216, 340552479, 342393654, 342559435, 342562201, 347599503, 348551976, 352422844, 354882580, 355174820, 355239668, 358359085, 362540528, 363113594, 364043095, 365652675, 368817102, 370656781, 370953656, 371272023, 374273336, 376972473, 377230120, 377968801]

all_missed_items = missed_1 + missed_2 + missed_3 + missed_4 + missed_5

all_unique_dockets = []
unique_dockets = RECAPDocument.objects.filter(pk__in=all_missed_items).values_list("docket_entry__docket_id", flat=True).order_by("docket_entry__docket_id").distinct("docket_entry__docket_id")
all_unique_dockets.append(list(unique_dockets))

print("All unique missed dockets:", all_unique_dockets)

This script returns the unique Docket IDs which are parent of the missed RECAPDocuments we got from our sample, considering many of the RECAPDocuments missed could belong to the same docket.

Screenshot 2024-04-18 at 10 00 32 a m

Here we can see that the majority of all the missing RDs we could get from our sample, their dockets are contained within the region of dockets which are pretty dense with RECAPDocuments (around docket IDs ~ 4,000,000 to ~8,000,000).

However, this still doesn't answer the question yet about why the rate of new documents during this region of the DB was around 50%.

So the next step is to measure how dense the specific missed dockets are.

I queried from the Dev DB the number of RDs of each "missing" docket from the sample. from cl.search.models import RECAPDocument

unique_docket_ids_related_missing_rds =[4141556, 4158146, 4165248, 4175783, 4177428, 4178861, 4180400, 4181289, 4181437, 4192039, 4194052, 4200359, 4210967, 4215198, 4215372, 4215939, 4216021, 4222632, 4223252, 4226439, 4228603, 4229149, 4246802, 4251110, 4251480, 4263845, 4269221, 4271588, 4276354, 4303352, 4308644, 4310338, 4317950, 4317985, 4319163, 4323910, 4326751, 4328274, 4351078, 4354802, 4356193, 4363195, 4371805, 4371810, 4373717, 4378675, 4384636, 4390395, 4391960, 4391987, 4398825, 4399752, 4399755, 4408438, 4414060, 4488923, 4493419, 4494927, 4496098, 4499389, 4499449, 4513558, 4516188, 4516818, 4518214, 4519499, 4519603, 4522956, 4524166, 4527762, 4528605, 4530411, 4531095, 4539587, 4565729, 4570530, 4579309, 4608994, 4611208, 4634497, 4663129, 4693022, 4757507, 4767277, 4818232, 4887526, 4999953, 5012313, 5071270, 5075169, 5075512, 5106012, 5124508, 5159506, 5159507, 5187957, 5307404, 5328645, 5364221, 5425359, 5458858, 5568807, 5571178, 5584737, 5585121, 5592639, 5595607, 5617685, 5644596, 5669155, 5683246, 5705689, 5714852, 5809697, 5945678, 5967560, 5980784, 6040548, 6060730, 6075004, 6089782, 6092906, 6099237, 6100941, 6102388, 6105043, 6106344, 6124843, 6133960, 6137166, 6144343, 6144578, 6148661, 6154483, 6156873, 6185778, 6188499, 6190929, 6195185, 6204304, 6207100, 6207611, 6208221, 6217942, 6225905, 6264694, 6266465, 6270109, 6270846, 6272411, 6278610, 6278646, 6292179, 6294045, 6294542, 6295503, 6297966, 6298555, 6305529, 6306044, 6307614, 6308546, 6309029, 6309340, 6309343, 6316208, 6316339, 6323036, 6325247, 6331037, 6331871, 6340040, 6353732, 6357493, 6358041, 6361098, 6362461, 6365106, 6368549, 6377014, 6382996, 6391669, 6393461, 6395420, 6423332, 6431855, 6432129, 6443870, 6445259, 15234409, 15281221, 15281594, 15282091, 15282703, 15285982, 15288781, 15290239, 15291544, 15295372, 15305307, 15310056, 15312439, 15313829, 15516784, 15621212, 15858606, 15908988, 15910012, 15912400, 15912420, 15912723, 15913742, 15914092, 15969298, 15977680, 16004753, 16178319, 16197733, 16199040, 16224289, 17401124, 17401360, 18649393, 18649399, 37451203, 58041792, 59236790, 59239017, 59693243, 59894911, 60399598, 60703404, 61610328, 63333927]

docket_density_missed = {}
for docket_id in unique_docket_ids_related_missing_rds:
    rd_count_per_docket = RECAPDocument.objects.filter(docket_entry__docket_id=docket_id).count()
    docket_density_missed[docket_id] = rd_count_per_docket

print("docket_density_found all rds: ", docket_density_missed)

Screenshot 2024-04-18 at 10 02 43 a m

And yes, we can notice that specifically, the dockets around 4,000,000 to ~8,000,000 where the missing rate is pretty high, have a great density of RDs.

Considering that we missed most of these dockets and their RDs. How are they RDs distributed across the DB? For this, I queried all the RDs for each "missed" docket, using the following script: get_rds_counts_for_missing_dockets.txt

Screenshot 2024-04-18 at 10 07 05 a m

From here, we can see that most of the documents that belong to these missed dockets are at the beginning of the RDs table, which could answer the question of why the rate of new documents indexed has been too high during indexing the first tens million RDs.

I also queried the number of RDs that belong to dockets in the first bin and found something interesting.

From the 911 dockets with RDs found: The total number of RDs which belong to these dockets are: 171,994. That's an average of: 188.79 RD per Docket.

While the numbers for the missing Dockets in the first bin are: Total Dockets: 104 The total number of RDs which belong to these dockets are: 136,688 That's an average of: 1314.30 RD per Docket.

Meaning that dockets with a high number of RDs were the ones that were missing, possibly due to the timeout and connection errors issues we were experiencing during that time.

To answer the question about whether the high rate of new documents we're currently experiencing in the sweep indexer (around 50%) is ok I did the following:

Took the missing RDs rates from the sample vs the RDs found.

Screenshot 2024-04-18 at 10 26 14 a m

I computed a summation, considering the statistical missing rate on each bin to compute the expected missing elements on each bin based on the actual numbers in DB within that bin and got these results:


Bin: 0
Percentage of missed: 0.391304347826087
Actual number of items in the bin: 3662989
Theoretical missed: 1433343.5217391306
-----------
Bin: 1
Percentage of missed: 0.18
Actual number of items in the bin: 3677965
Theoretical missed: 662033.7
-----------
Bin: 2
Percentage of missed: 0.07142857142857142
Actual number of items in the bin: 3630660
Theoretical missed: 259332.85714285713
-----------
Bin: 3
Percentage of missed: 0.06153846153846154
Actual number of items in the bin: 3547033
Theoretical missed: 218278.95384615386
-----------
Bin: 4
Percentage of missed: 0.07692307692307693
Actual number of items in the bin: 3573099
Theoretical missed: 274853.76923076925
-----------
Bin: 5
Percentage of missed: 0.038461538461538464
Actual number of items in the bin: 3538916
Theoretical missed: 136112.15384615384
-----------
Bin: 6
Percentage of missed: 0.05555555555555555
Actual number of items in the bin: 3511213
Theoretical missed: 195067.38888888888
-----------
Bin: 7
Percentage of missed: 0.11538461538461539
Actual number of items in the bin: 3649675
Theoretical missed: 421116.3461538462
-----------
Bin: 8
Percentage of missed: 0.2608695652173913
Actual number of items in the bin: 3666255
Theoretical missed: 956414.3478260869
-----------
Bin: 9
Percentage of missed: 0.20454545454545456
Actual number of items in the bin: 3580019
Theoretical missed: 732276.6136363636
-----------
Bin: 10
Percentage of missed: 0.03278688524590164
Actual number of items in the bin: 3571374
Theoretical missed: 117094.22950819673
...
Bin: 100

Whole results here: expected_missing_documents.txt

Total theoretical RDs missed: 17,512,370

Something interesting is that the first bin which belong to RDs close to the last RD indexed (22410942) has a missing rate of 39%. Unfortunately, this method can't tell us the missing rate of previous stages of the current indexing process.

The 39% is not exactly the 50% of new documents rate we were experiencing, however, it's a good approximation. Considering the noise introduced by the fact that documents were not missed randomly, but rather due to the volume of RDs in a docket, large dockets were more likely to be missed. I tried to think of a method that takes this variable into account, but I couldn't come up anything helpful with the available data. I think the only way to achieve this is to identify all the dockets that were missed, count their exact RDs, and determine their exact locations. However, this isn't practical since we're talking about at least 3 million dockets and tens of millions of RDs.

However with this sample I think the numbers are quite solid.

So analyzing the results as a whole:

Considering the sample was taken from 22,410,942 RD ID And the last RD ID in the DB at the moment was: 393,006,998 (Approximate number it doesn't take into account deleted documents.)

And the theoretical number of missing RDs that belong to the dockets that were likely missed is: 17,512,370 And the number of RDs indexed at the moment was: 373,358,555

Te numbers make sense: 373,358,555 + 17,512,370 = 390,870,925 ≈ 393,006,998

The difference of ~3M documents could be explained by the noise related to dockets missed were the big ones as described above.

In conclusion:

It seems that everything is OK in terms of indexing, the data suggest that the high indexing rate is due to missing a lot of dense dockets and their RDs during the first indexing, so we expected that the rate of new documents indexed to decrease as the sweep_indexer advances and the number of RDs converges to the expected number so it matches the RD number in the DB.

@mlissner Before starting the sweep_indexer, can we get the following counts? So we can have the current counts and the exact number of documents expected at the end for Dockets and RDs.

from cl.search.models import Docket, RECAPDocument

docket_count_to_index = Docket.objects.filter(source__in=Docket.RECAP_SOURCES).count()
print("docket_count_to_index:", docket_count_to_index)

rds_to_index = RECAPDocument.objects.all().count()
print("rds_to_index: ", rds_to_index)

And on Kibana:

GET /recap_vectors/_count
# No body

GET /recap_vectors/_count
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "docket_child": "recap_document"
          }
        }
      ]
    }
  }
}

GET /recap_vectors/_count
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "docket_child": "docket"
          }
        }
      ]
    }
  }
}

mlissner commented 6 months ago

Great analysis, thank you very much Alberto for getting to the bottom of this so we can be sure we're not making a big mistake when we index.

Here's the result from your query:

docket_count_to_index: 56906063
rds_to_index:  392385753

And the counts:

{
  "count": 431919665,
  "_shards": {
    "total": 30,
    "successful": 30,
    "skipped": 0,
    "failed": 0
  }
}

{
  "count": 375013290,
  "_shards": {
    "total": 30,
    "successful": 30,
    "skipped": 0,
    "failed": 0
  }
}

{
  "count": 56906961,
  "_shards": {
    "total": 30,
    "successful": 30,
    "skipped": 0,
    "failed": 0
  }
}

albertisfu commented 6 months ago

Great thanks! the results looks great, so we can re-enable the sweep indexer whenever you want

mlissner commented 6 months ago

Great. It's going again. I think I'll mark this as complete and move it to your In Review column. Thanks again!

freelawproject / courtlistener