Reindexing performance degrades non-linearly

bhalsey commented 4 weeks ago

Description

Reindexing a test instance with 460M entities took over 4 days on a 3-node cluster running FusionAuth 1.52.1. Performance slowed significantly after 300M entities were indexed. More work is required to clearly identify the bottleneck and to find mitigations, such as doubling the cluster size.
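To put a number on the slowdown, one option is to sample the index's document count at intervals during a reindex (for example with Elasticsearch's `GET /<index>/_count`) and compare per-interval throughput. The sketch below is only an illustration: the function name and the sample values in the usage lines are hypothetical placeholders, not measurements from this test.

```python
def interval_throughput(samples):
    """Return docs/sec for each consecutive pair of samples.

    samples: list of (elapsed_seconds, docs_indexed) tuples in ascending
    time order, e.g. collected by polling the index document count.
    """
    rates = []
    for (t0, n0), (t1, n1) in zip(samples, samples[1:]):
        rates.append((n1 - n0) / (t1 - t0))
    return rates


# Hypothetical placeholder samples, NOT data from the run described above.
example = [(0, 0), (100, 50_000), (200, 80_000)]
print(interval_throughput(example))
```

A steadily falling series of rates would confirm that the degradation is progressive rather than a single stall.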

robotdan commented 3 weeks ago

There is not a lot of information here. Is the thesis that this is a problem with FusionAuth, or that we just need to scale Elasticsearch and the relational database adequately?

Ideally we would only open public GH issues for things that need work in FusionAuth.

mooreds commented 3 weeks ago

The servers were all adequately sized: XLs, 1.5TB of disk.

My take is that it is weird that the re-index got exponentially slower. This thread, however, indicates that it may have been due to disk I/O: https://discuss.elastic.co/t/reindexing-throughput-degrades-over-time/265279

I couldn't figure out a way to see disk queue depth for the ES nodes, but maybe that would help determine if this was an infra problem.
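One partial substitute, as far as I know: Elasticsearch's node stats API (`GET /_nodes/stats/fs`) reports cumulative disk counters under `fs.io_stats.total` on Linux hosts. It does not expose queue depth itself (on the host, `iostat -x` does, in the `aqu-sz` column), but the difference between two snapshots gives per-node write rates. A minimal sketch, assuming both responses have already been fetched and parsed into dicts; the function name is hypothetical:

```python
def write_rates(snap_a, snap_b, seconds):
    """Per-node write rates between two parsed `_nodes/stats/fs` responses.

    snap_a is the earlier snapshot, snap_b the later one, and `seconds`
    is the time between them. Nodes missing from snap_a are skipped.
    """
    rates = {}
    for node_id, node_b in snap_b["nodes"].items():
        node_a = snap_a["nodes"].get(node_id)
        if node_a is None:
            continue
        # Counters are cumulative, so the delta over the interval is the rate.
        a = node_a["fs"]["io_stats"]["total"]
        b = node_b["fs"]["io_stats"]["total"]
        rates[node_b.get("name", node_id)] = {
            "write_ops_per_s": (b["write_operations"] - a["write_operations"]) / seconds,
            "write_kb_per_s": (b["write_kilobytes"] - a["write_kilobytes"]) / seconds,
        }
    return rates
```

If write rates flatten at the same point where indexing throughput drops, that would point toward the disks rather than FusionAuth.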

More supposition here: https://inversoft.slack.com/archives/C051S8N8E/p1728071443879379

So I think this does need some investigation to determine whether any changes are needed to the core product (which, after all, is what controls the re-indexing process).

bhalsey commented 3 weeks ago

Agreed that more work is needed. The outcome could simply be guidance on sizing a cluster. Or it could entail changes to how FusionAuth manages reindexing, such as modifying the index refresh interval.
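To make the refresh-interval idea concrete: `index.refresh_interval` is a dynamic Elasticsearch index setting, and setting it to `-1` disables periodic refreshes, which is commonly suggested for bulk loads. The sketch below only builds the request for `PUT /<index>/_settings`; it does not describe how FusionAuth currently handles reindexing, and the index name `my_index` and the restored value `1s` are assumptions.

```python
import json


def refresh_settings_request(index, interval):
    """Build (method, path, body) for updating index.refresh_interval.

    interval: a time value such as "30s", "-1" to disable periodic
    refreshes, or None to reset the setting to its default.
    """
    body = {"index": {"refresh_interval": interval}}
    return "PUT", f"/{index}/_settings", json.dumps(body)


# Hypothetical index name; send these with any HTTP client.
before_reindex = refresh_settings_request("my_index", "-1")
after_reindex = refresh_settings_request("my_index", "1s")
print(before_reindex)
print(after_reindex)
```

Whether this helps at 300M+ documents would still need to be measured; it is one candidate mitigation alongside cluster sizing.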

FusionAuth / fusionauth-issues

Reindexing performance degrades non-linearly #2896
