datahub-project / datahub


datahub-upgrade RestoreIndices process performance is poor on large sets of data #11276

Open nmbryant opened 1 month ago

nmbryant commented 1 month ago

Describe the bug
The datahub-upgrade RestoreIndices process degrades significantly as it runs against a large set of data. In our lower environment we have 9 million aspects, and in our other environments we have 50+ million. With urnBasedPagination set to false, threads set to 10, and running in dev against all data, it processed the first 1 million aspects in 20 minutes, which was reasonable. However, as it runs, each batch takes significantly longer than the last and the estimated time to completion keeps growing. After running for over 12 hours it had only processed 4 million aspects before I stopped it.

Because of this, back in December we implemented urnBasedPagination, which gave linear performance regardless of the number of aspects. When we implemented it, RestoreIndices processed all 9M aspects in the dev environment in under an hour. However, after upgrading from v0.10.5 to v0.12.1 the process is much slower with urnBasedPagination; execution time is still linear, but it now takes 1 hour to process 600K aspects.

To Reproduce
Steps to reproduce the behavior:

  1. Run the RestoreIndices process of datahub-upgrade on all aspects; you need at least a few million aspects to reproduce (an example invocation is sketched below this list).
  2. With urnBasedPagination set to false and threads set to 10, the performance of each batch degrades over time.
  3. With urnBasedPagination set to true, performance is linear but slow, at roughly 600K aspects per hour.
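
For illustration, an invocation along these lines is what the steps above assume. The -u/-a argument style is the standard datahub-upgrade interface, but the exact argument names (batchSize, numThreads, urnBasedPagination) are assumptions from our setup and may differ between DataHub versions:

    # Sketch only; argument names may vary by DataHub version
    ./datahub-upgrade.sh -u RestoreIndices \
        -a batchSize=5000 \
        -a numThreads=10 \
        -a urnBasedPagination=false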

Expected behavior
With threading enabled and urnBasedPagination set to false, the process should be able to run against a large set of data in a reasonable amount of time; as it is now, it is not usable for large amounts of data. With urnBasedPagination set to true, we would like the performance restored to what it was when we first implemented it.


Additional context
My guess is that the cause is the change from PagedList to a stream in EbeanAspectDao. This is the version of EbeanAspectDao that is performant for us: https://github.com/nmbryant/datahub/blob/3ed6102d875b40452354d4044d7b89724c702ab6/metadata-io/src/main/java/com/linkedin/metadata/entity/ebean/EbeanAspectDao.java#L464

From my testing, it appears that it became slower after this change: https://github.com/datahub-project/datahub/commit/9a0a53bbe44d0ca5ea420c1654f87c4f7c636c67

@david-leifker What was the reason for switching it to a stream and do you think I would be able to easily switch back to a PagedList to see if that would improve the performance?
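
For anyone following along, here is a minimal sketch of the two query shapes being compared. It assumes Ebean's Query API (findPagedList vs findStream; method availability may vary with the Ebean version) and uses a simplified stand-in entity; it is not the actual EbeanAspectDao or EbeanAspectV2 code:

    // Illustrative sketch only; AspectRow is a simplified stand-in entity,
    // not DataHub's actual EbeanAspectV2 mapping.
    import io.ebean.DB;
    import io.ebean.PagedList;
    import java.util.stream.Stream;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    @Entity
    @Table(name = "metadata_aspect_v2")
    class AspectRow {
      @Id String urn;
      String aspect;
      long version;
    }

    class PagingSketch {

      // Old shape: each page is a separate bounded query returning a PagedList.
      PagedList<AspectRow> getPagedAspects(int start, int pageSize) {
        return DB.find(AspectRow.class)
            .where().eq("version", 0L)        // latest rows only
            .orderBy("urn, aspect")
            .setFirstRow(start)
            .setMaxRows(pageSize)
            .findPagedList();
      }

      // New shape: one long-lived streaming query over the whole result set.
      Stream<AspectRow> streamAspects() {
        return DB.find(AspectRow.class)
            .where().eq("version", 0L)
            .orderBy("urn, aspect")
            .findStream();
      }
    }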

nmbryant commented 1 month ago

After some further analysis I made some progress: I was able to restore the performance by adding the original restoreIndices function back to EntityServiceImpl and adding getPagedAspects back to EbeanAspectDao.

Comparing the results between these changes and what's currently in main, there appear to be 2 parts that are significantly slower with the newer code.

  1. This step causes a significant slowdown, and performance of the full restore improved a lot after commenting it out. When doing a full restore, running this step doesn't seem viable as it stands. Should there be an env var to disable this step, or should improvements be made to its performance?

  2. The SQL queries take significantly longer than with the old code. Here is a comparison of the first batch of 5000 between the new code and the old code:

      # Old Code
      2024-09-03 19:54:05,281 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:15 - metrics so far RestoreIndicesResult(ignored=0, rowsMigrated=5000, timeSqlQueryMs=3, timeGetRowMs=0, timeUrnMs=18, timeEntityRegistryCheckMs=13, aspectCheckMs=3, createRecordMs=512, sendMessageMs=2002, defaultAspectsCreated=0, lastUrn=, lastAspect=)
    
      # New Code
      2024-09-04 13:28:38,668 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:15 - metrics so far RestoreIndicesResult(ignored=0, rowsMigrated=5000, timeSqlQueryMs=438, timeGetRowMs=0, timeUrnMs=11, timeEntityRegistryCheckMs=9, aspectCheckMs=12, createRecordMs=3, sendMessageMs=29947, defaultAspectsCreated=0, lastUrn=, lastAspect=)

    The timeSqlQueryMs value increases with each batch, which makes it seem like the query gets slower over time as well, since it doesn't appear to be a cumulative metric and should only cover the particular batch.

      # Later batch (note this is with the step 6 mentioned earlier commented out)
      2024-09-04 14:30:19,396 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - metrics so far RestoreIndicesResult(ignored=0, rowsMigrated=785000, timeSqlQueryMs=389751, timeGetRowMs=0, timeUrnMs=89, timeEntityRegistryCheckMs=176, aspectCheckMs=266, createRecordMs=33, sendMessageMs=39263, defaultAspectsCreated=0, lastUrn=, lastAspect=)

    As of now I'm not sure what causes the slowdown in the SQL query, but it is something that needs to be improved.
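
    To spell out what I would expect (not a diagnosis): with true urn-keyed paging each batch should cost roughly the same, whereas a per-batch timeSqlQueryMs that keeps climbing is the pattern offset-style paging produces, because every batch scans past everything already processed. The sketch below shows the two query shapes, reusing the simplified AspectRow stand-in from the earlier comment rather than the real DAO code:

      // Illustrative query shapes only (AspectRow is the stand-in entity above,
      // not the real EbeanAspectV2 mapping).
      import io.ebean.DB;
      import java.util.List;

      class PagingCost {
        // Offset paging: each batch scans and discards all rows before the offset,
        // so cost grows as the restore advances.
        static List<AspectRow> offsetPage(int start, int batchSize) {
          return DB.find(AspectRow.class)
              .where().eq("version", 0L)
              .orderBy("urn, aspect")
              .setFirstRow(start)          // grows by batchSize every batch
              .setMaxRows(batchSize)
              .findList();
        }

        // Keyset ("urn-based") paging: resume strictly after the last processed urn,
        // so each batch reads roughly the same number of rows.
        static List<AspectRow> keysetPage(String lastUrn, int batchSize) {
          return DB.find(AspectRow.class)
              .where().eq("version", 0L)
              .gt("urn", lastUrn)          // carried over from the previous batch
              .orderBy("urn, aspect")
              .setMaxRows(batchSize)
              .findList();
        }
      }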

Note that these comments only apply to urnBasedPagination; I haven't looked into the performance that degrades over time when it is set to false with threads.

nmbryant commented 1 month ago

Actually, it looks like the overall performance was pretty close to the old code once the MCP step was commented out. So these are the changes that I would propose:

  1. Short term: add a flag to disable MCP events in the RestoreIndices process (a rough sketch of what this could look like is below this list).
  2. Long term: improve the performance of sending the MCP events.
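
The env var name and helper methods in this sketch are hypothetical and reuse the AspectRow stand-in from earlier; none of these names exist in DataHub today:

    // Hypothetical sketch of proposal 1; the env var and method names are made up.
    import java.util.List;

    class RestoreWithFlag {

      void restoreAll(Iterable<List<AspectRow>> batches) {
        boolean skipMcp = Boolean.parseBoolean(
            System.getenv().getOrDefault("RESTORE_INDICES_SKIP_MCP", "false"));

        for (List<AspectRow> batch : batches) {
          restoreBatch(batch);        // per-batch restore work
          if (!skipMcp) {
            sendMcpEvents(batch);     // the expensive emission step the flag would skip
          }
        }
      }

      void restoreBatch(List<AspectRow> batch) { /* hypothetical: rebuild index docs */ }

      void sendMcpEvents(List<AspectRow> batch) { /* hypothetical: emit MCP events */ }
    }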

Another issue I'm getting with the new code is this timeout error:

java.util.concurrent.TimeoutException: Exceeded request timeout of 10000ms

Does anyone know which variable I can change to increase this timeout? This doesn't happen with the old code, with which I am able to restore the entire dev environment with 9M aspects.

vejeta commented 4 days ago

AFAIK, the timeout had a default value that was removed a month ago here:

https://github.com/datahub-project/datahub/pull/11456/files#diff-4c4357beca85c5d52050b7923d49c04753f291f51867a06b4e6ea69ff1f0a0c9R219

That change followed a similar reported issue: https://github.com/datahub-project/datahub/issues/10557

If you are on an older version of DataHub and don't want to upgrade yet, you can change it here: https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/ESIndexBuilder.java#L150
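
In case it helps anyone hitting the same TimeoutException: the sketch below shows the generic knob for request/socket timeouts on the Elasticsearch low-level REST client. It is only orientation for the kind of setting involved; DataHub wires its client elsewhere (see ESIndexBuilder and the linked PR), so this is not a drop-in patch:

    // Illustration only: generic Elasticsearch REST client timeout configuration,
    // not DataHub's actual wiring.
    import org.apache.http.HttpHost;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestClientBuilder;

    public class EsClientTimeoutExample {
      public static RestClientBuilder builderWithLongerTimeouts() {
        return RestClient.builder(new HttpHost("localhost", 9200, "http"))
            .setRequestConfigCallback(requestConfig -> requestConfig
                .setConnectTimeout(5_000)        // ms
                .setSocketTimeout(120_000));     // well above the 10s seen in the error
      }
    }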