[Open] gterzian opened 1 year ago
This is running this branch, using Automerge 0.5.0 (the problem was similar with ChangeGraph on 0.4.0). You can get to 100% CPU almost immediately if you remove this sleep.
Most of the work that has to be done in get_changes is generating clocks, so what this is telling us is that the application is spending most of its time in get_changes. I think we'll need to do a bit more investigating to figure out whether this is a problem in automerge, spanreed, or the bakery example. Questions that occur to me:

1. How often are we calling get_changes? I.e. what is the average time of a get_changes call? Does this increase over time?
2. How much work is being done by get_changes? How big are the documents involved? Does this increase over time?
3. What triggers a get_changes call?

Actually, given that get_changes is most likely being called inside generate_sync_message, we should probably repeat all the questions above except replace get_changes with generate_sync_message.
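To answer the timing questions, the relevant call sites in spanreed could be wrapped with something like the following. This is a minimal sketch using std::time::Instant; the timed helper and the dummy workload are made up for illustration and are not existing spanreed code.

```rust
use std::time::Instant;

/// Hypothetical helper: run `f`, print how long it took, and return its result.
/// Wrapping each generate_sync_message / get_changes call site with this (and
/// logging the document size or heads alongside) would answer questions 1 and 2.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    // Microseconds keep the unit unambiguous in the output.
    println!("Running {} took {} microseconds.", label, start.elapsed().as_micros());
    result
}

fn main() {
    // Dummy usage so the sketch compiles on its own; in spanreed the closure
    // would wrap the real call, e.g. the generate_sync_message invocation.
    let sum = timed("dummy_work", || (0..1_000_000u64).sum::<u64>());
    println!("{sum}");
}
```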
I will double check when I am back in the office next week, but so far:
Some prints from the beginning of the run (CPU usage in the low tens of percent):
Messages received from RepoId("12c0d873-7aa5-4dfc-96a2-4eb9001cd53b") 3
Running receive_sync_message took 28208 milliseconds.
Running receive_sync_message took 22958 milliseconds.
Running receive_sync_message took 21792 milliseconds.
Running generate_sync_message took 35542 milliseconds.
Running generate_sync_message took 39375 milliseconds.
Generating 0 sync messages 78833 milliseconds.
Syncing docs with 3 sync messages took 199500 milliseconds.
Syncing docs with 0 sync messages took 42 milliseconds.
Messages received from RepoId("b9e46fdc-291a-4b63-9d41-baf5dcd60b33") 1
Running receive_sync_message took 19083 milliseconds.
Running generate_sync_message took 30541 milliseconds.
Running generate_sync_message took 26042 milliseconds.
Generating 0 sync messages 59958 milliseconds.
Syncing docs with 1 sync messages took 88416 milliseconds.
Prints from the end of the run, about a cup of coffee later (at 99% CPU):
Messages received from RepoId("b9e46fdc-291a-4b63-9d41-baf5dcd60b33") 3
Running receive_sync_message took 4690708 milliseconds.
Running receive_sync_message took 4640333 milliseconds.
Running receive_sync_message took 4661333 milliseconds.
Running generate_sync_message took 8178041 milliseconds.
Running generate_sync_message took 8043792 milliseconds.
Generating 1 sync messages 16230500 milliseconds.
Syncing docs with 3 sync messages took 30251834 milliseconds.
Running generate_sync_message took 8201792 milliseconds.
Running generate_sync_message took 8129125 milliseconds.
Generating 0 sync messages 16337667 milliseconds.
Syncing docs with 0 sync messages took 500 milliseconds.
Messages received from RepoId("12c0d873-7aa5-4dfc-96a2-4eb9001cd53b") 2
Running receive_sync_message took 4331584 milliseconds.
Running receive_sync_message took 3965834 milliseconds.
Running generate_sync_message took 8085666 milliseconds.
Running generate_sync_message took 8069416 milliseconds.
Generating 1 sync messages 16166167 milliseconds.
Syncing docs with 2 sync messages took 24489042 milliseconds.
It seems timings overall go up by about 150-200 times, while the number of messages applied/generated stays the same.
For context, here is a PR that shows where those prints come from.
I think this answers question 1 above. @alexjg
(By the way, those are nanoseconds, not milliseconds; I made an error in the prints.)
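To be explicit about the mix-up (a hypothetical sketch, not the actual print from the linked PR): the elapsed Duration was read out in nanoseconds but labelled as milliseconds, roughly like this:

```rust
use std::time::{Duration, Instant};

fn main() {
    let start = Instant::now();
    std::thread::sleep(Duration::from_millis(5)); // stand-in for the measured call
    let elapsed = start.elapsed();

    // What the prints above effectively did: a nanosecond value with a millisecond label.
    println!("Running receive_sync_message took {} milliseconds.", elapsed.as_nanos());

    // Corrected: either match the label to the accessor...
    println!("Running receive_sync_message took {} nanoseconds.", elapsed.as_nanos());
    // ...or read the value in the unit the label claims.
    println!("Running receive_sync_message took {} milliseconds.", elapsed.as_millis());
}
```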
This is running in release?
This is running in release?
Yes, running the commands as found in the README.
How big are the documents involved? Does this increase over time?
The documents consist of this. The u32 values are incremented over time; the maps do not grow over time.
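For a rough idea of the shape, here is a sketch of a document built the same way with the Rust API. The key names and values are made up for illustration and do not match the bakery example exactly; the point is that the same keys are overwritten on every step, so the maps stay the same size while the change history grows.

```rust
use automerge::{transaction::Transactable, AutoCommit, AutomergeError, ObjType, ROOT};

// Sketch: a fixed-size map whose integer values are overwritten with larger
// numbers over time. Each loop iteration produces a new change, but the map
// itself never gains new keys.
fn build_doc() -> Result<AutoCommit, AutomergeError> {
    let mut doc = AutoCommit::new();
    let counts = doc.put_object(ROOT, "counts", ObjType::Map)?;

    for step in 0u64..100 {
        doc.put(&counts, "customers_served", step)?;
        doc.put(&counts, "breads_baked", step * 2)?;
    }
    Ok(doc)
}

fn main() {
    let mut doc = build_doc().expect("building the document failed");
    println!("heads: {:?}", doc.get_heads());
}
```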
Nice. Okay I think what we need to do is get hold of a single problematic message and document somehow.
It appears that as the ChangeGraph grows, this loop takes more and more time. All the manipulation of data structures in that loop grows at about the same rate, with the btrees being the most expensive. The heads argument to clock_for_heads remains constant, usually one or two.
I've opened a PR that highlights the issue. The test currently fails because calling clock_for_heads on a large change graph takes about 2 ms, while on a small change graph it takes less than 1 ms; measured in nanoseconds it's about a 75x difference. If you increase the number of changes by ten, the difference goes up by about ten as well, to roughly 750x.
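The PR has the actual test; roughly, the same scaling can also be measured from the outside by timing the sync message a document generates for a peer that is already almost fully in sync, once with a small change graph and once with a large one. A sketch only, assuming the 0.5 sync API on Automerge, with made-up change counts and key names:

```rust
use std::time::Instant;

use automerge::{
    sync::{State as SyncState, SyncDoc},
    transaction::Transactable,
    Automerge, ROOT,
};

// Build a document whose change graph has roughly `n` entries (one change per transaction).
fn doc_with_changes(n: u64) -> Automerge {
    let mut doc = Automerge::new();
    for i in 0..n {
        doc.transact(|tx| tx.put(ROOT, "n", i)).unwrap();
    }
    doc
}

// Exchange sync messages until both sides have nothing left to send.
fn sync_all(a: &mut Automerge, sa: &mut SyncState, b: &mut Automerge, sb: &mut SyncState) {
    loop {
        let msg_a = a.generate_sync_message(sa);
        let msg_b = b.generate_sync_message(sb);
        if msg_a.is_none() && msg_b.is_none() {
            break;
        }
        if let Some(m) = msg_a {
            b.receive_sync_message(sb, m).unwrap();
        }
        if let Some(m) = msg_b {
            a.receive_sync_message(sa, m).unwrap();
        }
    }
}

fn time_reply(n: u64) -> u128 {
    let mut doc = doc_with_changes(n);
    let mut peer = Automerge::new();
    let mut doc_state = SyncState::new();
    let mut peer_state = SyncState::new();

    // Bring the two documents fully in sync once, so the sync states carry shared
    // heads, like the long-lived sync states in the bakery example.
    sync_all(&mut doc, &mut doc_state, &mut peer, &mut peer_state);

    // One more change on `doc`, then time the message it generates for the peer.
    // This is the step that has to work out, from the change graph, what the peer is missing.
    doc.transact(|tx| tx.put(ROOT, "n", n + 1)).unwrap();
    let start = Instant::now();
    let _msg = doc.generate_sync_message(&mut doc_state);
    start.elapsed().as_nanos()
}

fn main() {
    println!("small graph: {} ns", time_reply(1_000));
    println!("large graph: {} ns", time_reply(10_000));
}
```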
Perhaps a dupe of, or a similar issue to, https://github.com/automerge/automerge/issues/596
When running the bakery example with Spanreed, I've noticed the following:

- At the beginning of running the program (as part of a three-process ensemble), CPU usage is a few percent, and sampling the process shows the repo thread not doing much (mostly blocked on the select).
- If you let the example run for a while, say 10 minutes, CPU usage goes to close to 100%, and the sample shows the repo thread spending most of its time running Repo::sync_documents, which is mostly running Automerge::clock_at, which seems to mostly spend time manipulating a btree.

Detailed view of Repo::sync_documents:

And a detailed view of Automerge::get_changes: