Looking at memory-infra in chrome://tracing/, it looks like we are running out of heap space and GCing to death. During startup (whilst the spinner is going) it has 1.3GB on the v8 heap (1.9GB of memory total), which is enough to grind to a halt. So the question is: what is taking up all the heap?
So it looks like during the spinner (whilst it's slowly going through saying "rageshake.js:65 Enabling encryption in !.......:matrix.org; starting to track device lists for all users therein"), we have about 700MB of heap - although this is almost all owned by 1650 Promise objects.
Then on snapshotting almost immediately after 'MatrixClient sync state => PREPARED', I caught it with 1.0GB of heap. There are now 8280 Promises, but the memory is instead all retained by MatrixEvents and RoomMembers, as we'd expect.
I wonder if there's some overlap phase between the two regimes where we temporarily try to allocate 2x700MB, which would be enough to nuke it.
The 1650 Promises which hold onto 700MB of heap during the spinner are all pending activities from the RemoteIndexedDBStoreBackend worker, apparently. Meanwhile there are a ton (1.2 million) of plain old Objects which retain most of the RAM: it looks like it might somehow be a raw JS struct representation of the full accumulated sync response? So perhaps the problem is that we end up holding both this and the resulting MatrixEvents, RoomMembers etc. created from it at the same time...
(this is obviously related to the overall 'running out of heap' bug over at #3307)
I'm wondering if one can do something really naive here, like nulling out the bits of the sync accumulator response once they've been parsed, so the GC can go and do its thing...
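Something like this, as a minimal sketch (the helper and field names here are hypothetical, not the actual sync accumulator API):

```js
// Minimal sketch of nulling out the raw sync response as it is consumed,
// so the GC can reclaim each room's raw JSON before the next one is parsed,
// instead of holding the whole raw response *and* the parsed model at once.
// processRoom / roomsData are hypothetical names for illustration only.
function hydrateFromSavedSync(syncData, processRoom) {
    const joined = syncData.roomsData.join;
    for (const roomId of Object.keys(joined)) {
        // build MatrixEvents / RoomMembers etc. from the raw JSON...
        processRoom(roomId, joined[roomId]);
        // ...then drop the reference to the raw JSON immediately.
        delete joined[roomId];
    }
}
```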
I've been looking at this for a while and all I can conclude is that the object graph that gets sent over from the worker process to the main one ends up a lot larger in the main proc than it was in the worker. The extra seems to be system / Maps, i.e. v8's internal hidden classes. Perhaps data being pulled out of indexeddb goes straight to dictionary-based objects, but objects deserialised from a structured clone get instantiated as the fast map objects?
The only thing I can think of doing to mitigate this is opening the indexeddb in the main process to pull the sync data out, then leaving the writes to be done in the worker?
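As a rough sketch of that split (database and store names here are made up for illustration; the real matrix-js-sdk schema isn't reproduced), the main process would do its own read-only open and pull the saved sync out directly, so the data is parsed straight from IndexedDB rather than arriving via a structured clone from the worker:

```js
// Sketch: read the saved sync directly in the main process; the worker keeps
// doing the writes. 'example-sync-db', 'sync' and 'saved_sync' are
// illustrative names, not the real matrix-js-sdk ones.
function loadSavedSyncInMainProcess() {
    return new Promise((resolve, reject) => {
        const openReq = indexedDB.open('example-sync-db');
        openReq.onerror = () => reject(openReq.error);
        openReq.onsuccess = () => {
            const db = openReq.result;
            const txn = db.transaction(['sync'], 'readonly');
            const getReq = txn.objectStore('sync').get('saved_sync');
            getReq.onsuccess = () => resolve(getReq.result);
            getReq.onerror = () => reject(getReq.error);
        };
    });
}
```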
I profiled earlier via chrome://tracing (given heap snapshots seem to kill my browser solid) and it seemed pretty sane - max heap was 800MB, dropping to 600MB after launch. It'd be nice if we could avoid the hidden classes by doing this in the main process, but in practice I'm not sure we should consider this a blocker for 0.14; perhaps we should deprioritise it from this milestone.
So something is definitely screwed here. I couldn't load my Riot/Web yesterday until I turned off sandboxing - however I caught a 1.6GB heap in the act (when Chrome paused because it realised it was about to OOM).
sierra:heapanalyser matthew$ ./heap-analyser.pl 1.6gb.heap
1 array 547763200 3485244
0 hidden 505548848 9756717
3 object 498130728 8462097
2 string 131988288 2452936
4 code 13962104 53921
5 closure 11067488 179646
7 number 6916080 432255
10 concatenated_string 4493200 112330
6 regexp 552776 9871
8 native 524624 347
11 sliced_string 19440 486
total_size: 1720967416
...but more worryingly:
object 5502556 274465336 Object
object 1451294 46441408 Array
object 728761 87451320 RoomMember
object 395500 50624000 MatrixEvent
object 245595 29471400 User
Where on earth has it got 728K RoomMembers from!?!?!?
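For reference, heap-analyser.pl itself isn't included here, but the columns above appear to be V8 node type, total self-size in bytes, and object count (and constructor / count / bytes for the per-object breakdown). A similar summary can be produced from a DevTools .heapsnapshot with a Node.js sketch along these lines, assuming the standard V8 snapshot JSON layout:

```js
// heap-summary.js - rough Node.js equivalent of the summary above (not the
// actual heap-analyser.pl). Aggregates self-size and count per V8 node type,
// plus the biggest constructors among 'object' nodes, from a .heapsnapshot.
// Assumes the standard V8 snapshot JSON layout saved by Chrome DevTools.
// (A snapshot this big may need a streaming JSON parser rather than
// readFileSync, which caps out around 1GB of string data.)
const fs = require('fs');

const snap = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));
const fields = snap.snapshot.meta.node_fields;   // e.g. ["type","name","id","self_size",...]
const stride = fields.length;
const TYPE = fields.indexOf('type');
const NAME = fields.indexOf('name');
const SIZE = fields.indexOf('self_size');
const typeNames = snap.snapshot.meta.node_types[TYPE]; // enum of node type names

const byType = new Map();  // type name   -> { size, count }
const byCtor = new Map();  // constructor -> { size, count } (object nodes only)

for (let i = 0; i < snap.nodes.length; i += stride) {
    const type = typeNames[snap.nodes[i + TYPE]];
    const size = snap.nodes[i + SIZE];
    const t = byType.get(type) || { size: 0, count: 0 };
    t.size += size; t.count += 1;
    byType.set(type, t);

    if (type === 'object') {
        const ctor = snap.strings[snap.nodes[i + NAME]];
        const c = byCtor.get(ctor) || { size: 0, count: 0 };
        c.size += size; c.count += 1;
        byCtor.set(ctor, c);
    }
}

for (const [type, v] of [...byType].sort((a, b) => b[1].size - a[1].size)) {
    console.log(type, v.size, v.count);
}
for (const [ctor, v] of [...byCtor].sort((a, b) => b[1].count - a[1].count).slice(0, 10)) {
    console.log('object', v.count, v.size, ctor);
}
```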
https://github.com/matrix-org/riot-web-rageshakes/issues/712 is the rageshake.
(Erik was also having problems)
Looking in my 1.6GB heap with Chrome dev tools there's something clearly wrong - it looks like we might be leaking react contexts or timeline windows or something.
I just caught this in the act again. It looks like I've ended up with duplicates of all my Rooms at the very least - and on comparing my two copies of HQ in the heap dump, it even looks like I have two IndexedDBStores; one owned by the main matrixclient, the other seemingly retained mainly by React components (e.g. RoomList's _visibleRooms?!). I've captured the heap, and I'm leaving Chrome running (breakpointed) in an effort to try to work out wtf has happened.
There looks to be a smoking gun at:
23:40:44.367 rageshake.js:65 IndexedDBStore.startup: connecting to backend
23:40:44.367 rageshake.js:65 MatrixClientPeg: waiting for MatrixClient store to initialise
23:40:46.333 rageshake.js:65 Switching to room alias #riot-dev:matrix.org at event undefined
23:40:46.334 /develop/#/room/#riot-dev:matrix.org:1 Slow network is detected. Fallback font will be used while loading: https://riot.im/develop/fonts/Open_Sans/OpenSans-Regular.ttf
23:40:46.869 rageshake.js:65 Presence: online
23:40:51.954 rageshake.js:65 IndexedDB worker is ready
23:40:51.954 indexeddb-local-backend.js?79bc:122 LocalIndexedDBStoreBackend.connect: connecting
23:40:51.954 indexeddb-worker.js:3748 LocalIndexedDBStoreBackend.connect: connecting
23:40:51.955 indexeddb-local-backend.js?79bc:144 LocalIndexedDBStoreBackend.connect: awaiting connection
23:40:51.955 indexeddb-worker.js:3765 LocalIndexedDBStoreBackend.connect: awaiting connection
23:40:51.956 indexeddb-local-backend.js?79bc:148 LocalIndexedDBStoreBackend.connect: connected
23:40:51.956 indexeddb-worker.js:3767 LocalIndexedDBStoreBackend.connect: connected
23:40:51.956 indexeddb-local-backend.js?79bc:343 LocalIndexedDBStoreBackend: loading account data
23:40:51.956 indexeddb-worker.js:3986 LocalIndexedDBStoreBackend: loading account data
23:40:51.956 indexeddb-local-backend.js?79bc:347 LocalIndexedDBStoreBackend: loaded account data
23:40:51.956 indexeddb-worker.js:3988 LocalIndexedDBStoreBackend: loaded account data
23:40:51.956 indexeddb-local-backend.js?79bc:363 LocalIndexedDBStoreBackend: loaded sync data
23:40:51.956 indexeddb-worker.js:4004 LocalIndexedDBStoreBackend: loaded sync data
23:40:51.956 indexeddb-local-backend.js?79bc:367 LocalIndexedDBStoreBackend: loaded sync data
23:40:51.957 indexeddb-worker.js:4006 LocalIndexedDBStoreBackend: loaded sync data
23:40:54.581 indexeddb-local-backend.js?79bc:172 LocalIndexedDBStoreBackend: loaded initial data
23:40:54.582 indexeddb-worker.js:3792 LocalIndexedDBStoreBackend: loaded initial data
23:40:54.924 rageshake.js:65 IndexedDBStore.startup: loading presence events
23:40:54.992 rageshake.js:65 IndexedDBStore.startup: processing presence events
...the duplicated log lines make it look like there may be two indexeddb stores right from the outset? I don't understand why the log locations don't match.
Circumstantially (from the two rageshakes I have of this happening), it looks a bit like we're okay if the ordering of indexeddb startup is:
2018-02-26T13:48:48.091Z I IndexedDB worker is ready
2018-02-26T13:48:48.248Z I IndexedDBStore.startup: connecting to backend
...but badness happens if it's...
23:40:44.367 rageshake.js:65 IndexedDBStore.startup: connecting to backend
23:40:51.954 rageshake.js:65 IndexedDB worker is ready
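If that ordering really is what matters, one mitigation would be to make the store hold its connect until the worker has announced it is ready - a sketch with made-up message names, not the real RemoteIndexedDBStoreBackend protocol:

```js
// Sketch: don't let connect() race the worker startup. connect() waits for a
// 'worker_ready' message first, so the "connecting to backend" vs "worker is
// ready" ordering can no longer matter. Message names are illustrative.
class WorkerBackedStoreSketch {
    constructor(worker) {
        this._worker = worker;
        this._workerReady = new Promise((resolve) => {
            worker.addEventListener('message', (ev) => {
                if (ev.data && ev.data.command === 'worker_ready') resolve();
            });
        });
    }

    async connect() {
        await this._workerReady;                          // wait for the ready signal
        this._worker.postMessage({ command: 'connect' }); // then ask the worker to connect
    }
}
```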
The other weird thing is that according to the heap I even have two MatrixClientPegs, both stored in a field called mxMatrixClientPeg on the Window object - except only one Window is the current live one; the other seems to be old and retained only by chunks of DOM? I loaded this particular instance by either hitting Cmd-R or the reload-to-latest-app button - I wonder if the whole stale previous version of the app is still hanging around somehow?!
Right, from counting the objects on the heap, this seems to be the case. We have two of everything, other than some components like MessageComposer which only exist because they're kept alive by stale DOM references. Perhaps it's even possible that the duplicate log lines from indexeddb are because we've ended up with two workers picking up the same postMessages? (Although only one worker shows up in the heap profiler dropdown.)
In case this is idiocy caused by window.location.reload not being strong enough, I've just pushed https://github.com/vector-im/riot-web/commit/aa540325004cbdb1ffa66c5c1ff955f6cee0f045.
I've also pushed matrix-org/matrix-js-sdk@beafd59 just in case we're somehow managing to get the same indexeddb worker process to connect twice.
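Roughly, the guard being aimed at is just making connect() idempotent - a sketch rather than the actual patch:

```js
// Sketch: a second connect() call reuses the in-flight connection instead of
// opening the backend (and loading all the data) a second time.
class LocalBackendSketch {
    connect() {
        if (!this._connectPromise) {
            this._connectPromise = this._openAndLoad();
        }
        return this._connectPromise;
    }

    async _openAndLoad() {
        // ...open the IndexedDB database and load account + sync data...
    }
}
```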
Just got it again; precisely the same behaviour after hitting the 'update' button from the 'recent updates' dialog box, despite the tweaks above. Again, two MatrixClients - one owned by the active Window; the other owned by random stale react components from the DOM from god knows where.
I think our only recourse now might be to try to get React & the service worker to do an orderly shutdown before calling window.location.reload - or possibly to remove the button entirely and replace it with a "refresh your browser to upgrade" message?
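The orderly-shutdown option would look something like this - a sketch only, with an illustrative root element id and peg accessor rather than the actual riot-web wiring:

```js
// Sketch: stop the client and unmount the React tree before reloading, so the
// old app can't survive the reload via stale DOM / component references.
// 'matrixchat' and MatrixClientPeg.get() are illustrative here.
function shutdownThenReload() {
    const client = MatrixClientPeg.get();
    if (client) {
        client.stopClient();   // stop syncing and cancel timers
    }
    // Unmount so components drop their references to the store / client.
    ReactDOM.unmountComponentAtNode(document.getElementById('matrixchat'));

    window.location.reload(true);
}
```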
One difference this time is that we don't see duplicate logging from index workers any more.
Also, it takes 3 minutes to load the sync response from LocalIndexedDBStoreBackend??
Initialised rageshake: See https://bugs.chromium.org/p/chromium/issues/detail?id=583193 to fix line numbers on Chrome.
10:41:44.609 rageshake.js:65 set language to en
10:41:44.645 index.js:218 GET https://riot.im/develop/config.riot.im.json 404 (Not Found)
10:41:44.671 rageshake.js:65 Vector starting at https://riot.im/develop/#/room/#polynomial-supporters:matrix.org
10:41:44.695 rageshake.js:65 Restoring session for @matthew:matrix.org
10:41:44.709 rageshake.js:65 setLoggedIn: mxid: @matthew:matrix.org deviceId: MWPGAWRKQL guest: false hs: https://matrix.org
10:41:44.718 rageshake.js:65 Session persisted for @matthew:matrix.org
10:41:44.727 rageshake.js:65 Lifecycle: Starting MatrixClient
10:41:44.741 rageshake.js:65 connecting to indexeddb matrix-js-sdk:crypto
10:41:44.762 rageshake.js:65 connected to indexeddb matrix-js-sdk:crypto
10:41:44.777 rageshake.js:65 Initialised anonymous analytics
10:41:44.838 rageshake.js:65 Presence: online
10:41:45.020 rageshake.js:65 IndexedDBStore.startup: connecting to backend
10:41:45.021 rageshake.js:65 MatrixClientPeg: waiting for MatrixClient store to initialise
10:41:45.081 rageshake.js:65 Switching to room alias #polynomial-supporters:matrix.org at event undefined
10:41:45.106 rageshake.js:65 IndexedDB worker is ready
10:41:45.107 indexeddb-local-backend.js?79bc:125 LocalIndexedDBStoreBackend.connect: connecting...
10:41:45.108 indexeddb-local-backend.js?79bc:147 LocalIndexedDBStoreBackend.connect: awaiting connection...
10:41:45.109 indexeddb-local-backend.js?79bc:151 LocalIndexedDBStoreBackend.connect: connected
10:41:45.110 indexeddb-local-backend.js?79bc:346 LocalIndexedDBStoreBackend: loading account data...
10:41:45.110 indexeddb-local-backend.js?79bc:368 LocalIndexedDBStoreBackend: loading sync data...
10:41:45.117 indexeddb-local-backend.js?79bc:355 LocalIndexedDBStoreBackend: loaded account data
10:41:45.549 rageshake.js:65 Switching to room alias #polynomial-supporters:matrix.org at event undefined
10:41:48.026 indexeddb-local-backend.js?79bc:377 LocalIndexedDBStoreBackend: loaded sync data
10:41:48.027 indexeddb-local-backend.js?79bc:175 LocalIndexedDBStoreBackend: loaded initial data
10:41:48.429 rageshake.js:65 IndexedDBStore.startup: loading presence events
10:41:48.502 rageshake.js:65 IndexedDBStore.startup: processing presence events
10:41:48.510 rageshake.js:65 MatrixClientPeg: really starting MatrixClient
10:41:48.513 rageshake.js:65 MatrixClientPeg: MatrixClient started
10:41:48.628 rageshake.js:65 Got TURN URIs: turn:turn.matrix.org:3478?transport=udp,turn:turn.matrix.org:3478?transport=tcp,turns:turn.matrix.org:443?transport=tcp refresh in 86400 secs
10:41:49.017 rageshake.js:65 Looking for queued outgoing room key requests
10:41:49.026 rageshake.js:65 No more outgoing room key requests
Our best theory right now is that this isn't caused by clicking the 'update' button, but possibly by Chrome dev tools keeping the old v8 data around if it's open? (That said, I had an Aw Snap yesterday during launch irrespective of that.)
This stopped happening. I reckon it was a Chrome bug causing Chrome not to flush its DOM correctly when you call window.location.reload.
We are now using roughly half as much heap, but something seems to have changed in the memory allocation patterns such that we are at much higher risk of OOMing at startup than before.