BIDMCDigitalPsychiatry / LAMP-platform

The LAMP Platform (issues and documentation).
https://docs.lamp.digital/

Dashboard taking forever to load or never loading #722

Closed Tuna9129 closed 1 year ago

Tuna9129 commented 1 year ago

Describe the bug: An investigator/researcher account with the id 'yhe0wtfn6n6sbvsan0js' is not loading at all from the admin user. I have tried on different browsers (on a MacBook Pro) and my coworkers have tried as well, with no luck. I have tried on both the regular and staging dashboard. It is stuck on the loading wheel and I cannot click on anything.

We have also observed that sometimes, other investigator/researcher accounts will take a very long time to load. But after waiting a few hours or a day, it will work fine. However, this study has not been accessible at all since early yesterday.

Past similar/related issues?: https://github.com/BIDMCDigitalPsychiatry/LAMP-platform/issues/491 https://github.com/BIDMCDigitalPsychiatry/LAMP-platform/issues/515 https://github.com/BIDMCDigitalPsychiatry/LAMP-platform/issues/625

To Reproduce

  1. Go to dashboard
  2. Login
  3. Click into the study
  4. Page will be slightly greyed out with the loading spinner going on forever

Expected behavior: After clicking into the study, it should load within a few minutes.

Screenshots

(Screenshot: 2022-12-08 at 15:09:51, showing the page stuck on the loading spinner.)


Additional context: A video showing the issue; the study won't load even if I leave it open for 30 minutes. https://user-images.githubusercontent.com/105741216/206572589-45521fe9-9c6a-42e1-a663-25d9d9357024.mov

falmeida-orangeloops commented 1 year ago

Here's what I've found out so far: the list of participants displayed by the dashboard is obtained from the server by issuing a POST request to the server's / endpoint. The request body is a string representing a (pretty complex) JSONata query that includes the corresponding researcherId.
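
For context, here is a minimal sketch of how such a request could be issued. This is not the dashboard's actual code; the runJsonataQuery name, the content type, and the Basic auth scheme are illustrative assumptions, and the JSONata expression itself is passed in by the caller.

```
// Minimal sketch (not the dashboard's actual code) of sending a JSONata query
// string as the raw body of a POST request to the server's "/" endpoint.
async function runJsonataQuery(baseUrl: string, credentials: string, query: string): Promise<unknown> {
  const response = await fetch(`${baseUrl}/`, {
    method: "POST",
    headers: {
      "Content-Type": "text/plain",           // assumption: content type used for the query string
      Authorization: `Basic ${credentials}`,  // assumption: auth scheme
    },
    body: query, // the JSONata expression, which embeds the researcherId
  });
  if (!response.ok) throw new Error(`Query failed with status ${response.status}`);
  return response.json();
}
```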

For most researcherIds, this query returns more or less quickly, but it seems that for some of them (such as yhe0wtfn6n6sbvsan0js) the server hangs while trying to resolve it. After a while, the request times out, but the dashboard does not take notice of this; instead, it keeps displaying the loading state and goes on this way forever. So, to me there are two problems to be solved (which I'm working on now):

  1. The server hanging with certain queries (here) and not delivering a response (I think it should at least return an error)
  2. The dashboard not responding to possible errors arising from the JSONata query, such as a timeout (see the sketch after this list)
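
For problem 2, one possible direction is to enforce a client-side deadline on the query and surface the failure to the UI instead of spinning forever. This is only a sketch under assumptions (the function name and the 60-second deadline are illustrative), not the dashboard's actual code:

```
// Abort the request after a deadline so the UI can show an error state
// instead of an endless spinner. Names and the timeout value are illustrative.
async function runQueryWithTimeout(baseUrl: string, query: string, timeoutMs = 60_000): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(`${baseUrl}/`, { method: "POST", body: query, signal: controller.signal });
    if (!response.ok) throw new Error(`Query failed with status ${response.status}`);
    return await response.json();
  } finally {
    clearTimeout(timer);
  }
}
```
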
Tuna9129 commented 1 year ago

Hello! Thank you very much for the update and explanation. That sounds great!

falmeida-orangeloops commented 1 year ago

I've got an update on issue 1.

The JSONata query that is issued whenever the user tries to view a researcher in the dashboard basically asks the server to deliver a bunch of data for all the activities belonging to that researcher (the word "all" is important here). These data are organized into several fields. What I did was remove fields from the request one by one to figure out whether any of them was causing the problem.

What I found is that for this particular user the request failed unless I removed settings from the set of retrieved fields. I don't know exactly what kind of data it holds, but it looks like parameters and settings that tune the way the activity behaves. Here's an example of what the settings field contains for one activity:

"settings": {
  "bubble_count": [
    60,
    80,
    80
  ],
  "bubble_speed": [
    60,
    80,
    80
  ],
  "intertrial_duration": 0.5,
  "bubble_duration": 1
}

Another thing I noticed is that this particular user has as many as 11980 activities! That is a ton of data to be downloaded all at once. So the next thing I did was to try lighter queries, say asking for the first 1000 activities, then the following 1000, and so on. This way the queries don't fail, but the responses from the server are still very heavy; see the sizes below (a sketch of this batching approach follows the list):

[1st - 1000th] 200 286.4 MB
[1001st - 2000th] 200 273.1 MB
[2001st - 3000th] 200 264.5 MB
[3001st - 4000th] 200 259.7 MB
...
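
To make the batching approach concrete, here is an illustrative loop. It assumes a hypothetical fetchActivityBatch(offset, limit) helper that asks the server for one slice of the researcher's activities; this is not the dashboard's actual code, only the shape of the idea:

```
// Fetch a researcher's activities in fixed-size slices instead of all at once.
// fetchActivityBatch is a hypothetical helper, not an existing LAMP API.
async function fetchAllActivitiesInBatches(
  fetchActivityBatch: (offset: number, limit: number) => Promise<unknown[]>,
  total: number,
  batchSize = 1000
): Promise<unknown[]> {
  const all: unknown[] = [];
  for (let offset = 0; offset < total; offset += batchSize) {
    const batch = await fetchActivityBatch(offset, batchSize); // one lighter query per slice
    all.push(...batch);
  }
  return all;
}
```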

So to sum up, I think the problem boils down to the queries being too heavy to handle when the researcher in question (like this one) has too many activities. I think this could explain similar situations with other researchers, and also the server failing "randomly" (it might be related to server load; the more loaded the server is, the harder it is for it to handle these heavy queries).

I suspect that only a fraction of what the dashboard asks the server for in this query is actually needed by the dashboard. If so, the query could be simplified and the problem would be solved. The next thing I'm going to do is figure out which parts are actually necessary and then cut everything else out of the query.
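
As a rough illustration of what "cutting everything else out" could look like, a JSONata projection can keep just a handful of fields per activity. The field names below ("id", "name", "spec") are assumptions, and the expression is evaluated client-side with the jsonata package purely for demonstration; in LAMP the query actually runs on the server:

```
// Demonstration only: project each activity down to a few light fields,
// dropping heavy ones such as settings. Field names are assumptions.
import jsonata from "jsonata";

async function projectActivities(activities: unknown): Promise<unknown> {
  const expression = jsonata('$.{ "id": id, "name": name, "spec": spec }');
  return expression.evaluate(activities); // evaluate() returns a Promise in jsonata 2.x
}
```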

avaidyam commented 1 year ago

@falmeida-orangeloops That’s a fantastic write-up and diagnosis. Thank you!

> Another thing I noticed is that this particular user has as many as 11980 activities!

This seems very odd but does make sense…

@falmeida-orangeloops Could you try to snoop on the data and check?

falmeida-orangeloops commented 1 year ago

@avaidyam Actually response sizes vary within a pretty wide range (from < 1 kB to > 2 MB)!

As an example, this is the size of the response for the first 50 activities (I can provide you with the full list of activity IDs being retrieved so you can check some of them if you'd like):

```
[1/11980] 46.3 kB
[2/11980] 1.6 kB
[3/11980] 37.7 kB
[4/11980] 55.9 kB
[5/11980] 37.8 kB
[6/11980] 38.1 kB
[7/11980] 1.9 kB
[8/11980] 197.4 kB
[9/11980] 48.1 kB
[10/11980] 2.1 kB
[11/11980] 29.7 kB
[12/11980] 1.1 kB
[13/11980] 2.0 kB
[14/11980] 287 Bytes
[15/11980] 11.7 kB
[16/11980] 362 Bytes
[17/11980] 31 Bytes
[18/11980] 39 Bytes
[19/11980] 1.5 MB
[20/11980] 1.4 MB
[21/11980] 1.1 MB
[22/11980] 2.4 MB
[23/11980] 270 Bytes
[24/11980] 55.9 kB
[25/11980] 54 Bytes
[26/11980] 10 Bytes
[27/11980] 245 Bytes
[28/11980] 281 Bytes
[29/11980] 1.0 MB
[30/11980] 2.5 MB
[31/11980] 1.5 MB
[32/11980] 171 Bytes
[33/11980] 169 Bytes
[34/11980] 169 Bytes
[35/11980] 1.0 MB
[36/11980] 2.5 MB
[37/11980] 1.1 MB
[38/11980] 2.4 MB
[39/11980] 286 Bytes
[40/11980] 264 Bytes
[41/11980] 1.3 kB
[42/11980] 287 Bytes
[43/11980] 2.0 kB
[44/11980] 48.1 kB
[45/11980] 1.6 kB
[46/11980] 10 Bytes
[47/11980] 31 Bytes
[48/11980] 37.7 kB
[49/11980] 39 Bytes
[50/11980] 197.4 kB
```

Turns out that all of the activities that are > 1 MB in size (at least among the ones I could check) have full Base64-encoded audio files embedded in them; that's why they are so heavy. The dashboard is trying to download all of them (probably hundreds or thousands) at the same time!

avaidyam commented 1 year ago

@falmeida-orangeloops That makes a lot of sense! If you mask out the particular field with the base64 data, does the total response size drop? What is the new total response size per 10000 events? Also, could you share what the spec field reports for the items that have this data?

falmeida-orangeloops commented 1 year ago

@avaidyam If I mask out that audio field, the total size of the response for all 11980 items is 18 kB, which is much more reasonable!

> Also, could you share what the spec field reports for the items that have this data?

All 2010 of them have lamp.breathe in the spec field.

avaidyam commented 1 year ago

That's fantastic! Could you add a comment to the code explaining why we're masking out this particular field? In the future we will likely want to dynamically look up which fields are base64 data and mask them out instead of hardcoding it.
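
On the "dynamically look up which fields are base64 data" idea, here is a rough sketch of what such masking could look like. The heuristics, threshold, and names are assumptions; this is not the dashboard's implementation, only an illustration of the dynamic approach mentioned above:

```
// Sketch: recursively replace any string that looks like a large base64 payload
// (e.g. an embedded data URI) with a placeholder. Heuristics are assumptions.
function maskBase64Fields(value: unknown, minLength = 10_000): unknown {
  const looksLikeBase64 = (s: string) =>
    s.length >= minLength &&
    (/^data:[^;]+;base64,/.test(s) || /^[A-Za-z0-9+\/=\s]+$/.test(s));
  if (typeof value === "string") return looksLikeBase64(value) ? "<masked base64 data>" : value;
  if (Array.isArray(value)) return value.map((v) => maskBase64Fields(v, minLength));
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [k, maskBase64Fields(v, minLength)])
    );
  }
  return value;
}
```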

falmeida-orangeloops commented 1 year ago

> Could you add a comment to the code explaining why we're masking out this particular field?

Of course! I'll get back when it's ready.

falmeida-orangeloops commented 1 year ago

@avaidyam I removed the audio field from the query. It's working now 👌

Pull Request (just for your information): BIDMCDigitalPsychiatry/LAMP-dashboard#705

Two last comments:

avaidyam commented 1 year ago

Thanks! I think you can go ahead and remove settings entirely. Also, could you use Error Boundaries for item 2?
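
For reference, a React Error Boundary along the lines suggested here is a class component that catches errors thrown while rendering its children and shows a fallback UI instead. A minimal sketch follows (component and message names are illustrative, not dashboard code; note that boundaries do not catch errors in async callbacks, so a failed fetch still needs to be turned into render-time error state):

```
// Minimal React Error Boundary sketch. Names are illustrative only.
import React from "react";

type State = { hasError: boolean };

class QueryErrorBoundary extends React.Component<React.PropsWithChildren<{}>, State> {
  state: State = { hasError: false };

  // Switch to the fallback UI when a child component throws during render.
  static getDerivedStateFromError(): State {
    return { hasError: true };
  }

  componentDidCatch(error: Error) {
    console.error("Failed to render the participant list:", error);
  }

  render() {
    if (this.state.hasError) return <p>Something went wrong while loading this study.</p>;
    return this.props.children;
  }
}
```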

falmeida-orangeloops commented 1 year ago

> I think you can go ahead and remove settings entirely

Nice!

> Also, could you use Error Boundaries for item 2?

I'm not sure but I'll try that. Thanks!

falmeida-orangeloops commented 1 year ago

Dashboard not loading for this user (and others with similar conditions) is solved by BIDMCDigitalPsychiatry/LAMP-dashboard#705.

Tuna9129 commented 1 year ago

Hello! Thank you for the updates. I have confirmed that the dashboards now load in the staging environment. It takes a few minutes, but it works great now!

falmeida-orangeloops commented 1 year ago

Great to know!

I can dig more into reducing loading times if you think it's worthwhile. Just let me know.

avaidyam commented 1 year ago

It definitely would be worth it, but perhaps @Tuna9129 could share the Chrome DevTools Network log? It will help us better gauge whether this is a network transfer issue or now a dashboard UI/loading issue (which would be new/separate).

Tuna9129 commented 1 year ago

Hello! I just tried again, and it seems a bit faster now. I'm not sure we need to reduce loading times if we just have to wait about 2 minutes, but here is a screenshot of the Chrome DevTools Network log (I think)! I can also try to send a HAR file if necessary.

(Screenshot of the Chrome DevTools Network log.)

avaidyam commented 1 year ago

@Tuna9129 Thanks for the info! I think it's still worth looking into, and it's a great idea to share the HAR file. Could you email it to @falmeida-orangeloops & co (cc @ertjlane), since it would contain security credentials? I'd like to see whether it's worthwhile to remove all the repeated calls to the one resource in the logs.

Tuna9129 commented 1 year ago

Okay, I'll do that! Thanks for your help!