ctreffe / alfred

Alfred - A library for rapid experiment development
MIT License

Harden Randomizer against more sessions trying to access the randomization slot list at the same time #206

Closed jobrachem closed 2 years ago

jobrachem commented 2 years ago

Quoting @ctreffe (#205):

Currently, I think there are at least two problems with alfred's ListRandomizer, resulting in sessions ending with an exception and experiments not filling their randomizer slots as expected. I will discuss these problems in the same issue since I cannot rule out that they are connected.

In our latest set of online experiments, we see the following exception multiple times in our logfiles:

[Screenshot: 2022-06-08 21_40_37-Alfred Error Logentry]

As far as I know, this has happened in every online experiment since the v2.3.2 patch for alfred. For my own experiment, I was able to trace the problem back to several experimental sessions starting at almost exactly the same time. From my experimental dataset and the log, I was able to identify seven sessions starting within a span of 40 milliseconds. All of these sessions tried to acquire a randomization slot at almost exactly the same time, which according to @jobrachem might lead to the above error message. The exception is a fallback for the improbable case that too many sessions try to access the randomization slot list at the same time. Of the seven sessions mentioned, five exited the experiment with an exception, and only two continued regularly.

Of course, given the average number of sessions per day on the server (fewer than 100 for my experiment), it is highly unlikely that seven participants tried to start the same experiment at the same time, only 5 milliseconds apart. The underlying cause seems to be performance issues in the university network, most likely with the outer firewall. Recently, I have run into a couple of performance issues (e.g., experiments taking a long time to start, pages loading at only 50 kb/s, JavaScript components not working because the script files load slowly). These issues are mostly temporary, and after a while (or a couple of hours) the performance normalizes. As far as I can tell from my logfile, the error might occur during such a performance bottleneck, in which multiple server requests build up (I suspect because one or more participants repeatedly try to reload an experiment's landing page). I suspect the firewall because we have had similar issues with it in the past (remember when the firewall dropped some of the requests for resource files due to performance problems a couple of months back), and our server shows no indication of coming anywhere near its maximum available resources (CPU, network, memory, etc. in the VM performance monitor). Also, I have seen some other network performance problems with VMs yesterday and today (the above error is from yesterday).

In summary, I think we need to harden our Randomizer code against many sessions trying to access the randomization slot list at the same time, so that sessions no longer end prematurely with the above exception.
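To make the idea of "hardening" a bit more concrete, here is a minimal sketch of one possible approach: retrying the slot claim with capped exponential backoff and random jitter, so that a burst of sessions spreads out instead of failing together. This is not alfred's actual implementation. `SlotList`, `try_claim`, `claim_with_backoff`, `SlotsBusyError`, and `simulate_burst` are hypothetical names, the slot list is an in-memory stand-in guarded by a `threading.Lock`, and the delays and retry budget are arbitrary assumptions.

```python
import random
import threading
import time


class SlotsBusyError(RuntimeError):
    """Raised when a session exhausts its retry budget without getting a slot."""


class SlotList:
    """Hypothetical in-memory stand-in for the shared randomization slot list."""

    def __init__(self, conditions):
        self._lock = threading.Lock()
        self._open = list(conditions)
        self.assigned = {}  # session_id -> condition

    def try_claim(self, session_id):
        """Return a condition, or None if another session is using the list."""
        # Non-blocking acquire: a busy list is reported, not waited on.
        if not self._lock.acquire(blocking=False):
            return None
        try:
            if not self._open:
                raise LookupError("no open randomization slots left")
            time.sleep(0.001)  # assumption: simulates a storage round trip
            condition = self._open.pop(0)
            self.assigned[session_id] = condition
            return condition
        finally:
            self._lock.release()


def claim_with_backoff(slots, session_id, max_attempts=50,
                       base_delay=0.002, max_delay=0.1):
    """Retry try_claim with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        condition = slots.try_claim(session_id)
        if condition is not None:
            return condition
        # Sleep a random time in [0, delay] so competing sessions
        # do not all retry at the same instant.
        delay = min(max_delay, base_delay * 2 ** attempt)
        time.sleep(random.uniform(0, delay))
    raise SlotsBusyError(
        f"session {session_id}: slot list busy after {max_attempts} attempts"
    )


def simulate_burst(n_sessions, conditions):
    """Start n_sessions threads at once and return {session_id: condition}."""
    slots = SlotList(conditions)
    threads = [
        threading.Thread(target=claim_with_backoff, args=(slots, i))
        for i in range(n_sessions)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(slots.assigned)
```

Calling `simulate_burst(7, ["a", "b"] * 4)` mimics the seven near-simultaneous sessions described above. The jitter matters as much as the retry itself: with a fixed delay, sessions that collided once would tend to collide again on every retry. In a real deployment the lock would have to live in whatever shared storage holds the slot list, which this sketch does not model.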