Gig-o-Matic / GO3

GNU General Public License v3.0
10 stars 8 forks source link

Email storm? #449

Open bklang opened 7 months ago

bklang commented 7 months ago

SFMA sent thousands of emails related to gig changes. The band admin says she only hit save a few times, and that way more emails were sent than the number of times she saved changes. Needs investigation.

aaronoppenheimer commented 7 months ago

Eek, that's weird. Did it happen as part of the import?

On Mon, Apr 22, 2024, at 10:44 PM, Ben Klang wrote:

SFMA sent thousands of emails related to gig changes. The band admin says she only hit save a few times, and that way more emails were sent than the number of times she saved changes. Needs investigation.

— Reply to this email directly, view it on GitHub https://github.com/Gig-o-Matic/GO3/issues/449, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWDDNSAKPDY673WHPIDZGTY6XDKNAVCNFSM6AAAAABGT5NPJGVHI2DSMVQWIX3LMV43ASLTON2WKOZSGI2TOOBSGY4DEOA. You are receiving this because you are subscribed to this thread.Message ID: @.***>

aaronoppenheimer commented 7 months ago

Seed & Feed is really big. Looking at how gig emails are generated, I wonder if it timed out and started over a few times. It is very inefficient - it generates the changes for the gig email for every member instead of just doing it once and emailing it to everyone...

On Mon, Apr 22, 2024, at 11:06 PM, Aaron Oppenheimer wrote:

Eek, that's weird. Did it happen as part of the import?

On Mon, Apr 22, 2024, at 10:44 PM, Ben Klang wrote:

SFMA sent thousands of emails related to gig changes. The band admin says she only hit save a few times, and that way more emails were sent than the number of times she saved changes. Needs investigation.

— Reply to this email directly, view it on GitHub https://github.com/Gig-o-Matic/GO3/issues/449, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWDDNSAKPDY673WHPIDZGTY6XDKNAVCNFSM6AAAAABGT5NPJGVHI2DSMVQWIX3LMV43ASLTON2WKOZSGI2TOOBSGY4DEOA. You are receiving this because you are subscribed to this thread.Message ID: @.***>

aaronoppenheimer commented 7 months ago

Hm. The async_task call made in email.py has "ack_failure=True" which should cause failed tasks to just be skipped, and there's nothing in the "failed tasks" list in the DjangoQ section of the admin page.

One thing that might help - we use the simple_history system to maintain the history of objects like gigs. We can enable browsing of the history in the admin site as described here. Maybe that would show if the gigs got saved a bunch of times and that's what triggered all the emails.

bklang commented 7 months ago

Both are great ideas. I was looking at ack_failures too. For tonight I’ve disabled the email integration to be safe. Will jump on it in the morning.On Apr 22, 2024, at 11:59 PM, aaronoppenheimer @.***> wrote: Hm. The async_task call made in email.py has "ack_failure=True" which should cause failed tasks to just be skipped, and there's nothing in the "failed tasks" list in the DjangoQ section of the admin page. One thing that might help - we use the simple_history system to maintain the history of objects like gigs. We can enable browsing of the history in the admin site as described here. Maybe that would show if the gigs got saved a bunch of times and that's what triggered all the emails.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: @.***>

bklang commented 7 months ago

I think I found the problem. I wrote a task that counted the elapsed time in seconds. After 30 seconds, the task was terminated, and then immediately restarted. This was even with ack_failure=True. Here's the output (1 second delay between lines):

Process-617985f99fe54a919e2ac39215cb88a2 processing fix-cat-alanine-oranges 'lib.email.do_slow_work'
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
reincarnated worker Process-617985f99fe54a919e2ac39215cb88a2 after timeout
couldn't import psycopg 'c' implementation: No module named 'psycopg_c'
loading member signals
loading band signals
loading motd signals
loading gig signals
Process-9e2b35e928c249d69e99b46f70194682 ready for work at 36497
queueing from DjangORM
Process-9e2b35e928c249d69e99b46f70194682 processing fix-cat-alanine-oranges 'lib.email.do_slow_work'
0
aaronoppenheimer commented 7 months ago

Yeah, this is how BABAM used to break the gig-o whenever they added 100 members or whatever! Great job finding the configuration typo. I'll think about how to do the email sending more efficiently (or at least how to chunk it up so a single task isn't sending email to 300 people).

bklang commented 7 months ago

Issue is not yet resolved, just worked around. Leaving this open. Next step is to try to resolve the upstream issue

aaronoppenheimer commented 7 months ago

We should also just lower the burden on sending these emails. For every member we go through the process of calculating the changes from the last version of the gig - could probably just do this once!

bklang commented 7 months ago

We should also just lower the burden on sending these emails. For every member we go through the process of calculating the changes from the last version of the gig - could probably just do this once!

It certainly wouldn't hurt. That said, it appears the slowest part of the transaction is the API call to SendGrid. If we could batch those, it would make a pretty huge difference.