Thanks for summarizing and collecting links to convos! You are right, I think message deliverability is the most important thing to solve at the moment. It's just that nobody has had the time or the needed knowledge.
It's intermittent, e.g. email notifications worked fine for me this week, but it's clear from feedback that they're not performing well. Locally I haven't been able to reproduce it despite a few attempts, and the error logs don't offer any hints.
I tried to do a test on my machine with two fake users, and I couldn't reproduce the issue but I noticed a DeprecationWarning: collection.update is deprecated. Use updateOne, updateMany or bulkWrite instead.
These are MongoDB API warnings asking us to upgrade the code, but they won't cause issues until the libraries are updated to more recent versions. Good to update those methods of course. :-)
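As a hedged sketch of what that migration looks like with the Node.js MongoDB driver (the collection name, query and field below are illustrative, not taken from the Trustroots codebase):

```js
const { MongoClient } = require('mongodb');

async function markMessageNotified(messageId) {
  const client = await MongoClient.connect('mongodb://localhost/trustroots-dev');
  try {
    const messages = client.db().collection('messages');

    // Deprecated form that triggers the warning:
    // await messages.update({ _id: messageId }, { $set: { notified: true } });

    // Replacement: explicit about touching a single document.
    await messages.updateOne({ _id: messageId }, { $set: { notified: true } });
  } finally {
    await client.close();
  }
}
```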
I would say that the first thing to do could be solving that warning and any others, then removing dead code (Facebook notifications? I never received any notifications on Facebook in 18 months).
Likely unrelated to the deliverability issues as well, but good to clean out indeed! (Doing so in https://github.com/Trustroots/trustroots/pull/2411). FB doesn't support this type of notification anymore, if I remember correctly.
We had at least one or two big spam waves recently. While we did deal with them, we should try to see if there is still spam being sent from TR profiles, and add better protections/monitoring.
Spam gets flagged at different mail services (Gmail, Hotmail, etc.) and receiving anything else from the same IP/domain becomes much harder and slower.
It pulls our domain's/IP's "reputation" down.
Each sent message and notification generates a "job" in the database, and cleaning these up is currently manual. We should add some code to remove each job after it succeeds, or some daily/weekly cleanup.
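A minimal sketch of the "remove after success" option, assuming a recent Agenda version where job.remove() returns a promise (the connection string is a placeholder; in Trustroots the existing agenda instance from the worker would be used):

```js
const Agenda = require('agenda');

const agenda = new Agenda({ db: { address: 'mongodb://localhost/trustroots-dev' } });

// Delete one-off jobs as soon as they finish successfully, so agendaJobs
// doesn't keep growing until someone runs a manual cleanup.
agenda.on('success', (job) => {
  // Keep recurring jobs (they have a repeat interval); only remove one-off jobs.
  if (!job.attrs.repeatInterval) {
    job.remove().catch((err) => console.error('Failed to remove finished job', err));
  }
});
```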
I just ran cleanup manually (after fixing the cleanup script in https://github.com/Trustroots/trustroots/pull/2410):
Going to move 891087 documents from agendaJobs to agendaJobsArchived
Source count: 891096
Target count: 125740
Total: 1016836
Fetching docs for transfer...
Processing 891087 docs...
~29.9% (266708/891087)MongoError: new file allocation failure
at Function.create (/srv/ci-versions/20210711_162229/node_modules/mongodb/lib/core/error.js:57:12)
at toError (/srv/ci-versions/20210711_162229/node_modules/mongodb/lib/utils.js:130:22)
at /srv/ci-versions/20210711_162229/node_modules/mongodb/lib/operations/common_functions.js:258:39
at /srv/ci-versions/20210711_162229/node_modules/mongodb/lib/core/connection/pool.js:405:18
at processTicksAndRejections (internal/process/task_queues.js:77:11) {
driver: true,
index: 0,
code: 12520
}
Cursor closed.
Source count: 623648
Target count: 393193
Total: 1016841
✨ Done 267454/891087 documents.
Closing db...
[ ] You can see that it failed halfway through, maybe due to memory or disk space issues. Will look into that and continue the cleanup.
[ ] Ensuring the database is fast to query for Agenda would be good too, by checking that all the indices are set up and functional.
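For that index check, something along these lines in the mongo shell should show whether Agenda's lookup index exists (the exact field set below is an assumption; the right definition depends on the Agenda version in use):

```js
// List existing indexes on the jobs collection
db.agendaJobs.getIndexes();

// Recreate the index Agenda normally uses to find and lock the next job,
// if it turns out to be missing (field set is an assumption, verify against
// the installed Agenda version):
db.agendaJobs.createIndex(
  { name: 1, nextRunAt: 1, priority: -1, lockedAt: 1, disabled: 1 },
  { name: 'findAndLockNextJobIndex' }
);
```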
Finally, you can't improve what you don't measure. We should be able to better monitor disruptions and unexpected patterns in messages and their deliverability, and get notified about issues sooner. We have some alerts in grafana.trustroots.org but they're a bit rudimentary.
Better monitoring of each step of the pipeline (messages created, notifications triggered, notifications jobs processed, emails sent, emails delivered, emails received, emails opened) would also help us detect if some specific part of the whole process is a bottleneck.
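To make that concrete, a rough sketch of per-stage counters (the stat names and the recordStat helper here are hypothetical stand-ins; in practice this would go through whatever stats service already feeds Grafana):

```js
// In-memory stand-in for the real stats backend, just to show the idea.
const counters = Object.create(null);

function recordStat(name) {
  counters[name] = (counters[name] || 0) + 1;
}

// Wrap one pipeline stage so both successes and failures are counted;
// a gap between "triggered" and "emailSent" in Grafana then points at this stage.
async function notifyRecipient(sendEmail, message) {
  recordStat('messages.notifications.triggered');
  try {
    await sendEmail(message);
    recordStat('messages.notifications.emailSent');
  } catch (err) {
    recordStat('messages.notifications.emailFailed');
    throw err;
  }
}
```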
Treating the problem as an outright bug is possibly a red herring because the problem is intermittent, but it's still good to comb through the code paths and look for potential issues.
Looking closer into stats, I think this issue indeed is around spam & deliverability. I just made a new graph that highlights this well:
These are from our email gateway, not from our server. We get back "bounced", "delayed", "success", etc. messages from email providers such as Gmail and Hotmail, and log those in Grafana.
There could of course still be issues on the server side, but I would put effort into spam mitigation next and try to recover our reputation.
Next steps, in my opinion:
message-throttle
stat)
Some other ideas, if the above was not enough 😉:
Thanks for the ideas!
Adding that signup spam is actually another significant problem in addition to message spam.
Anatomy of the latest spam attack was:
So there are two aspects to work on separately (solutions might look the same of course):
CAPTCHAs for sending first message in a conversation
Yup. Captchas could be most effective at signup. There are mostly invisible ones these days so the experience doesn't necessarily need to degrade for everyone.
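For example, reCAPTCHA v3 (one of the "invisible" options, used here purely as an illustration; other providers work similarly) is verified server-side with a single request. The env variable name and score threshold below are assumptions:

```js
const fetch = require('node-fetch');

// Verify a captcha token sent along with the signup form.
async function verifyCaptcha(token, remoteIp) {
  const res = await fetch('https://www.google.com/recaptcha/api/siteverify', {
    method: 'POST',
    body: new URLSearchParams({
      secret: process.env.RECAPTCHA_SECRET, // assumed config value
      response: token,
      remoteip: remoteIp
    })
  });
  const data = await res.json();
  // v3 returns a 0..1 score; reject obvious bots, let humans through without any challenge.
  return data.success && data.score >= 0.5;
}
```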
First just want to say sorry about spam problems - wish we lived in a world where that didn't happen!
~I see the messaging limit was merged - has that been deployed and had noticeable effect on the grafana chart that's been screenshotted upthread?~ I was able to check this myself.
One other suggestion:
Just realized the grafana is open access :D
One other thing that looks suspect: the number of long messages relative to short messages shot up since May. I'd guess that spam might be more likely to be stuffed with all kinds of garbage links and such, but it's just a guess.
Good finding with the message length graph 👍
I deployed the limit today, as well as basic spam detection, which doesn't block anything but can help us see whether something like that would even be useful.
I added a stat counter for when the message throttle is hit; I'll follow it over the next couple of days.
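For reference, the kind of non-blocking heuristic that could feed a counter like that might be as simple as counting links and message length; the thresholds below are guesses for illustration, not what was actually deployed:

```js
// Flag suspiciously link-heavy or very long messages for the stats only;
// nothing is blocked based on this.
function looksLikeSpam(messageText) {
  const linkCount = (messageText.match(/https?:\/\//g) || []).length;
  return linkCount >= 5 || messageText.length > 10000;
}
```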
the signup route could be rate-limited pretty aggressively to prevent spam account creation in the first place... if the invite itself is spam, that's the beginning of reputation harm and needs to be slowed/fixed
All our API routes are already rate-limited at the Nginx level, but yeah the signup route could be hardened even further.
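One way to harden it further at the application level would be a dedicated limiter on just that route; a sketch with express-rate-limit, where the route path and the numbers are assumptions:

```js
const express = require('express');
const rateLimit = require('express-rate-limit');

const app = express();
app.set('trust proxy', 1); // so req.ip reflects the client behind Nginx

// Much stricter per-IP limit only for signup; everything else keeps relying
// on the existing Nginx-level limits.
const signupLimiter = rateLimit({
  windowMs: 60 * 60 * 1000, // 1 hour window
  max: 5                    // at most 5 signup attempts per IP per hour
});

app.post('/api/auth/signup', signupLimiter, (req, res) => {
  // The real signup controller would be mounted here.
  res.status(201).json({ message: 'signed up' });
});
```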
@simison in the past month and a half I've been moving around asking for hospitality and writing to people; people wrote to me, I added contacts and wrote references. I've always been notified by email within a few minutes. Of course it doesn't mean that we won't experience the issue anymore, but right now the notification system looks reliable.
I get a lot of complaints/questions about not receiving notifications or replies from members because of this problem. It seems to be getting urgent because it annoys people, who may then look for other platforms that don't have this problem...
Curious, for more than a year I had many experiences of this issue, let's say with a ratio of 1/2. In these past three months I've been travelling a lot, received requests and messages, wrote some messages and experiences, and never had a delay. True that my experience is only my experience, but it's still curious 🤔
Still too many bounces (weekly 2K bounces vs 7.5K successes, link) but it's a bit better. Spam still needs more attention, methinks.
I also got the question of how many times members receive a notification email, since some platforms apparently send multiple. Maybe if we increase this to twice, one of the two always gets through?
Sending a notification two times is more annoying for the recipient than sending one notification and four spam messages. About delayed notifications via email: in July (but it happened at other times too) I often got messages with long delays. What is strange is that the whole chain of MX servers leading to the mailbox I use for TR has NO system-wide spam filtering (except a threshold limit of messages per minute from the same host to the same mailbox [including the catch-all one, so in case of a dictionary attack it is immediately mitigated], but at worst that can delay delivery by three hours, not many days, and only if there really are many messages, not when there is a single message!). In fact the messages looked like they were sent [at least the Received: headers said so] by the primary server already with a delay of some days; one was even tagged as spam at the user level [spam is not lost, a list of blocked messages is sent weekly so I can check if there is something interesting] because the Date: header was more than 10 days off from the current date! So the problem was [still is?] not [only] about server reputation.
Got this comment through support: "I've also checked the Spam folder and there was nothing there. So, there's really a problem with your outgoing messages. I do get the Newsletter, though." Don't know if it's helpful to know the newsletter does come through...
An idea through support which could help out with this:
Workaway. Their mail server is generating unique mail addresses specifically for each user-to-user conversation: you can answer using the web interface, but you can *also* answer directly to your mail notification, since the reply-to field is specific to the convo (i.e. something like [17732ba88cf1@](mailto:17732ba88cf1@)... instead of [noreply@](mailto:noreply@)...), and allows the server to deliver the message to the right user (who, in turn, will receive an email notification *they* can answer to). I'd love to find this feature on friendly hospex sites: messages are still exchanged over the platform, but people can just answer from their mail clients if the connection situation is making it hard to use the web interface. Authenticity is still reasonably well ensured, since the mail needs to originate from the same email account that received the message notification, the one associated with the user account on the platform.
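A rough sketch of how the outgoing side of that idea could look with nodemailer; the token scheme, domain and helper names are assumptions, and the inbound side that maps replies back to conversations isn't shown:

```js
const crypto = require('crypto');
const nodemailer = require('nodemailer');

const transporter = nodemailer.createTransport({
  host: 'smtp.example.org', // placeholder SMTP settings
  port: 587
});

// Derive an opaque per-conversation token so reply addresses can't be guessed.
// The token-to-conversation mapping would have to be stored or derivable server-side.
function replyAddressFor(conversationId) {
  const token = crypto
    .createHash('sha256')
    .update(`${conversationId}:${process.env.REPLY_ADDRESS_SECRET}`)
    .digest('hex')
    .slice(0, 12);
  return `reply+${token}@trustroots.org`;
}

async function sendMessageNotification(conversationId, to, text) {
  await transporter.sendMail({
    from: 'Trustroots <noreply@trustroots.org>',
    to,
    // Replies go to the per-conversation address instead of noreply@
    replyTo: replyAddressFor(conversationId),
    subject: 'You have a new message on Trustroots',
    text
  });
}
```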
@simison I never ever received any email or browser notification for any of the messages or requests I received. I did receive emails when I reset my password though, so the emailing system is somehow working. Does the "Community newsletter" flag need to be checked in order to receive emails about new messages and host requests? I checked it but I still do not receive emails. Is there anything we can do to solve the issue?
Describe the bug
Sometimes when a user writes to another one, the recipient isn't notified. No push notification on the browser/Android app, no email, nothing.
To Reproduce
It's not easy to reproduce, but I can say that it happened to me many times and also to some of the users I contacted (or at least they said so). The first one I can remember is a message I sent in February 2021 to a couple that had hosted me a few months earlier; they answered three months later with a "sorry, I didn't see the message". More recently I had a conversation with a user in Spain. I asked them for hospitality, they answered... then they didn't see my messages until a few days later, when it was too late. I explained that there is this bug, and during this conversation I could see that sometimes I received the notifications (push and email) and sometimes I didn't... and on their side they told me the behaviour was the same. As if at one moment notifications worked for both of us, and after a while they didn't for either. It could have been a coincidence. A few weeks ago a couple of users wrote to me. I saw the message of one of the two more than 24 hours later, only because I'm aware of the issue and log in daily/weekly to check messages.
Expected behaviour
Users should be notified with push notifications and/or email in a reasonable amount of time (I would say a few seconds for push, a few minutes or at most a few hours for emails).
Additional context
It has been reported many times by many people. More than a problem of emails not being sent (in that case the bug could be around https://github.com/Trustroots/trustroots/blob/master/modules/core/server/services/email.server.service.js#L32), I would say it's a problem of notifications not being sent (https://github.com/Trustroots/trustroots/blob/master/modules/messages/server/jobs/message-unread.server.job.js#L266). I tried to do a test on my machine with two fake users, and I couldn't reproduce the issue, but I noticed a
DeprecationWarning: collection.update is deprecated. Use updateOne, updateMany or bulkWrite instead
I would say that the first thing to do could be solving that warning and any others, then removing dead code (Facebook notifications? I never received any notifications on Facebook in 18 months).
Links to discussion on Slack (or Discourse):
https://trustroots.slack.com/archives/C0A3Q15SS/p1614874427039000
https://trustroots.slack.com/archives/C0A3Q15SS/p1620804159014000
https://trustroots.slack.com/archives/C0A3Q15SS/p1626683054005800
https://trustroots.slack.com/archives/C0A3Q15SS/p1630343535000300
https://trustroots.slack.com/archives/C0A3Q15SS/p1632812245003400
Note
When a user writes to others and doesn't receive answers, they could think that this network is full of phantom profiles and stop using it. If you don't receive an answer too often, or receive it too late, you stop using the platform for last-minute requests. People not receiving notifications probably don't log in as often as they would without the bug; they could forget about having a profile and stop logging in. People that don't log in for a few months disappear from the map because of the automatic filter "Online in the past 6 months" => fewer available members => fewer people using the platform...