Thanks for summarizing and collecting links to convos! You are right, I think message deliverability is the most important thing to solve at the moment. It's just that nobody has had the time or the needed knowledge.
It's intermittent, e.g. email notifications worked fine for me this week, but it's clear from feedback that they're not performing well. Locally I haven't been able to reproduce it despite a few attempts, and the error logs don't offer any hints.
I tried to do a test on my machine with two fake users, and I couldn't reproduce the issue but I noticed a DeprecationWarning: collection.update is deprecated. Use updateOne, updateMany or bulkWrite instead.
These are MongoDB API warnings asking us to upgrade the code, but they won't cause issues until the libraries are updated to more recent versions. Good to update those methods of course. :-)
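As a hedged sketch of what that migration looks like with the Node.js MongoDB driver (the collection name, query and field below are illustrative, not taken from the Trustroots codebase):

```js
const { MongoClient } = require('mongodb');

async function markMessageNotified(messageId) {
  const client = await MongoClient.connect('mongodb://localhost/trustroots-dev');
  try {
    const messages = client.db().collection('messages');

    // Deprecated form that triggers the warning:
    // await messages.update({ _id: messageId }, { $set: { notified: true } });

    // Replacement: explicit about touching a single document.
    await messages.updateOne({ _id: messageId }, { $set: { notified: true } });
  } finally {
    await client.close();
  }
}
```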
I would say that the first thing to do could be solving that warning and any others, then removing dead code (Facebook notifications? I never received any notifications on Facebook in 18 months).
Likely unrelated to the deliverability issues as well, but good to clean out indeed! (Doing so in https://github.com/Trustroots/trustroots/pull/2411). FB doesn't support this type of notification anymore, if I remember correctly.
We had at least one or two big spam waves recently. While we did deal with them, we should try to see if there is still spam being sent from TR profiles, and add better protections/monitoring.
Spam gets flagged at different mail services (Gmail, Hotmail, etc.) and receiving anything else from the same IP/domain becomes much harder and slower.
It pulls our domain's/IP's "reputation" down.
Each sent message and notification generates a "job" in the database, and cleaning these up is currently manual. We should add some code to remove each job after it succeeds, or some daily/weekly cleanup.
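A minimal sketch of the "remove after success" option, assuming a recent Agenda version where job.remove() returns a promise (the connection string is a placeholder; in Trustroots the existing agenda instance from the worker would be used):

```js
const Agenda = require('agenda');

const agenda = new Agenda({ db: { address: 'mongodb://localhost/trustroots-dev' } });

// Delete one-off jobs as soon as they finish successfully, so agendaJobs
// doesn't keep growing until someone runs a manual cleanup.
agenda.on('success', (job) => {
  // Keep recurring jobs (they have a repeat interval); only remove one-off jobs.
  if (!job.attrs.repeatInterval) {
    job.remove().catch((err) => console.error('Failed to remove finished job', err));
  }
});
```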
I just ran cleanup manually (after fixing the cleanup script in https://github.com/Trustroots/trustroots/pull/2410):
Going to move 891087 documents from agendaJobs to agendaJobsArchived
Source count: 891096
Target count: 125740
Total: 1016836
Fetching docs for transfer...
Processing 891087 docs...
~29.9% (266708/891087)MongoError: new file allocation failure
at Function.create (/srv/ci-versions/20210711_162229/node_modules/mongodb/lib/core/error.js:57:12)
at toError (/srv/ci-versions/20210711_162229/node_modules/mongodb/lib/utils.js:130:22)
at /srv/ci-versions/20210711_162229/node_modules/mongodb/lib/operations/common_functions.js:258:39
at /srv/ci-versions/20210711_162229/node_modules/mongodb/lib/core/connection/pool.js:405:18
at processTicksAndRejections (internal/process/task_queues.js:77:11) {
driver: true,
index: 0,
code: 12520
}
Cursor closed.
Source count: 623648
Target count: 393193
Total: 1016841
✨ Done 267454/891087 documents.
Closing db...
[ ] You can see that it failed halfway through, maybe due to memory or disk space issues. Will look into that and continue the cleanup.
[ ] Ensuring the database is fast to query for Agenda would be good too, by checking that all the indices are set up and functional.
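For that index check, something along these lines in the mongo shell should show whether Agenda's lookup index exists (the exact field set below is an assumption; the right definition depends on the Agenda version in use):

```js
// List existing indexes on the jobs collection
db.agendaJobs.getIndexes();

// Recreate the index Agenda normally uses to find and lock the next job,
// if it turns out to be missing (field set is an assumption, verify against
// the installed Agenda version):
db.agendaJobs.createIndex(
  { name: 1, nextRunAt: 1, priority: -1, lockedAt: 1, disabled: 1 },
  { name: 'findAndLockNextJobIndex' }
);
```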
Finally, you can't improve what you don't measure. We should be able to better monitor disruptions and unexpected patterns in messages and their deliverability, and get notified about issues sooner. We have some alerts in grafana.trustroots.org but they're a bit rudimentary.
Better monitoring of each step of the pipeline (messages created, notifications triggered, notifications jobs processed, emails sent, emails delivered, emails received, emails opened) would also help us detect if some specific part of the whole process is a bottleneck.
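To make that concrete, a rough sketch of per-stage counters (the stat names and the recordStat helper here are hypothetical stand-ins; in practice this would go through whatever stats service already feeds Grafana):

```js
// In-memory stand-in for the real stats backend, just to show the idea.
const counters = Object.create(null);

function recordStat(name) {
  counters[name] = (counters[name] || 0) + 1;
}

// Wrap one pipeline stage so both successes and failures are counted;
// a gap between "triggered" and "emailSent" in Grafana then points at this stage.
async function notifyRecipient(sendEmail, message) {
  recordStat('messages.notifications.triggered');
  try {
    await sendEmail(message);
    recordStat('messages.notifications.emailSent');
  } catch (err) {
    recordStat('messages.notifications.emailFailed');
    throw err;
  }
}
```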
Treating the problem as an outright bug is possibly a red herring because the problem is intermittent, but it's still good to comb through the code paths and look for potential issues.
Looking closer into stats, I think this issue indeed is around spam & deliverability. I just made a new graph that highlights this well:
These are from our email gateway, not from our server. We get back "bounced", "delayed", "success", etc. messages from email providers such as Gmail and Hotmail, and log those in Grafana.
There could of course still be issues on the server side, but I would put effort into spam mitigation next and try to recover our reputation.
Next steps, in my opinion:
message-throttle
stat)
Some other ideas, if the above was not enough 😉:
Thanks for the ideas!
Adding that signup spam is actually another significant problem in addition to message spam.
Anatomy of the latest spam attack was:
So there are two aspects to work on separately (solutions might look the same of course):
CAPTCHAs for sending first message in a conversation
Yup. Captchas could be most effective at signup. There are mostly invisible ones these days so the experience doesn't necessarily need to degrade for everyone.
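For example, reCAPTCHA v3 (one of the "invisible" options, used here purely as an illustration; other providers work similarly) is verified server-side with a single request. The env variable name and score threshold below are assumptions:

```js
const fetch = require('node-fetch');

// Verify a captcha token sent along with the signup form.
async function verifyCaptcha(token, remoteIp) {
  const res = await fetch('https://www.google.com/recaptcha/api/siteverify', {
    method: 'POST',
    body: new URLSearchParams({
      secret: process.env.RECAPTCHA_SECRET, // assumed config value
      response: token,
      remoteip: remoteIp
    })
  });
  const data = await res.json();
  // v3 returns a 0..1 score; reject obvious bots, let humans through without any challenge.
  return data.success && data.score >= 0.5;
}
```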
First just want to say sorry about spam problems - wish we lived in a world where that didn't happen!
~I see the messaging limit was merged - has that been deployed and had noticeable effect on the grafana chart that's been screenshotted upthread?~ I was able to check this myself.
One other suggestion:
Just realized the grafana is open access :D
One other thing that looks suspect: the number of long messages relative to short messages shot up since May. I'd guess that spam might be more likely to be stuffed with all kinds of garbage links and such, but it's just a guess.
Good finding with the message length graph 👍
I deployed the limit today, as well as basic spam detection, which doesn't block anything but can help us see whether something like that would even be useful.
I added a stat counter for when the message throttle is hit; I'll follow it over the next couple of days.
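For reference, the kind of non-blocking heuristic that could feed a counter like that might be as simple as counting links and message length; the thresholds below are guesses for illustration, not what was actually deployed:

```js
// Flag suspiciously link-heavy or very long messages for the stats only;
// nothing is blocked based on this.
function looksLikeSpam(messageText) {
  const linkCount = (messageText.match(/https?:\/\//g) || []).length;
  return linkCount >= 5 || messageText.length > 10000;
}
```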
the signup route could be rate-limited pretty aggressively to prevent spam account creation in the first place... if the invite itself is spam, that's the beginning of reputation harm and needs to be slowed/fixed
All our API routes are already rate-limited at the Nginx level, but yeah the signup route could be hardened even further.
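One way to harden it further at the application level would be a dedicated limiter on just that route; a sketch with express-rate-limit, where the route path and the numbers are assumptions:

```js
const express = require('express');
const rateLimit = require('express-rate-limit');

const app = express();
app.set('trust proxy', 1); // so req.ip reflects the client behind Nginx

// Much stricter per-IP limit only for signup; everything else keeps relying
// on the existing Nginx-level limits.
const signupLimiter = rateLimit({
  windowMs: 60 * 60 * 1000, // 1 hour window
  max: 5                    // at most 5 signup attempts per IP per hour
});

app.post('/api/auth/signup', signupLimiter, (req, res) => {
  // The real signup controller would be mounted here.
  res.status(201).json({ message: 'signed up' });
});
```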
@simison in the past month and a half I've been moving around asking for hospitality and writing to people; people wrote to me, I added contacts and wrote references. I've always been notified by email within a few minutes. Of course it doesn't mean that we won't experience the issue anymore, but right now the notification system looks reliable.
I get a lot of complaints/questions about not receiving notifications or replies from members because of this problem. It seems to be getting urgent because it annoys people, who may then look for other platforms that don't have this problem...
Curious, for more than a year I had many experiences of this issue, let's say with a ratio of 1/2. In these past three months I've been travelling a lot, received requests and messages, wrote some messages and experiences, and never had a delay. True that my experience is only my experience, but it's still curious 🤔
Still too many bounces (weekly 2K bounces vs 7.5K successes, link) but it's a bit better. Spam still needs more attention, methinks.
I also got the question of how many times members receive a notification email, since some platforms apparently send multiple. Maybe if we increase this to twice, one of the two always gets through?
Sending a notification two times is more annoying for the recipient than sending one notification and four spam messages. About delayed notifications via email: in July (but it happened at other times too) I often got messages with long delays. What is strange is that the whole chain of MX servers leading to the mailbox I use for TR has NO system-wide spam filtering (except a threshold limit of messages per minute from the same host to the same mailbox [including the catch-all one, so in case of a dictionary attack it is immediately mitigated], but at worst that can delay delivery by three hours, not many days, and only if there really are many messages, not when there is a single message!). In fact the messages looked like they were sent [at least the Received: headers said so] by the primary server already with a delay of some days; one was even tagged as spam at the user level [spam is not lost, a list of blocked messages is sent weekly so I can check if there is something interesting] because the Date: header was more than 10 days off from the current date! So the problem was [still is?] not [only] about server reputation.
Got this comment through support: "I've also checked the Spam folder and there was nothing there. So, there's really a problem with your outgoing messages. I do get the Newsletter, though." Don't know if it's helpful to know the newsletter does come through...
An idea through support which could help out with this:
Workaway. Their mail server is generating unique mail addresses specifically for each user-to-user conversation: you can answer using the web interface, but you can *also* answer directly to your mail notification, since the reply-to field is specific to the convo (i.e. something like [17732ba88cf1@](mailto:17732ba88cf1@)... instead of [noreply@](mailto:noreply@)...), and allows the server to deliver the message to the right user (who, in turn, will receive an email notification *they* can answer to). I'd love to find this feature on friendly hospex sites: messages are still exchanged over the platform, but people can just answer from their mail clients if the connection situation is making it hard to use the web interface. Authenticity is still reasonably well ensured, since the mail needs to originate from the same email account that received the message notification, the one associated with the user account on the platform.
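A rough sketch of how the outgoing side of that idea could look with nodemailer; the token scheme, domain and helper names are assumptions, and the inbound side that maps replies back to conversations isn't shown:

```js
const crypto = require('crypto');
const nodemailer = require('nodemailer');

const transporter = nodemailer.createTransport({
  host: 'smtp.example.org', // placeholder SMTP settings
  port: 587
});

// Derive an opaque per-conversation token so reply addresses can't be guessed.
// The token-to-conversation mapping would have to be stored or derivable server-side.
function replyAddressFor(conversationId) {
  const token = crypto
    .createHash('sha256')
    .update(`${conversationId}:${process.env.REPLY_ADDRESS_SECRET}`)
    .digest('hex')
    .slice(0, 12);
  return `reply+${token}@trustroots.org`;
}

async function sendMessageNotification(conversationId, to, text) {
  await transporter.sendMail({
    from: 'Trustroots <noreply@trustroots.org>',
    to,
    // Replies go to the per-conversation address instead of noreply@
    replyTo: replyAddressFor(conversationId),
    subject: 'You have a new message on Trustroots',
    text
  });
}
```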
@simison I never ever received any email or browser notification for any of the messages or requests I received. I did receive emails when I reset my password though, so the emailing system is somehow working. Does the "Community newsletter" flag need to be checked in order to receive emails about new messages and host requests? I checked it but I still do not receive emails. Is there anything we can do to solve the issue?
Describe the bug
Sometimes when a user writes to another one, the recipient isn't notified. No push notification on the browser/Android app, no email, nothing.
To Reproduce
It's not easy to reproduce, but I can say that it happened to me many times and also to some of the users I contacted (or at least they said so). The first one I can remember is a message I sent in February 2021 to a couple that had hosted me a few months earlier; they answered three months later with a "sorry, I didn't see the message". More recently I had a conversation with a user in Spain. I asked them for hospitality, they answered... then they didn't see my messages until a few days later, when it was too late. I explained that there is this bug, and during this conversation I could see that sometimes I received the notifications (push and email) and sometimes I didn't... and on their side they told me the behaviour was the same. As if at one moment notifications worked for both of us, and after a while they didn't for either. It could have been a coincidence. A few weeks ago a couple of users wrote to me. I saw the message of one of the two more than 24 hours later, only because I'm aware of the issue and log in daily/weekly to check messages.
Expected behaviour
Users should be notified with push notifications and/or email in a reasonable amount of time (I would say a few seconds for push, a few minutes or at most a few hours for emails).
Additional context
It has been reported many times by many people. More than a problem of emails not being sent (in that case the bug could be around https://github.com/Trustroots/trustroots/blob/master/modules/core/server/services/email.server.service.js#L32), I would say it's a problem of notifications not being sent (https://github.com/Trustroots/trustroots/blob/master/modules/messages/server/jobs/message-unread.server.job.js#L266). I tried to do a test on my machine with two fake users, and I couldn't reproduce the issue, but I noticed a
DeprecationWarning: collection.update is deprecated. Use updateOne, updateMany or bulkWrite instead
I would say that the first thing to do could be solving that warning and any others, then removing dead code (Facebook notifications? I never received any notifications on Facebook in 18 months).
Links to discussion on Slack (or Discourse):
https://trustroots.slack.com/archives/C0A3Q15SS/p1614874427039000
https://trustroots.slack.com/archives/C0A3Q15SS/p1620804159014000
https://trustroots.slack.com/archives/C0A3Q15SS/p1626683054005800
https://trustroots.slack.com/archives/C0A3Q15SS/p1630343535000300
https://trustroots.slack.com/archives/C0A3Q15SS/p1632812245003400
Note
When a user writes to others and doesn't receive answers, they could think that this network is full of phantom profiles and stop using it. If you don't receive an answer too often, or receive it too late, you stop using the platform for last-minute requests. People not receiving notifications probably don't log in as often as they would without the bug; they could forget about having a profile and stop logging in. People that don't log in for a few months disappear from the map because of the automatic filter "Online in the past 6 months" => fewer available members => fewer people using the platform...