Auto Response fail for first message

lizhutter commented 3 years ago

When we send the first message in the chat plugin, an auto response does not get through because the conversation is not there yet.


/messages.send returns 404


After waiting a few seconds, the next message works perfectly sending the auto response.

The webhook payload contains a conversation_id, like so:

{
  channel_id: '4bca1400-cebe-5f90-980a-60c8c3f3028f',
  conversation_id: '90e77bf7-3f0e-5739-9dfe-6e367617d80b',
  message: {
    content: { text: 'kgjkjg' },
    delivery_state: 'delivered',
    id: '890fb216-0c94-4bd5-8a0b-977d61bd5548',
    sender_type: 'source_contact',
    sent_at: '2021-02-22T09:36:44.666Z',
    source: 'chat_plugin'
  }
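
For anyone writing an autoresponder against this payload, here is a rough TypeScript shape for it. The field names are copied from the log above; the types, the `WebhookMessageEvent` name, and the `shouldAutoRespond` helper are assumptions made for this sketch, not an official Airy type.

```typescript
// Field names come from the logged payload; the types are guesses.
interface WebhookMessageEvent {
  channel_id: string;
  conversation_id: string;
  message: {
    content: { text: string };
    delivery_state: string;
    id: string;
    sender_type: string;
    sent_at: string; // ISO 8601 timestamp, e.g. '2021-02-22T09:36:44.666Z'
    source: string;
  };
}

/**
 * Hypothetical helper: only answer messages sent by the contact,
 * so the autoresponder never replies to its own messages.
 */
function shouldAutoRespond(event: WebhookMessageEvent): boolean {
  return event.message.sender_type === "source_contact";
}
```

The `conversation_id` of an event that passes this check is the one the autoresponder then posts to /messages.send.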

lucapette commented 3 years ago

From the look of it, this feels like the usual "it's a distributed system, so conversations may or may not be there yet, and that is to be expected". Clients must implement some retry logic to mitigate the problem. I say it "looks that way" because there's not enough information to infer any pattern here. As it is, I will close the issue. Please feel free to reopen it if you can provide more information (which should support the idea that we can improve the situation. On that note, that's also a valid reason to close the ticket, as it's not clear what request is being made or what the expected behaviour is).

steffh commented 3 years ago

@lucapette I can totally relate and understand from your perspective.

It's just a bit odd that we send out a webhook with a conversation_id and then when someone sends back a message to such conversation we reply with 404 - conversation doesn't exist.

This appears to a developer as if our system was broken and is not an edge case, but reproducible.

To rephrase the issue - would it be possible to hold off on sending out the webhook until we are sure the conversation has been created? Or speed up the creation of conversations somehow by reducing the commit frequency? It must be something like 2-3 seconds at the moment, so the webhook + sending back an automated response is much faster.

Just saying the client has to solve this doesn't fully do it justice, as autoresponders are not an uncommon use case.

lucapette commented 3 years ago

> @lucapette I can totally relate and understand from your perspective.
>
> It's just a bit odd that we send out a webhook with a conversation_id and then when someone sends back a message to such conversation we reply with 404 - conversation doesn't exist.

I get that it looks that way because we return 404. It's a bit of a catch-22 problem, because the system can't distinguish between "this conversation doesn't exist" and "it's not there yet".

It's the nature of the system itself that makes it work like that. The webhook, websocket, and api components all consume data from the same Kafka topics. They each build their own version of reality and provide their features. From a strictly architectural perspective, we've been discussing how to "merge" webhook and websocket, as conceptually they provide the exact same feature. The only difference is the means of communication with the consuming clients. But even if we improve the situation here and refactor the system so that we effectively use the same "events" at the Kafka level (to be more specific, the same topology), we'd still have two different processes running in two different pods (probably on two different nodes of the same Kubernetes cluster), which means we would end up in a somewhat similar situation.
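
As a toy illustration of the paragraph above (this is not Airy code; the `Consumer` class, the shared in-memory log, and the 200/404 mapping are all assumptions made for the sketch), two consumers reading the same log at their own pace can disagree about whether a conversation exists:

```typescript
// Toy model: one shared append-only log, two independent consumers.
type LogEvent = { conversationId: string; text: string };

class Consumer {
  private offset = 0;
  private readonly log: readonly LogEvent[];
  readonly seen = new Set<string>();

  constructor(log: readonly LogEvent[]) {
    this.log = log;
  }

  /** Processes up to `max` unread events and remembers their conversation ids. */
  poll(max: number): LogEvent[] {
    const batch = this.log.slice(this.offset, this.offset + max);
    this.offset += batch.length;
    for (const event of batch) this.seen.add(event.conversationId);
    return batch;
  }
}

const log: LogEvent[] = [];
const webhook = new Consumer(log);
const api = new Consumer(log);

log.push({ conversationId: "c1", text: "hello" });

// The lightweight webhook consumer has already processed the event...
const delivered = webhook.poll(10);

// ...while the api consumer has not polled yet, so a lookup there misses.
const statusBefore = api.seen.has("c1") ? 200 : 404;

// A moment later the api consumer catches up and the same lookup succeeds.
api.poll(10);
const statusAfter = api.seen.has("c1") ? 200 : 404;
```

Neither consumer is wrong here; each one answers from the part of the log it has read so far.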

> This appears to a developer as if our system was broken and is not an edge case, but reproducible.

If it's reproducible, can you provide the steps? That would help! The comments provided lead to an interesting conversation, but they're not very actionable (apart from the fact that they allow me to write down how parts of our system work. I'm sure I'll end up digging this issue up once I go back to the architecture docs).

> To rephrase the issue - would it be possible to hold off on sending out the webhook until we are sure the conversation has been created? Or speed up the creation of conversations somehow by reducing the commit frequency? It must be something like 2-3 seconds at the moment, so the webhook + sending back an automated response is much faster.

This is the obvious confusion. The webhook can't send out an event for something that doesn't exist yet, so what's happening here is that the webhook is faster than the system responsible for sending messages. Furthermore, the commit frequency right now is a global configuration and affects all the components equally. We've been improving the airy.yml config file... but there's a long way to go to offer the right granularity. We should probably work on that after we're done with the airy create milestone.

Waiting for things is very, very expensive (both in terms of increased architectural complexity and long-term resource usage in a system of this nature). So I would exclude that upfront as a solution for the problem.

There are a few options to mitigate the problem that we should consider:

- We can "move" /messages.send out of the api component. The fact it's there is the main reason why this topology is slower than the webhook one (which is really lightweight, so it's always going to be fast... it will probably get even faster once we find the right way to "merge" it with the websocket component). Tradeoffs:
  - it would probably lead to a slightly larger runtime, as we'd introduce a new pod
  - or we can put it in the admin component, even though that feels strange from a code organisation perspective
- We can change the way message sending works by not checking for the conversation id and requiring the channel id in the api. It would also mitigate the problem, but I'm unsure about the tradeoffs:
  - the api for sending messages would be much odder (why would we ask them to provide a channel id?)
  - of course, if we don't check for the conversation, we risk "pending messages" staying in that state in the system indefinitely (as we'd open the door to sending non-existing conversation ids)

Between the two, I think I would vote for the former (I like the tradeoffs better because we would not change the public api), but I'm curious to hear @chrismatix's and @paulodiniz's opinions on this as well.

> Just saying the client has to solve this doesn't fully do it justice, as autoresponders are not an uncommon use case.

It's not "just telling". It's really a necessary (and, dare I say, pretty basic) need on the client side when dealing with a system of this nature. The @airyhq/codeowners-frontend team has already discussed the idea of building a retry mechanism inside the typescript library. As there are no docs yet, the client is "not official" so to speak, so short term we introduced a little retry mechanism in the UI (which is after all "just a client" from the system's perspective). We already plan to offer "official retries" via our own typescript client (to be fair, there is no issue about this yet... but I don't imagine us forgetting about it).

chrismatix commented 3 years ago

I agree with the system breakdown that @lucapette gave and I tend to opt for the former option and would like to add another point in its support: If we ever opt to build an Airy to Airy source (or say an email source) sending messages would be the same as creating a conversation. So a pre-existing source contact is not a necessary requirement for all sources. Although you could argue that those endpoints can be different.

Also, the drawback is not as bad as it seems: yes, API users can create dangling conversations for sources if the conversation id they provide never shows up. But then how did they get the conversation id in the first place? For the bug to occur, the id would have to be made up or wrong, and we could provide tools for cleaning up the data.

steffh commented 3 years ago

Thank you everyone for your detailed explanations. I have solved this now in the relevant custom apps by implementing a queue for sending messages and a retry strategy with exponential backoff (so we try again after 1s, 2s, 4s, 8s, 16s). From what I have seen so far, the conversation was always there by the first or second retry.
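
For reference, a minimal sketch of that retry strategy in TypeScript. It is not the code from the custom apps: `sendWithRetry`, `backoffDelays`, and the `SendFn` shape are names made up here, and treating every 404 as "conversation not there yet" is an assumption that only holds when the conversation_id came from a webhook event.

```typescript
/** One attempt at /messages.send; resolves to the HTTP status code. */
type SendFn = () => Promise<number>;

/** Delays in milliseconds: 1s, 2s, 4s, ... one entry per retry. */
function backoffDelays(retries: number, baseMs: number = 1000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    delays.push(baseMs * Math.pow(2, i));
  }
  return delays;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

/**
 * Calls `send` once, then retries with exponential backoff while the
 * response is 404. Any other status ends the loop. Returns the last status.
 * `wait` is injectable so the delays can be skipped in tests.
 */
async function sendWithRetry(
  send: SendFn,
  retries: number = 5,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<number> {
  let status = await send();
  for (const delay of backoffDelays(retries)) {
    if (status !== 404) break;
    await wait(delay);
    status = await send();
  }
  return status;
}
```

With the defaults, a conversation that never shows up costs six attempts and 31 seconds of waiting before the 404 is returned to the caller. Only 404 is retried here, so other errors surface immediately.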

I think creating official SDKs for Javascript/Typescript via NPM, as well as for iOS (via CocoaPods) and Android, that include some kind of retry mechanism makes a lot of sense, so we should do that anyway.

About the concrete second option you suggested, @lucapette: I think we should discuss this once more, because we may be jumping to the conclusion that a channel_id would always be required when in fact we might not really need it. I have a feeling this would be a larger discussion, so I would suggest writing a longer piece about this thought, and then we can fight about it. ;) So let me write it up in a separate research ticket and we can start to discuss it there.

I also think @chrismatix's thought about Airy-to-Airy conversations is really interesting in that regard, and I would agree that a pre-existing source contact is not necessarily a requirement for all sources.

What might also contribute to the conversation is looking at all the other Chat APIs / messaging platforms and how they tend to solve the problem. From my perspective, they usually split things up into several steps:

- the clients first have to create a conversation and register the participants of that new conversation (usually between 2 and x)
- only then do they receive a conversation_id (which some call channel_id, to add to the confusion, because in their world a channel is a special type of conversation)
- and then they can send messages to the conversation by specifying the sender of each message (or, depending on the capabilities of the source, create new threads within the conversation by specifying a previous message as the nexus of the side conversation)

We instead tend to ingest messages and create the conversations they belong to on the fly, so the conversation_id gets generated at a different place and might not be there before the first message gets into the system. That design can be really clever, as we don't necessarily need to keep a large list of all conversations that ever existed and can just start to work with the data as we get it from Facebook, Google, etc.

However, especially for the iOS and Android SDKs, but also for a chat plugin installed on a website behind a login, where the user is already identified and has a stable identity, we need to find a way to create and fetch a conversation for such an external user id. That is, start a conversation by creating a source contact, which could already return the conversation_id to post messages to in the response (or even accept the external_user_id as an alternative to the conversation_id and start directly with ingesting the first message).

An even more radical approach would be to say as long as there is a conversation_id, even if it doesn't exist yet and people send messages to it, we should be able to process that. Of course we couldn't "send/sync" it somewhere, but as in the example that @chrismatix made above at least for an Airy-to-Airy conversation that wouldn't be the goal, but just holding the messages and grouping them together, so another Airy user can look at them and contribute.
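
To make the "just holding the messages and grouping them together" idea a bit more concrete, here is a rough in-memory sketch. Everything in it is hypothetical (`PendingMessages`, `hold`, `release`, and `expire` do not exist in Airy), and a real version would live on top of the Kafka topics rather than in a `Map`. The `expire` method is one possible shape for the cleanup tooling mentioned earlier in the thread.

```typescript
type Pending = { text: string; queuedAt: number };

class PendingMessages {
  private readonly byConversation = new Map<string, Pending[]>();

  /** Holds a message for a conversation id that may not exist yet. */
  hold(conversationId: string, text: string, now: number): void {
    const queue = this.byConversation.get(conversationId) ?? [];
    queue.push({ text, queuedAt: now });
    this.byConversation.set(conversationId, queue);
  }

  /** Called once the conversation shows up; returns held messages in arrival order. */
  release(conversationId: string): Pending[] {
    const queue = this.byConversation.get(conversationId) ?? [];
    this.byConversation.delete(conversationId);
    return queue;
  }

  /** Drops conversations whose oldest message has waited longer than maxAgeMs. */
  expire(now: number, maxAgeMs: number): string[] {
    const dropped: string[] = [];
    this.byConversation.forEach((queue, id) => {
      const oldest = queue[0];
      if (oldest !== undefined && now - oldest.queuedAt > maxAgeMs) {
        dropped.push(id);
      }
    });
    for (const id of dropped) this.byConversation.delete(id);
    return dropped;
  }
}
```

Messages for an id that never materialises stay in the buffer only until `expire` runs, which bounds the "pending forever" risk at the cost of silently dropping them.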

A tube is also open at both ends...

That we can only ingest messages and conversational data from one end might be true for Facebook / Google conversations, but it's definitely not the case for phone numbers, email addresses, user ids, etc.

lucapette commented 3 years ago

> About the concrete second option you suggested, @lucapette: I think we should discuss this once more, because we may be jumping to the conclusion that a channel_id would always be required when in fact we might not really need it. I have a feeling this would be a larger discussion, so I would suggest writing a longer piece about this thought, and then we can fight about it. ;) So let me write it up in a separate research ticket and we can start to discuss it there.

I actually meant the exact opposite which is why I don't like the api proposal to mitigate the problem.

I will extract a new issue from this conversation early next week, so I will close this one.

airyhq / airy

Auto Response fail for first message #1052