franckc opened this issue 5 years ago
CC @tomlinton
Yep I think so, also we've had a few more reports of missed ProfilePublished events.
CC @DanielVF In case he has any brilliant ideas of what could be causing this.
Tom's theory is that the listener queries the event cache but the cache does not have the event yet, so the event is skipped. Perhaps it is a cache synchronization problem?
I'll spend some time looking into this tomorrow.
I may have seen this while working on a replacement of eventCache. It should be fixed in my newer replacement. This may also be the cause of the fromBlock > toBlock error I saw mentioned in Discord.
I think this line doesn't take into account that toBlock may not be 20k blocks forward. So once the listener is caught up, it will likely skip blocks or cause the listener to die with the above error. Restarting should fix it now, or I could put in a quick PR if you folks would like. I'm probably still a couple days or more out from finishing #1975. Basically, changing that increment fromBlock += 20000 to fromBlock = toBlock + 1 should fix this issue.
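For illustration, here's a minimal sketch of that loop shape (hypothetical names and a web3.js-style getPastEvents call; this is not the actual eventCache code):

```js
// Illustrative sketch only -- names and structure are hypothetical,
// not the real eventCache implementation.
async function fetchAllEvents(contract, fromBlock, uptoBlock) {
  const events = []
  while (fromBlock <= uptoBlock) {
    // Cap each request at 20k blocks, but never go past uptoBlock
    const toBlock = Math.min(fromBlock + 19999, uptoBlock)
    const batch = await contract.getPastEvents('allEvents', { fromBlock, toBlock })
    events.push(...batch)
    // Advance past the range we actually queried rather than a fixed
    // `fromBlock += 20000`, so fromBlock never jumps beyond toBlock + 1
    fromBlock = toBlock + 1
  }
  return events
}
```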
EDIT: Restarting may not work if whatever uses it is loading the previous state from IPFS.
Thanks @mikeshultz ! Yes I think a small PR to fix current code would be awesome since it is causing issues in production.
Now, I'm worried that the change you suggest will cause a huge number of requests to be issued to Infura. Basically one per block, which is probably not good...?
One thing I realized looking at this code is that since we are parallelizing calls to Infura, events may not end up being stored in the cache in blockchain order? https://github.com/OriginProtocol/origin/blob/master/packages/graphql/src/utils/eventCache.js#L25
Wonder if that could also cause some issues...
I'm really not familiar with the listener, or anything that actually implements eventCache right now, so I'd be guessing. The implementation of eventCache partially dictates the behavior of that change. Specifically, whether ipfsEventCache is provided in the config given to eventCache(). Assuming you want to fix any missed events, you'd probably want to ignore that IPFS-stored state and start from scratch, which would no doubt cause a lot of requests to rebuild the state.
As for order, does that matter?
Actually, I think I'm wrong and jumped the gun. It may have just been an issue from adapting this logic to what I'm building. Once fromBlock > uptoBlock, that loop will die and the function will clean up and return.
Sorry about that, I'll look a little further into this.
I still think there's a good chance it's the fromBlock/toBlock tracking, but eyeball parsing isn't getting me anywhere. Any chance there's some debug logs from eventCache? Reproducing this will probably be tricky, and maybe it's better to wait for the replacement?
One of the important things about working with Infura is that their backend nodes may not be in sync. You can make a request and find out about an event that happened in block N, then make another request which is routed to a different backend, and that other backend has no clue that block N has happened yet.
I'm not sure if we are lagging the event listener currently, but doing so by one, two, or three blocks is a quick and dirty way to reduce these to a minimum.
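For illustration, a minimal sketch of trailing the head by a couple of blocks (hypothetical names and value; the listener's actual --trailBlocks handling may differ):

```js
// Illustrative only: lag event queries a few blocks behind the head so that
// whichever Infura backend serves the follow-up calls has likely seen them.
const TRAIL_BLOCKS = 2 // hypothetical value, analogous to --trailBlocks

async function latestSafeBlock(web3) {
  const head = await web3.eth.getBlockNumber()
  return head - TRAIL_BLOCKS
}
```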
@franckc Can you get listener logs from the time that block event went through?
Hmm. I actually don't have the block number. If you guys found it, lmk; otherwise I'll try to dig it up to get logs around that time...
Couple other things to note:
Regarding possible Infura lag, you are right @DanielVF - we are currently setting the --trailBlocks option of the listener to 1 in prod. We could increase it though:
Oh, and @mikeshultz, to answer your question: I do think it is a good property for the eventCache to return events in ascending order of (blockNumber, logIndex).
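For example, a comparator that yields that ordering (illustrative sketch, not the actual eventCache code):

```js
// Sort events into ascending (blockNumber, logIndex) order before returning
// them -- web3 event objects carry both fields.
function sortEvents(events) {
  return events
    .slice()
    .sort((a, b) => a.blockNumber - b.blockNumber || a.logIndex - b.logIndex)
}
```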
This offer was accepted in block 7529560
Here are the logs (only look at textPayload):
Interestingly, it looks like the OfferAccepted event did get picked up and processed by the listener. So now it makes me think the issue is that handler_marketplace.js:_getOfferDetails (which makes the GraphQL call to fetch the offer data) returned stale data (e.g. the offer in created state rather than accepted)...
I wonder if we could be hitting an Infura node that lags when making this call? https://github.com/OriginProtocol/origin/blob/master/packages/eventsource/src/index.js#L415
I don't know if our architecture supports it, but it would be great if the listener could feed the event cache. When the event listener pulls a new event, it could feed that event to the event cache before processing the event itself. This would keep the event cache constantly up to date with the latest events that the listener has seen.
This would also reduce the number of queries to the blockchain.
[edit: I guess this would not solve the problem of an old node for the things that are non-event based.]
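A rough sketch of that idea (the method names here are hypothetical; the actual listener/eventCache wiring may look different):

```js
// Illustrative only -- addEvent and process are hypothetical method names.
// Feed each event the listener sees into the cache before running handlers,
// so the cache never lags behind what the listener has processed.
async function handleNewEvent(event, eventCache, handlers) {
  await eventCache.addEvent(event)
  for (const handler of handlers) {
    await handler.process(event)
  }
}
```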
Since the new event-cache got integrated with the listener, we haven't seen this issue. Closing for now.
This is actually still happening :(
Event ListingCreated was skipped at block 7643918. Here are the logs:
The logs in the example above show the event-cache did not get the event...
I suspect this could be caused by an Infura node lagging, e.g. we issue a query for events in block 7643918 but the Infura node we hit hasn't indexed that data yet and returns no event. Some options to mitigate this:
@mikeshultz any thoughts ?
@franckc Do you have the tx in question, by chance?
Nevermind, found it. 0x06c09466a44e17d7e5abe00b0a49a82ae3eeef2e73bba7958d6e20346bc03353
@franckc Can you get me a full-ish log dump for this timeframe?
Added PR #2152 that adds a check and some debug output. There's definitely a chance that the websocket is giving new block headers while subsequent calls could be from a node a block or more behind. All their stuff is clustered and there's really no guarantee of consistency. There's not even a guarantee this will point out the error, since the eth_blockNumber call might not be sent to the same node that gets the eth_getLogs call, but it could maybe shed some light.
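Roughly the kind of check being described, as a sketch (the real PR may differ):

```js
// Illustrative sketch: flag the case where the node answering eth_blockNumber
// appears to be behind the range we are about to request logs for.
async function warnIfNodeLagging(web3, toBlock) {
  const nodeHead = await web3.eth.getBlockNumber()
  if (nodeHead < toBlock) {
    console.warn(
      `Node reports head ${nodeHead} but we want logs up to ${toBlock}; ` +
        'we may be talking to a lagging backend'
    )
  }
}
```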
There's also a mild chance of a chain reorg causing issues with this listener, but I'm not sure how to even detect that.
The best long-term option might be to continually do sanity checks on data up to X blocks behind the leading block, but that's a non-trivial change.
If the problem is actually inconsistent reads from infura, (I'm not convinced yet), then pointing the listener at one of our own geth instances should definitely solve that problem.
This certainly introduces its own long-term infrastructure load. However, we could also run an experimental parallel listener temporarily on our node and compare results.
@mikeshultz Here are the logs. Note it is in Google cloud format but hopefully you can make sense of it... Let me know otherwise... https://gist.github.com/franckc/215c4ddae24ade00b88e9d9a45c81c04
The websocket header thing is irrelevant for the listener (it's only used with graphql). I'm thinking it's unlikely that the debug stuff I added will diagnose this issue.
...where are these logs coming from? Was just going through them to kind of trace the codepath and I'm not finding any output matching this:
"textPayload": "\u001b[32;22m[INFO] \u001b[39;1mevent-listener\u001b[0m: Querying events from 7643917 up to 7643919\n",
This is the closest but it's not a match. Is this service up to date with master?
@mikeshultz I merged #2152 to master and I'm going to deploy it to prod this morning...
As to logs, the listener running in prod is what's on the stable branch. Last night I landed a small PR that slightly changes the logging in master. See #2150.
Added PR #2154 to use the block number from the latest known event. This doesn't exactly make the from/to block logic any clearer, but should help prevent any issues from out of sync nodes if enabled.
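One possible reading of "use the block number from the latest known event" (a sketch under that assumption only, not the PR's actual implementation):

```js
// Purely illustrative -- not the PR's code. Keep a high-water mark of the
// newest block we have already seen an event in, and never trust a reported
// head below it.
let latestEventBlock = 0 // updated as events are processed

async function effectiveHead(web3) {
  const reported = await web3.eth.getBlockNumber()
  return Math.max(reported, latestEventBlock)
}
```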
Feel free to wait to merge that until after we get more info from that extra debug logging.
I'd really like to greatly simplify this logic, but all this extra logic is there for performance reasons (request minimization), which seems to be a high priority. If issues continue, maybe we can go that route and worry about perf later.
Chatted with @mikeshultz this morning. He came up with this superb plan:
Austin reported that he accepted offer ID 1-000-1103-26, but it is still showing as status Pending in the DB:
Need to investigate why the listener did not process the Accept event.
Seller: 0xb4ffecb1aa10758988b83f7eee5cb3765c133359