franckc opened this issue 5 years ago
CC @tomlinton
Yep I think so, also we've had a few more reports of missed ProfilePublished events.
CC @DanielVF In case he has any brilliant ideas of what could be causing this.
Tom's theory is that the listener queries the event cache but the cache does not have the event yet, so the event is skipped. Perhaps it is a cache synchronization problem?
I'll spend some time looking into this tomorrow.
I may have seen this while working on a replacement of eventCache. It should be fixed in my newer replacement. This may also be the cause of the fromBlock > toBlock error I saw mentioned in Discord.
I think this line doesn't take into account that toBlock may not be 20k blocks forward. So once the listener is caught up, it will likely skip blocks or cause the listener to die with the above error. Restarting should fix it now, or I could put in a quick PR if you folks would like. I'm probably still a couple days or more out from finishing #1975. Basically, changing that increment fromBlock += 20000 to fromBlock = toBlock + 1 should fix this issue.
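For illustration, here's a minimal sketch of that loop shape (hypothetical names and a web3.js-style getPastEvents call; this is not the actual eventCache code):

```js
// Illustrative sketch only -- names and structure are hypothetical,
// not the real eventCache implementation.
async function fetchAllEvents(contract, fromBlock, uptoBlock) {
  const events = []
  while (fromBlock <= uptoBlock) {
    // Cap each request at 20k blocks, but never go past uptoBlock
    const toBlock = Math.min(fromBlock + 19999, uptoBlock)
    const batch = await contract.getPastEvents('allEvents', { fromBlock, toBlock })
    events.push(...batch)
    // Advance past the range we actually queried rather than a fixed
    // `fromBlock += 20000`, so fromBlock never jumps beyond toBlock + 1
    fromBlock = toBlock + 1
  }
  return events
}
```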
EDIT: Restarting may not work if whatever uses it is loading the previous state from IPFS.
Thanks @mikeshultz ! Yes I think a small PR to fix current code would be awesome since it is causing issues in production.
Now, I'm worried that the change you suggest will cause a huge number of requests to be issued to Infura. Basically one per block, which is probably not good...?
One thing I realized looking at this code is that since we are parallelizing calls to Infura, events may not end up being stored in the cache in blockchain order? https://github.com/OriginProtocol/origin/blob/master/packages/graphql/src/utils/eventCache.js#L25
Wonder if that could also cause some issues...
I'm really not familiar with the listener, or anything that actually implements eventCache right now, so I'd be guessing. The implementation of eventCache partially dictates the behavior of that change. Specifically, whether ipfsEventCache is provided in the config given to eventCache(). Assuming you want to fix any missed events, you'd probably want to ignore that IPFS-stored state and start from scratch, which would no doubt cause a lot of requests to rebuild the state.
As for order, does that matter?
Actually, I think I'm wrong and jumped the gun. It may have just been an issue from adapting this logic to what I'm building. Once fromBlock > uptoBlock, that loop will die and the function will clean up and return.
Sorry about that, I'll look a little further into this.
I still think there's a good chance it's the fromBlock/toBlock tracking, but eyeball parsing isn't getting me anywhere. Any chance there's some debug logs from eventCache? Reproducing this will probably be tricky, and maybe it's better to wait for the replacement?
One of the important things about working with Infura is that their backend nodes may not be in sync. You can make a request and find out about an event that happened in block N, then make another request which is routed to a different backend, and that other backend has no clue that block N has happened yet.
I'm not sure if we are lagging the event listener currently, but doing so by one, two, or three blocks is a quick and dirty way to reduce these to a minimum.
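For illustration, a minimal sketch of trailing the head by a couple of blocks (hypothetical names and value; the listener's actual --trailBlocks handling may differ):

```js
// Illustrative only: lag event queries a few blocks behind the head so that
// whichever Infura backend serves the follow-up calls has likely seen them.
const TRAIL_BLOCKS = 2 // hypothetical value, analogous to --trailBlocks

async function latestSafeBlock(web3) {
  const head = await web3.eth.getBlockNumber()
  return head - TRAIL_BLOCKS
}
```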
@franckc Can you get listener logs from the time that block event went through?
Hmm. I actually don't have the block number. If you guys found it, lmk; otherwise I'll try to dig it up to get logs around that time...
Couple other things to note:
Regarding possible Infura lag, you are right @DanielVF - we are currently setting the --trailBlocks option of the listener to 1 in prod. We could increase it though:
Oh, and @mikeshultz, to answer your question: I do think it is a good property for the eventCache to return events in ascending order of (blockNumber, logIndex).
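For example, a comparator that yields that ordering (illustrative sketch, not the actual eventCache code):

```js
// Sort events into ascending (blockNumber, logIndex) order before returning
// them -- web3 event objects carry both fields.
function sortEvents(events) {
  return events
    .slice()
    .sort((a, b) => a.blockNumber - b.blockNumber || a.logIndex - b.logIndex)
}
```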
This offer was accepted in block 7529560
Here are the logs (only look at textPayload):
Interestingly, it looks like the OfferAccepted event did get picked up and processed by the listener. So now it makes me think the issue is that handler_marketplace.js:_getOfferDetails (which makes the GraphQL call to fetch the offer data) returned stale data (e.g. the offer in created state rather than accepted)...
I wonder if we could be hitting an Infura node that lags when making this call? https://github.com/OriginProtocol/origin/blob/master/packages/eventsource/src/index.js#L415
I don't know if our architecture supports it, but it would be great if the listener could feed the event cache. When the event listener pulls a new event, it could feed that event to the event cache before processing the event itself. This would keep the event cache constantly up to date with the latest events that the listener has seen.
This would also reduce the number of queries to the blockchain.
[edit: I guess this would not solve the problem of an old node for the things that are non-event based.]
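A rough sketch of that idea (the method names here are hypothetical; the actual listener/eventCache wiring may look different):

```js
// Illustrative only -- addEvent and process are hypothetical method names.
// Feed each event the listener sees into the cache before running handlers,
// so the cache never lags behind what the listener has processed.
async function handleNewEvent(event, eventCache, handlers) {
  await eventCache.addEvent(event)
  for (const handler of handlers) {
    await handler.process(event)
  }
}
```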
Since the new event-cache got integrated with the listener, we haven't seen this issue. Closing for now.
This is actually still happening :(
Event ListingCreated was skipped at block 7643918. Here are the logs:
The logs in the example above show the event-cache did not get the event...
I suspect this could be caused by an Infura node lagging, e.g. we issue a query for events in block 7643918 but the Infura node we hit hasn't indexed that data yet and returns no event. Some options to mitigate this:
@mikeshultz any thoughts ?
@franckc Do you have the tx in question, by chance?
Nevermind, found it. 0x06c09466a44e17d7e5abe00b0a49a82ae3eeef2e73bba7958d6e20346bc03353
@franckc Can you get me a full-ish log dump for this timeframe?
Added PR #2152 that adds a check and some debug output. There's definitely a chance that the websocket is giving new block headers while subsequent calls could be from a node a block or more behind. All their stuff is clustered and there's really no guarantee of consistency. There's not even a guarantee this will point out the error, since the eth_blockNumber call might not be sent to the same node that gets the eth_getLogs call, but it could maybe shed some light.
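Roughly the kind of check being described, as a sketch (the real PR may differ):

```js
// Illustrative sketch: flag the case where the node answering eth_blockNumber
// appears to be behind the range we are about to request logs for.
async function warnIfNodeLagging(web3, toBlock) {
  const nodeHead = await web3.eth.getBlockNumber()
  if (nodeHead < toBlock) {
    console.warn(
      `Node reports head ${nodeHead} but we want logs up to ${toBlock}; ` +
        'we may be talking to a lagging backend'
    )
  }
}
```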
There's also a mild chance of a chain reorg causing issues with this listener, but I'm not sure how to even detect that.
The best long-term option might be to continually do sanity checks on data up to X blocks behind the leading block, but that's a non-trivial change.
If the problem is actually inconsistent reads from infura, (I'm not convinced yet), then pointing the listener at one of our own geth instances should definitely solve that problem.
This certainly introduces its own long-term infrastructure load. However, we could also run an experimental parallel listener temporarily on our node and compare results.
@mikeshultz Here are the logs. Note it is in Google cloud format but hopefully you can make sense of it... Let me know otherwise... https://gist.github.com/franckc/215c4ddae24ade00b88e9d9a45c81c04
The websocket header thing is irrelevant for the listener (it's only used with graphql). I'm thinking it's unlikely that the debug stuff I added will diagnose this issue.
...where are these logs coming from? Was just going through them to kind of trace the codepath and I'm not finding any output matching this:
"textPayload": "\u001b[32;22m[INFO] \u001b[39;1mevent-listener\u001b[0m: Querying events from 7643917 up to 7643919\n",
This is the closest but it's not a match. Is this service up to date with master?
@mikeshultz I merged #2152 to master and I'm going to deploy it to prod this morning...
As to logs, the listener running in prod is what's on the stable branch. Last night I landed a small PR that slightly changes the logging in master. See #2150.
Added PR #2154 to use the block number from the latest known event. This doesn't exactly make the from/to block logic any clearer, but should help prevent any issues from out of sync nodes if enabled.
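One possible reading of "use the block number from the latest known event" (a sketch under that assumption only, not the PR's actual implementation):

```js
// Purely illustrative -- not the PR's code. Keep a high-water mark of the
// newest block we have already seen an event in, and never trust a reported
// head below it.
let latestEventBlock = 0 // updated as events are processed

async function effectiveHead(web3) {
  const reported = await web3.eth.getBlockNumber()
  return Math.max(reported, latestEventBlock)
}
```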
Feel free to wait to merge that until after we get more info from that extra debug logging.
I'd really like to greatly simplify this logic, but all this extra logic is there for performance reasons (request minimization), which seems to be a high priority. If issues continue, maybe we can go that route and worry about perf later.
Chatted with @mikeshultz this morning. He came up with this superb plan:
Austin reported that he accepted offer ID 1-000-1103-26, but it is still showing as status Pending in the DB:
Need to investigate why the listener did not process the Accept event.
Seller: 0xb4ffecb1aa10758988b83f7eee5cb3765c133359