this is the cabal nick tried to join:
cabal://3115ddead69876368789e03101ab5136ccd445449024f08b3318986690467905
I created it, so there definitely shouldn't be as many peers as the sidebar is showing, haha. I only see myself in it, with my messages.
I'm able to join & sync down that cabal fyi, but nobody on the public one
me and @nikolaiwarner are experimenting:
@nikolaiwarner: it sounds like you were running an old client or something? how did you fix this on your end? I don't think others were affected by this? @cblgh?
When I look at my cabal directory for 00794539a8ce6bed76e40b9d259666303d39271da66140282bfbce76fd9a4434
I see 776 directories, which means 776 hypercores. The maximum number of hypercores that hypercore-protocol replicates over one stream is 128 (why?). It emits an error, but hypercore doesn't listen for it, so it gets swallowed silently. This answers one question, which was "why can't we replicate each other's messages?".
The other question is, "how did we end up with 776 hypercores on this cabal?". At first I thought it was an old client trying to connect, and multifeed was interpreting the hyperdb protocol data as noise and creating lots of junk feeds, but I can't reproduce that. Maybe somebody has been generating tons of new keys (intentionally or not) on our public cabals.
Either way, it seems important that
@noffle FWIW I saw many (116) directories being generated locally, all at once, in a case where the cabal key was private and there were only 2 clients.
@fenwick67 I think this has to do with an old (hyperdb) client trying to participate in a newer (kappa-core) cabal, but this remains unproven.
> @fenwick67 I think this has to do with an old (hyperdb) client trying to participate in a newer (kappa-core) cabal, but this remains unproven.
which has a wip fix in https://github.com/cabal-club/cabal-core/pull/22 :D!
I can't reproduce this. I created a kappa cabal and had a hyperdb client try to join it. It created empty users in the user sidebar, but my local cabal dir wasn't getting filled with empty hypercores like we see on the broken cabals.
(I'm using the latest cabal CLI for all of this:)
I cleared out .cabal and tried again with the public cabal (cabal://00794539...). I got 43 feeds in .cabal, but no chat messages show in the UI. I see one peer in the sidebar, 00794539, which is the same as the cabal's hash! Is that the "fake" hypercore?
Those 43 feeds come from out there on the internet somewhere. Doing this experiment with a couple of local peers and no internet connection, I don't get them (nor the mystery peer 00794539).
I created a fresh cabal of my own and added a couple of local peers all running inside my machine. It all works as expected with cabal CLI, and no extra feeds appear. I also joined with the latest cabal-desktop and though it had troubles of its own [1], no extra feeds appeared.
[1] When it joined, it showed preexisting messages as all coming from conspirator. After a couple of restarts and new chat messages, it showed everything correctly.
@cinnamon-bun
> (I'm using the latest cabal CLI for all of this:)
I actually just published some multifeed-index fixes very recently (within the hour). I wonder if you tested before I pushed those or since. It might be that these fixes repaired the issues you were seeing before!
> I see one peer in the sidebar, 00794539, which is the same as the cabal's hash! Is that the "fake" hypercore?
The peer whose key matches the cabal is the original creator of that cabal. (maybe me?)
Hey debug friends! I created a new test cabal for us to try and break with the latest cabal-cli client:
cabal://58dc528ab340938eb66a29f80583ca1b0dcb9034ee78875ac695fbc8359b3581
I pushed some fixes to multifeed-index that explain some weird race conditions around messages not appearing, so I'm keen to see if we can break another cabal and do forensics on it if so.
Feel free to spam the heck outta this & do whatever weird stuff you'd like (as long as you document what you did!)
I'm having the same issue; the only peer I see in the main cabal beyond myself is someone named 00794539, even with two different machines on the same LAN. Would installing new clients directly from git work better? I am using the latest releases of the terminal and desktop clients, from appimage and npm.
It's a protocol bug, unfixed! New cabals that remain private tend to stay OK though.
Could you elaborate on that? A bug in what part of cabal? Would using an older version fix this, since it seems to only be an issue with the new version?
Also, I'm not sure what you mean by a "new" cabal that "remains private". Like it didn't exist before the protocol update? What does private mean in this context?
@makeworld-the-better-one I think it's a bug in a lower level database module, multifeed. Using an older version of cabal would let you sidestep this bug, but there were even worse bugs with hyperdb!
By "new", anything created now onwards (/w latest cabal-cli). By "private" I mean, low traffic. The more users hitting the cabal, the higher chance of the race condition being hit. It doesn't make sense to rely on any cabals right now though for anything critical, until this bug is fixed.
@noffle thanks for the info! By latest cabal-cli do you mean from git or from npm? Also, didn't you fix the race condition as you said above? Or it's still having issues I guess.
npm i -g cabal will work! And yes, there's still at least another race condition to find & fix.
I tried to create a cabal on one machine and chat between that machine and another one on the same LAN. It didn't work: the cabal was created, but neither machine could see the other. It did work with two clients on the same machine, though. This is all with cabal cli from npm, from a day or two ago.
@makeworld do other mDNS apps work for you on this network? airpaste[1] is a fun + easy one to try.
@noffle good call, thanks. That's a cool app... that unfortunately doesn't work. So yes you must be right, it's probably mDNS that's failing. I'm guessing that's an issue with one computer being on wifi and another being wired, as said here. I'll try with two devices connected in the same way, or try and fix my router.
OK, I think this is fixed! :ok_hand: You can install cabal@5.0.0 to get the latest goodies, or git fetch the latest and reinstall deps.
Some cabals "stopped working". The symptoms were that you'd see yourself connect to peers, but new messages wouldn't appear, and others wouldn't see your messages. New peers couldn't download old state and would see an empty-looking cabal. This seemed to happen more frequently with high traffic cabals, like the public cabal on the cabal.chat website.
On a lower level, we noticed that these "broken cabals" had many empty hypercores in them. (cabal is built on multifeed, which manages a set of hypercores; each user maps to one hypercore.) Often many more hypercores than users: one public cabal had over 700.
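(If you want to check one of your own cabals, counting the directories on disk is enough. This is a minimal sketch; the layout it assumes — one directory per hypercore directly under ~/.cabal/&lt;cabal-key&gt; — may differ between clients and platforms.)

```js
// Minimal sketch: count the hypercore directories for one cabal on disk.
// ASSUMPTION: data lives under ~/.cabal/<cabal-key>/ with one directory
// per hypercore; adjust the path for your client.
const fs = require('fs')
const os = require('os')
const path = require('path')

const key = '00794539a8ce6bed76e40b9d259666303d39271da66140282bfbce76fd9a4434'
const cabalDir = path.join(os.homedir(), '.cabal', key)

const feeds = fs.readdirSync(cabalDir).filter(function (name) {
  return fs.statSync(path.join(cabalDir, name)).isDirectory()
})

console.log(feeds.length + ' hypercores on disk') // e.g. 776 on a broken cabal
```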
At first I thought maybe we were getting spammed by someone, but @cblgh, @nikolaiwarner, and I were able to reproduce it in a cabal only we had access to. I then thought it might be the result of someone unknowingly using an old version of cabal (the hyperdb one; package-lock.json, yarn.lock, and plain node_modules management can be confusing), but eventually I was able to reproduce it myself using a version of cabal that was definitely on master with the latest deps.
multifeed wraps the hypercore replication protocol, hypercore-protocol. This wrapper has each peer send a formatted header before starting hypercore-protocol replication. The format was <UINT32:NUM_KEYS><LIST(BUFFER(32))> (each BUFFER(32) is a hypercore public key), so that each side knew which hypercores would be synced over the replication stream. Each side looks for keys it doesn't already have locally and creates them in preparation for sync.
What was happening, though, is that when garbage or unexpected data is sent, the peer still interprets the first 4 bytes as the number of keys, even if that comes out to something huge like 1290801. It will then read the next 1290801 * 32 bytes as keys of hypercores to create! Wuh oh.
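To make that concrete, here's a rough sketch of how a header in that shape gets read. This is illustrative only, not multifeed's actual source, and the endianness of the UINT32 is an assumption.

```js
// Illustrative parse of the old <UINT32:NUM_KEYS><LIST(BUFFER(32))> header.
// NOT multifeed's real implementation; endianness is assumed.
function parseOldHeader (buf) {
  const numKeys = buf.readUInt32LE(0) // garbage bytes still "decode" to a number
  const keys = []
  for (let i = 0; i < numKeys; i++) {
    keys.push(buf.slice(4 + i * 32, 4 + (i + 1) * 32))
  }
  return keys // each unknown key then becomes a new (empty) local hypercore
}
```

There's nothing in that format a receiver can use to tell a real header apart from random bytes, which is why a single corrupted handshake can mint a pile of fake keys.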
This was very difficult to track down, since we knew nothing in the beginning but "sometimes cabals stop syncing".
@nikolaiwarner wrote a patch to cabal-cli that adds --message and --timeout switches, letting the client post a message, sync, and then quit. He then wrapped this in a bash loop and had two machines over the internet write to the same private cabal for a long time (1+ days and thousands of messages). Eventually, he noticed that rogue empty hypercores would start to get created.
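For anyone wanting to reproduce that setup, the loop might look something like this. The exact flag syntax is an assumption (the comment above only says --message and --timeout switches were added), so treat it as a sketch rather than the actual script; the key is the test cabal posted earlier.

```sh
# Hypothetical version of the stress loop described above.
# ASSUMPTIONS: --message takes the message text and --timeout is a
# quit-after delay in ms; the real patched flags may differ.
KEY=cabal://58dc528ab340938eb66a29f80583ca1b0dcb9034ee78875ac695fbc8359b3581
n=0
while true; do
  cabal --key "$KEY" --message "stress test $n" --timeout 20000
  n=$((n+1))
done
```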
I added logging to multifeed replication using debug, and asked @cblgh and @nikolaiwarner to try to reproduce the bug again while also capturing log output.
Eventually I was able to reproduce it with logging turned on, and realized that somehow, once in every few thousand sync connections, one peer would send garbage-looking data during the header handshake / key exchange, resulting in ones, tens, or hundreds of empty hypercores being created locally. Once they were created locally, the node would treat them like real hypercores and sync them to other peers.
The reason this would break cabals, and not just result in vestigial hypercores, is that hypercore-protocol has a hardcoded limit of 128 hypercores per replication stream. The more rogue hypercores there are, the lower the chance that you'll replicate a hypercore belonging to a real user, so you eventually stop getting any new data from anyone, because your replication streams are saturated by the rogue empty hypercores.
I updated the header format in multifeed to send a length-prefixed JSON object as its header. This is much more strongly structured than the old format, and very difficult to produce through sending random bytes, or bytes from some other protocol by mistake. The new header also includes a replication protocol version, so that we can make backwards-incompatible changes (if we need to) in the future.
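As a sketch of what a length-prefixed JSON header looks like (the field names and the 4-byte prefix here are assumptions, not necessarily what multifeed actually ships):

```js
// Sketch of a length-prefixed JSON replication header.
// Field names ("version", "keys") and the 4-byte length prefix are
// assumptions for illustration.
function encodeHeader (header) {
  const body = Buffer.from(JSON.stringify(header))
  const len = Buffer.alloc(4)
  len.writeUInt32LE(body.length, 0)
  return Buffer.concat([len, body])
}

function decodeHeader (buf) {
  const len = buf.readUInt32LE(0)
  const json = buf.slice(4, 4 + len).toString()
  return JSON.parse(json) // throws on garbage instead of inventing keys
}

// e.g. encodeHeader({ version: '1.0.0', keys: [/* 64-char hex hypercore keys */] })
```

The practical difference from the old format is that random bytes almost never survive JSON.parse, so a corrupted handshake fails loudly instead of quietly creating hundreds of feeds.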
We've tried reproducing the bug with lots of us spamming a private cabal, and so far it seems to be holding up.
What's difficult is not knowing the root cause: why does a peer sometimes send unexpected header data? For now, until the root cause of the race condition is clear, things should work OK, and in the rare case that a garbage header gets sent, the replication channel will shut down and reset itself (and, presumably, work).
I am really happy to see this fixed, thanks! But I am still unable to talk between two machines on the same LAN, or read from public cabals. I am on cabal (cli) 5.0.0 on both of them, and cannot see any messages or peers. I know I am having mDNS issues, but I figure they should still be able to communicate over bittorrent.
@makeworld-the-better-one Have you tried another network, to rule that out?
Also, the new public cabal is bd45fde0ad866d4069af490f0ca9b07110808307872d4b659a4ff7a4ef85315a
@makeworld-the-better-one You can try running each peer with DEBUG=* cabal --key KEY --seed to capture a full dump of debug output for mDNS, bittorrent, etc., which might give some clues!
🔥 🔥 🔥 🔥 🔥 🔥
Congratulations!
@noffle So as you maybe saw in the new public cabal, I was able to see the messages in the cabal you posted, on both machines. I said something from my pi, and it showed up on my computer, but I couldn't get the reverse to work.
@makeworld-the-better-one same network?
@noffle Yep.
I found some clues about header corruption and wrote them up in multifeed's issues.
It seems to be losing some bytes at the start of the stream, then beginning to deserialize in the middle of the header's JSON string.
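A tiny illustration of that failure mode, using the same hypothetical header encoding sketched earlier (the byte counts are made up):

```js
// Drop a few bytes from the front of a length-prefixed JSON header and
// watch the decode go wrong: the "length" is now read from inside the JSON.
const body = Buffer.from(JSON.stringify({ version: '1.0.0', keys: [] }))
const len = Buffer.alloc(4)
len.writeUInt32LE(body.length, 0)
const header = Buffer.concat([len, body])

const damaged = header.slice(3)          // pretend 3 bytes were lost in transit
const badLen = damaged.readUInt32LE(0)   // garbage length pulled from JSON bytes
try {
  JSON.parse(damaged.slice(4, 4 + badLen).toString())
} catch (err) {
  console.log('header rejected:', err.message)
}
```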
Connecting to a cabal shows yourself but no nicks, no messages, and no channels from remote peers. A peer joining from another client on the same machine appears correctly, however.
cc @noffle @cblgh