this is the cabal nick tried to join:
cabal://3115ddead69876368789e03101ab5136ccd445449024f08b3318986690467905
I created it, so there definitely shouldn't be as many peers as the sidebar is showing, haha. I only see myself in it, with my messages.
I'm able to join & sync down that cabal fyi, but nobody on the public one
me and @nikolaiwarner are experimenting:
@nikolaiwarner: it sounds like you were running an old client or something? how did you fix this on your end? I don't think others were affected by this? @cblgh?
When I look at my cabal directory for 00794539a8ce6bed76e40b9d259666303d39271da66140282bfbce76fd9a4434
I see 776 directories, which means 776 hypercores. The maximum number of hypercores that hypercore-protocol replicates over one stream is 128 (why?). It emits an error, but hypercore doesn't listen for it, so it gets swallowed silently. This answers one question, which was "why can't we replicate each other's messages?".
The other question is, "how did we end up with 776 hypercores on this cabal?". At first I thought it was an old client trying to connect, and multifeed was interpreting the hyperdb protocol data as noise and creating lots of junk feeds, but I can't reproduce that. Maybe somebody has been generating tons of new keys (intentionally or not) on our public cabals.
Either way, it seems important that
@noffle FWIW I saw many (116) directories being generated locally, all at once, in a case where the cabal key was private and there were only 2 clients.
@fenwick67 I think this has to do with an old (hyperdb) client trying to participate in a newer (kappa-core) cabal, but this remains unproven.
> @fenwick67 I think this has to do with an old (hyperdb) client trying to participate in a newer (kappa-core) cabal, but this remains unproven.
which has a wip fix in https://github.com/cabal-club/cabal-core/pull/22 :D!
I can't reproduce this. I created a kappa cabal and had a hyperdb client try to join it. It created empty users in the user sidebar, but my local cabal dir wasn't getting filled with empty hypercores like we see on the broken cabals.
(I'm using the latest cabal CLI for all of this:)
I cleared out .cabal and tried again with the public cabal (cabal://00794539...). I got 43 feeds in .cabal, but no chat messages show in the UI. I see one peer in the sidebar, 00794539, which is the same as the cabal's hash! Is that the "fake" hypercore?
Those 43 feeds come from out there on the internet somewhere. Doing this experiment with a couple of local peers and no internet connection, I don't get them (nor the mystery peer 00794539).
I created a fresh cabal of my own and added a couple of local peers all running inside my machine. It all works as expected with cabal CLI, and no extra feeds appear. I also joined with the latest cabal-desktop and though it had troubles of its own [1], no extra feeds appeared.
[1] When it joined, it showed preexisting messages as all coming from conspirator. After a couple of restarts and new chat messages, it showed everything correctly.
@cinnamon-bun
> (I'm using the latest cabal CLI for all of this:)
I actually just published some multifeed-index fixes very recently (within the hour). I wonder if you tested before I pushed those or since. It might be that these fixes repaired the issues you were seeing before!
> I see one peer in the sidebar, 00794539, which is the same as the cabal's hash! Is that the "fake" hypercore?
The peer whose key matches the cabal is the original creator of that cabal. (maybe me?)
Hey debug friends! I created a new test cabal for us to try and break with the latest cabal-cli client:
cabal://58dc528ab340938eb66a29f80583ca1b0dcb9034ee78875ac695fbc8359b3581
I pushed some fixes to multifeed-index that explain some weird race conditions around messages not appearing, so I'm keen to see if we can break another cabal and do forensics on it if so.
Feel free to spam the heck outta this & do whatever weird stuff you'd like (as long as you document what you did!)
I'm having the same issue; the only peer I see in the main cabal beyond myself is someone named 00794539, even with two different machines on the same LAN. Would installing new clients directly from git work better? I am using the latest releases of the terminal and desktop clients, from appimage and npm.
It's a protocol bug, unfixed! New cabals that remain private tend to stay OK though.
Could you elaborate on that? A bug in what part of cabal? Would using an older version fix this, since it seems to only be an issue with the new version?
Also, I'm not sure what you mean by a "new" cabal that "remains private". Like it didn't exist before the protocol update? What does private mean in this context?
@makeworld-the-better-one I think it's a bug in a lower level database module, multifeed. Using an older version of cabal would let you sidestep this bug, but there were even worse bugs with hyperdb!
By "new", anything created now onwards (/w latest cabal-cli). By "private" I mean, low traffic. The more users hitting the cabal, the higher chance of the race condition being hit. It doesn't make sense to rely on any cabals right now though for anything critical, until this bug is fixed.
@noffle thanks for the info! By latest cabal-cli do you mean from git or from npm? Also, didn't you fix the race condition as you said above? Or it's still having issues I guess.
npm i -g cabal will work! And yes, there's still at least another race condition to find & fix.
I tried to create a cabal on one machine and chat between that machine and another one on the same LAN. It didn't work: the cabal was created, but neither machine could see the other. It did work with two clients on the same machine, though. This is all with cabal cli from npm, from a day or two ago.
@makeworld do other mDNS apps work for you on this network? airpaste[1] is a fun + easy one to try.
@noffle good call, thanks. That's a cool app... that unfortunately doesn't work. So yes you must be right, it's probably mDNS that's failing. I'm guessing that's an issue with one computer being on wifi and another being wired, as said here. I'll try with two devices connected in the same way, or try and fix my router.
OK, I think this is fixed! :ok_hand: You can install cabal@5.0.0 to get the latest goodies, or git fetch the latest and reinstall deps.
Some cabals "stopped working". The symptoms were that you'd see yourself connect to peers, but new messages wouldn't appear, and others wouldn't see your messages. New peers couldn't download old state and would see an empty-looking cabal. This seemed to happen more frequently with high traffic cabals, like the public cabal on the cabal.chat website.
On a lower level, we noticed that these "broken cabals" had many empty hypercores in them. (cabal is built on multifeed, which manages a set of hypercores; each user maps to one hypercore.) Often many more hypercores than users: one public cabal had over 700.
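(If you want to check one of your own cabals, counting the directories on disk is enough. This is a minimal sketch; the layout it assumes — one directory per hypercore directly under ~/.cabal/&lt;cabal-key&gt; — may differ between clients and platforms.)

```js
// Minimal sketch: count the hypercore directories for one cabal on disk.
// ASSUMPTION: data lives under ~/.cabal/<cabal-key>/ with one directory
// per hypercore; adjust the path for your client.
const fs = require('fs')
const os = require('os')
const path = require('path')

const key = '00794539a8ce6bed76e40b9d259666303d39271da66140282bfbce76fd9a4434'
const cabalDir = path.join(os.homedir(), '.cabal', key)

const feeds = fs.readdirSync(cabalDir).filter(function (name) {
  return fs.statSync(path.join(cabalDir, name)).isDirectory()
})

console.log(feeds.length + ' hypercores on disk') // e.g. 776 on a broken cabal
```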
At first I thought maybe we were getting spammed by someone, but @cblgh, @nikolaiwarner, and I were able to reproduce it in a cabal only we had access to. I then thought it might be the result of someone unknowingly using an old version of cabal (the hyperdb one; package-lock.json, yarn.lock, and plain node_modules management can be confusing), but eventually I was able to reproduce it myself using a version of cabal that was definitely on master with the latest deps.
multifeed wraps the hypercore replication protocol, hypercore-protocol. This wrapper has each peer send a formatted header before starting hypercore-protocol replication. The format was <UINT32:NUM_KEYS><LIST(BUFFER(32))> (each BUFFER(32) is a hypercore public key), so that each side knew which hypercores would be synced over the replication stream. Each side looks for keys it doesn't already have locally and creates them in preparation for sync.
What was happening, though, is that when garbage or unexpected data is sent, the peer still interprets the first 4 bytes as the number of keys, even if that comes out to something huge like 1290801. It will then read the next 1290801 * 32 bytes as keys of hypercores to create! Wuh oh.
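To make that concrete, here's a rough sketch of how a header in that shape gets read. This is illustrative only, not multifeed's actual source, and the endianness of the UINT32 is an assumption.

```js
// Illustrative parse of the old <UINT32:NUM_KEYS><LIST(BUFFER(32))> header.
// NOT multifeed's real implementation; endianness is assumed.
function parseOldHeader (buf) {
  const numKeys = buf.readUInt32LE(0) // garbage bytes still "decode" to a number
  const keys = []
  for (let i = 0; i < numKeys; i++) {
    keys.push(buf.slice(4 + i * 32, 4 + (i + 1) * 32))
  }
  return keys // each unknown key then becomes a new (empty) local hypercore
}
```

There's nothing in that format a receiver can use to tell a real header apart from random bytes, which is why a single corrupted handshake can mint a pile of fake keys.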
This was very difficult to track down, since we knew nothing in the beginning but "sometimes cabals stop syncing".
@nikolaiwarner wrote a patch to cabal-cli that adds --message and --timeout switches, letting the client post a message, sync, and then quit. He then wrapped this in a bash loop and had two machines over the internet write to the same private cabal for a long time (1+ days and thousands of messages). Eventually, he noticed that rogue empty hypercores would start to get created.
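For anyone wanting to reproduce that setup, the loop might look something like this. The exact flag syntax is an assumption (the comment above only says --message and --timeout switches were added), so treat it as a sketch rather than the actual script; the key is the test cabal posted earlier.

```sh
# Hypothetical version of the stress loop described above.
# ASSUMPTIONS: --message takes the message text and --timeout is a
# quit-after delay in ms; the real patched flags may differ.
KEY=cabal://58dc528ab340938eb66a29f80583ca1b0dcb9034ee78875ac695fbc8359b3581
n=0
while true; do
  cabal --key "$KEY" --message "stress test $n" --timeout 20000
  n=$((n+1))
done
```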
I added logging to multifeed replication using debug, and asked @cblgh and @nikolaiwarner to try to reproduce the bug again while also capturing log output.
Eventually I was able to reproduce it with logging turned on, and realized that somehow, once in every few thousand sync connections, one peer would send garbage-looking data during the header handshake / key exchange, resulting in ones, tens, or hundreds of empty hypercores being created locally. Once they were created locally, the node would treat them like real hypercores and sync them to other peers.
The reason this would break cabals, and not just result in vestigial hypercores, is that hypercore-protocol has a hardcoded limit of 128 hypercores per replication stream. The more rogue hypercores there are, the lower the chance that you'll replicate a hypercore belonging to a real user, so you eventually stop getting any new data from anyone, because your replication streams are saturated by the rogue empty hypercores.
I updated the header format in multifeed to send a length-prefixed JSON object as its header. This is much more strongly structured than the old format, and very difficult to produce through sending random bytes, or bytes from some other protocol by mistake. The new header also includes a replication protocol version, so that we can make backwards-incompatible changes (if we need to) in the future.
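As a sketch of what a length-prefixed JSON header looks like (the field names and the 4-byte prefix here are assumptions, not necessarily what multifeed actually ships):

```js
// Sketch of a length-prefixed JSON replication header.
// Field names ("version", "keys") and the 4-byte length prefix are
// assumptions for illustration.
function encodeHeader (header) {
  const body = Buffer.from(JSON.stringify(header))
  const len = Buffer.alloc(4)
  len.writeUInt32LE(body.length, 0)
  return Buffer.concat([len, body])
}

function decodeHeader (buf) {
  const len = buf.readUInt32LE(0)
  const json = buf.slice(4, 4 + len).toString()
  return JSON.parse(json) // throws on garbage instead of inventing keys
}

// e.g. encodeHeader({ version: '1.0.0', keys: [/* 64-char hex hypercore keys */] })
```

The practical difference from the old format is that random bytes almost never survive JSON.parse, so a corrupted handshake fails loudly instead of quietly creating hundreds of feeds.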
We've tried reproducing the bug with lots of us spamming a private cabal, and so far it seems to be holding up.
What's difficult is not knowing the root cause: why does a peer sometimes send unexpected header data? For now, until the root cause of the race condition is clear, things should work OK, and in the rare case that a garbage header gets sent, the replication channel will shut down and reset itself (and, presumably, work).
I am really happy to see this fixed, thanks! But I am still unable to talk between two machines on the same LAN, or read from public cabals. I am on cabal (cli) 5.0.0 on both of them, and cannot see any messages or peers. I know I am having mDNS issues, but I figure they should still be able to communicate over bittorrent.
@makeworld-the-better-one Have you tried another network, to rule that out?
Also, the new public cabal is bd45fde0ad866d4069af490f0ca9b07110808307872d4b659a4ff7a4ef85315a
@makeworld-the-better-one You can try running each peer with DEBUG=* cabal --key KEY --seed to capture a full dump of debug output for mDNS, bittorrent, etc., which might give some clues!
🔥 🔥 🔥 🔥 🔥 🔥
Congratulations!
@noffle So as you maybe saw in the new public cabal, I was able to see the messages in the cabal you posted, on both machines. I said something from my pi, and it showed up on my computer, but I couldn't get the reverse to work.
@makeworld-the-better-one same network?
@noffle Yep.
I found some clues about header corruption and wrote them up in multifeed's issues.
It seems to be losing some bytes at the start of the stream, then beginning to deserialize in the middle of the header's JSON string.
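A tiny illustration of that failure mode, using the same hypothetical header encoding sketched earlier (the byte counts are made up):

```js
// Drop a few bytes from the front of a length-prefixed JSON header and
// watch the decode go wrong: the "length" is now read from inside the JSON.
const body = Buffer.from(JSON.stringify({ version: '1.0.0', keys: [] }))
const len = Buffer.alloc(4)
len.writeUInt32LE(body.length, 0)
const header = Buffer.concat([len, body])

const damaged = header.slice(3)          // pretend 3 bytes were lost in transit
const badLen = damaged.readUInt32LE(0)   // garbage length pulled from JSON bytes
try {
  JSON.parse(damaged.slice(4, 4 + badLen).toString())
} catch (err) {
  console.log('header rejected:', err.message)
}
```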
Connecting to a cabal shows yourself but no nicks, no messages, and no channels from remote peers. A peer joining from another client on the same machine appears correctly, however.
cc @noffle @cblgh