betamos / payload

File transfer for humans.

Devices unable to communicate after one year of use #8

Open probablykory opened 1 year ago

probablykory commented 1 year ago

I have a MacBook on Catalina and a NUC on Win10. Both devices are on the same network and can ping each other, and I can ssh between them, but for some reason both machines list zero available devices: "You're the only one here".

Is there any way to troubleshoot device discovery?

betamos commented 8 months ago

I'm sorry, for some reason my GitHub notifications are not set up correctly. Probably too late to the party, but in case anyone else runs into the same issue:

Checking mDNS

On version v0.1.3 of Payload you can run the standalone payload-agent[.exe] next to the main binary from the command line. If mDNS discovery is working at all, you should see messages about it:

{"type":"local:discovered","data":{"did":"NW9Pmc.....","addrs":[{"IP":"192.168.1.131","Port":50647,"Zone":""}]}}

If discovery is not working, there's an issue with mDNS: most likely it is misconfigured (e.g. disabled by the router) or firewalled. I'm working on a much improved mDNS library over at https://github.com/betamos/zeroconf which you can play around with if you're curious (there's a CLI). Alternatively, use any other mDNS-based software/tool and see if there's an issue.

Checking connections

OTOH, if discovery is working, but the device shows as offline, you can check if it connects appropriately:

{"type":"local:connected","data":{"did":"NW9Pmc.....","last_id":54}}

If not, then there's an issue connecting. Payload installs a firewall rule on Windows to help with OS firewall issues. I've seen a driver bug for a NIC on Windows, but more likely this is a bug in Payload, possibly from mDNS giving bad addresses. You could try turning off network interfaces that are not in use to reduce the chance of confusion, but the chances that this helps are small.
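The two checks above can be combined into a small triage script. This is only a sketch: it assumes the agent prints one JSON object per line in the shape of the two sample events shown above, and the function names `summarize_agent_log` and `diagnose` are made up for illustration. Whether the agent writes to stdout or stderr is not stated here, so capture whichever stream carries the messages.

```python
import json


def summarize_agent_log(lines):
    """Return (discovered, connected) sets of device IDs seen in agent output.

    Assumes events look like the samples in this thread:
    {"type": "local:discovered", "data": {"did": "..."}}
    {"type": "local:connected",  "data": {"did": "..."}}
    """
    discovered, connected = set(), set()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON object
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        data = event.get("data")
        did = data.get("did") if isinstance(data, dict) else None
        if did is None:
            continue
        if event.get("type") == "local:discovered":
            discovered.add(did)
        elif event.get("type") == "local:connected":
            connected.add(did)
    return discovered, connected


def diagnose(lines):
    """Map the observed events to the troubleshooting step they point at."""
    discovered, connected = summarize_agent_log(lines)
    if not discovered:
        return "no devices discovered: check mDNS"
    missing = discovered - connected
    if missing:
        return "discovered but not connected: " + ", ".join(sorted(missing))
    return "all discovered devices connected: possibly a sync issue"
```

To use it, save the agent output to a file and pass the open file (or a list of lines) to `diagnose`.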

Resetting the app

If you can connect, but it's not showing up as online, it's likely a sync issue. This can happen if the devices cannot send or process messages (including stored, older messages). I've seen this happen only once, after trying to transfer a very large set of files. This can cause the JSON message to be too long, and permanently damage the ability for the devices to "catch up" or sync.

If this happened, you can try to reset the local database, by going to the data directory (it's in ~/Library/Application Support/Payload on Mac - you can search for a file called events-v2.jsonl). Delete all files in this directory to reset the state, and start over. It should be enough to do this on one side only.
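Since the files in the data directory are useful for debugging later, it may be worth copying them before deleting. The following is a minimal sketch of a backup-then-reset step; the function name `backup_and_reset` and the backup folder naming are hypothetical, and it assumes Payload is not running while the files are removed. Only the Mac path is given above, so the data directory is passed in as a parameter.

```python
import shutil
import time
from pathlib import Path


def backup_and_reset(data_dir, backup_root):
    """Copy data_dir to a timestamped folder under backup_root, then empty data_dir.

    backup_root must be outside data_dir. Returns the path of the backup copy.
    """
    data_dir = Path(data_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_dir = Path(backup_root) / f"payload-backup-{stamp}"
    # copytree creates backup_dir (and missing parents) and fails if it already exists
    shutil.copytree(data_dir, backup_dir)
    # Remove everything inside the data directory, but keep the directory itself
    for entry in data_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    return backup_dir
```

On a Mac this could be called as `backup_and_reset(Path.home() / "Library/Application Support/Payload", Path.home() / "payload-backups")`, where `payload-backups` is just an example location.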

Help out

Testing networked apps is hard! If you (or anyone else) would like to help me debug and test new versions (and help me fix your issue), I'd love that. I can offer future premium features for free. Open an issue or see http://payload.app/about for contact. Thanks.

probablykory commented 8 months ago

Indeed late, but nonetheless this helped me resolve the issue. Discovery and connecting worked fine for my setup. Resetting did the trick. I first reset the side that tends to receive more files, and when that didn't work I reset the transmitter. (My transfers, though not always, tend to be from my main Windows machine over to the MacBook.)

If I'm able to reproduce the issue again in the future, I'll attempt to dig further into the events-v2.jsonl file to corroborate the sync issue.

Thanks for the reply. Payload is a wonderful app and I'm happy to be using it again for casual xfers 😃

betamos commented 8 months ago

Amazing. Are you able to confirm that you had a large number (say >200) of files in any transfer, at some point, even if it was canceled? If you still have events-v2.jsonl (the file that I in a moment of great wisdom asked you to delete without backing up), you can check for line lengths. There's one JSON message per line, so if there's an outlier it should be extremely long.
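The line-length check can be scripted. This is a rough sketch, assuming the file is plain text with one message per line as described; the function name `long_line_outliers` and the "10 times the median" threshold are arbitrary choices for illustration, not anything Payload defines.

```python
import statistics


def long_line_outliers(path, factor=10):
    """Return (line_number, length) for lines longer than factor x the median length."""
    with open(path, encoding="utf-8", errors="replace") as f:
        lengths = [len(line.rstrip("\n")) for line in f]
    if not lengths:
        return []
    threshold = factor * statistics.median(lengths)
    # Line numbers are 1-based so they match what a text editor shows
    return [(i, n) for i, n in enumerate(lengths, start=1) if n > threshold]
```

Calling `long_line_outliers("events-v2.jsonl")` returns an empty list when no line stands out, and otherwise the positions and lengths of the unusually long messages.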

If not, there's another sync-preventing bug lurking (which is very interesting if so). Anything about the nature of the problematic transfer would be helpful (such as symlinks, "special" characters, deeply nested structure, etc). Fortunately these things are fairly easy to unit test.

BTW: What's the reason for staying on Catalina? Just wondering because several of my tools (Go and Tauri) are rather aggressive with discontinuing legacy OS support - something I have very little control over.

probablykory commented 8 months ago

Luckily the files were simply in the trashcans, not actually deleted. There are two lines that are orders of magnitude longer than the rest, which correspond to xfers of chunks of a music library, 6636 cols and 21119 cols in length. I'm attaching both the macbook and the windows files just in case.

Per your other question: The reason for staying on Catalina is simply due to the age of the device. I'll have to check to see if anything's changed, but last I tried, Catalina was the latest OS it could run. It's a fully functional mbp 13 from mid 2012; the bugger just won't die.

betamos commented 8 months ago

Thanks again.

I actually discovered this bug recently, but it hadn't been reported before. It's not because of too large file sets at all, but because there's a TLS cert with an expiry of 1 year. The devices were initialized in July 2022, and were thus mutually unable to communicate from July 2023.

This cert should never have been persisted to disk, but instead regenerated.
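The timing works out as simple date arithmetic. The sketch below is illustrative only: the exact issue date and a lifetime of exactly 365 days are assumptions (only "July 22" and "1 year" are stated above), and `cert_expired` is a hypothetical helper, not part of Payload.

```python
from datetime import datetime, timedelta, timezone


def cert_expired(issued_at, now, lifetime=timedelta(days=365)):
    """True once `now` has reached the end of the certificate's validity window."""
    return now >= issued_at + lifetime


# Hypothetical issue date: the thread only says the devices were set up in July 2022
issued = datetime(2022, 7, 15, tzinfo=timezone.utc)
still_valid_day = datetime(2023, 7, 14, tzinfo=timezone.utc)
expired_day = datetime(2023, 7, 15, tzinfo=timezone.utc)
```

With these dates, `cert_expired(issued, still_valid_day)` is false and `cert_expired(issued, expired_day)` is true, which matches devices working for a year and then failing on both sides at about the same time.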

"It's a fully functional mbp 13 from mid 2012; the bugger just won't die."

I hear you. I suffered from a similar type of success with my old MBP from 2011, which eventually died because of a third-party Radeon GPU, which wasn't even Apple's fault. Godspeed to it.

Feel free to delete the zip.

betamos commented 16 hours ago

Anyone reading this can now download the new beta. https://payload.app/beta