FoldingAtHome / fah-client-bastet

Folding@home client, code named Bastet
GNU General Public License v3.0

Cloned machines have same F@H ID which leads to conflicts #216

Closed hucker75 closed 6 months ago

hucker75 commented 9 months ago

I have two almost identical machines; I cloned one from the other. With the old Beta it was OK, but the new Beta thinks they're the same machine. Where is this info stored?

kbernhagen commented 9 months ago

The top client.db has the client identity, among other things. You should not copy it when cloning.

When a client is linked to an account, it will pick up user, team, passkey, cause.

kbernhagen commented 9 months ago

You should not copy the work directory either.

kbernhagen commented 9 months ago

Cloning is a bad idea in general.

kbernhagen commented 9 months ago

I believe the mechanism for partially configuring new clients in a mass deployment is to craft and pre-install a shared config.xml. Nothing else is safe.

hucker75 commented 9 months ago

In future I'll just uninstall/reinstall Folding after the clone. That seems to have fixed it this time.

Oddly the previous Beta didn't mind.

kbernhagen commented 9 months ago

The previous beta did not have accounts.

hucker75 commented 9 months ago

I thought it still would have confused things by asking for a work unit when the server thought it already had one, as it thinks it's the same machine. Or did the server assume I had twice as many CPU cores and GPUs as I did?

One of the two clones is now not showing up on the other machines' list on web control. Not sure why. I'll try reinstalling and making sure the items you said above have been deleted. Even telling the uninstaller to delete "data", it remembered my account, or was that the browser?

I don't understand why the reinstall worked then failed by the next day.

kbernhagen commented 9 months ago

The browser stores your login.

I don’t know how the ID was used previously, or how servers used it.

hucker75 commented 9 months ago

I just rebooted both machines, and now they both appear. I don't like intermittent problems.

kbernhagen commented 9 months ago

Agreed.

There has been a problem with clients disconnecting from the node. It seems to mostly be on Windows around zero hours UTC. The cause has not been tracked down yet. A client restart is sufficient to reconnect.

hucker75 commented 9 months ago

So it's just coincidence it was one of the cloned machines? I have 6 other machines which aren't cloned, and they haven't disconnected.

I did try restarting both Folding clients, but this didn't fix it; I had to restart Windows.

kbernhagen commented 9 months ago

Interesting. Joseph might know if it's more than coincidence.

hucker75 commented 9 months ago

And last night two different machines did it, while the cloned pair stayed on ok. Looks like it's nothing to do with the type of machine.

To get them back on, logging out and in on the machine which had disappeared didn't help; I had to exit and relaunch Folding.

hucker75 commented 9 months ago

> Cloning is a bad idea in general.

It's a common thing for those of us with many machines. Folding should be able to spot there are two identical machines, and get them to regenerate an ID.

kbernhagen commented 9 months ago

> > Cloning is a bad idea in general.

> It's a common thing for those of us with many machines. Folding should be able to spot there are two identical machines, and get them to regenerate an ID.

I was speaking of just cloning the fah data directory.

For a client to know it has a duplicate id, it would probably need to also use some hardware id. I don't think @jcoffland wants to do that.

Some people have the data directory on a flash drive, and use it on different machines, one at a time. Any duplicate detection scheme would have to account for that. Ideas are welcome.

hucker75 commented 9 months ago

I think the detection should come from the server when two machines connect at once claiming to be the same one.

kbernhagen commented 9 months ago

Ah, so just a collision detection? Ask duplicate clients to regenerate their id?
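The collision detection idea can be sketched as a small server-side registry. This is only an illustration of the proposal in this thread, not how the F@H node works: the `CollisionDetector` class, its method names, and the notion of a per-connection `connection_id` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CollisionDetector:
    """Hypothetical registry of which connection currently holds each client ID."""

    # Maps client_id -> connection_id of the connection that currently owns it.
    active: Dict[str, str] = field(default_factory=dict)

    def connect(self, client_id: str, connection_id: str) -> bool:
        """Return True if accepted, False if another live connection holds this ID."""
        holder = self.active.get(client_id)
        if holder is not None and holder != connection_id:
            # Two machines claim the same ID at once: a server could now ask
            # the newcomer (or both) to regenerate its ID.
            return False
        self.active[client_id] = connection_id
        return True

    def disconnect(self, client_id: str, connection_id: str) -> None:
        """Release the ID, but only if this connection is the one holding it."""
        if self.active.get(client_id) == connection_id:
            del self.active[client_id]
```

Because only simultaneous connections count as a collision, a data directory carried on a flash drive and used on one machine at a time would not trip this check.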

kbernhagen commented 9 months ago

Looks like the client id is sent when requesting work. I don't know if the work upload client id needs to match.

hucker75 commented 9 months ago

Yes, collision detection. The current alternative of uninstalling and reinstalling destroys the current job anyway.

kbernhagen commented 9 months ago

Collision detection of machine name might also be useful.

jcoffland commented 9 months ago

A client's F@H ID is computed from its RSA key pair, specifically from its public key. If you copy the client DB then the copy will have the same ID and appear to be the same machine on the F@H network.
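To illustrate why a copied client DB yields the same ID, here is a minimal sketch. The exact derivation is not given in this thread, so the hash (SHA-256) and encoding (URL-safe base64) are assumptions, as is the function name; the byte strings are stand-ins for real encoded public keys.

```python
import base64
import hashlib


def client_id_from_public_key(public_key_bytes: bytes) -> str:
    """Assumed derivation: URL-safe base64 of SHA-256 over the public key bytes."""
    digest = hashlib.sha256(public_key_bytes).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Stand-in bytes; a real client would use its encoded RSA public key.
original_key = b"stand-in public key of the original install"
cloned_key = bytes(original_key)  # what a copied client.db would carry
fresh_key = b"stand-in public key of a fresh install"

# Any deterministic function of the key gives the clone the same ID.
same = client_id_from_public_key(original_key) == client_id_from_public_key(cloned_key)
different = client_id_from_public_key(original_key) != client_id_from_public_key(fresh_key)
```

Whatever the real derivation is, it is a pure function of the key, so the only way for a clone to get a distinct ID is to get a distinct key.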

This could be a problem for others since cloning VMs or Docker instances is quite common. One solution is to make sure to delete /var/lib/fah-client/client.db before cloning a F@H instance. Another possibility would be for the client to try to detect the machine ID and regenerate its ID if it detects a change in the ID of the machine it's on.

The difficulty with the latter solution is that Linux, Windows and macOS all have different methods for acquiring a unique machine ID.

Linux

We could use /etc/machine-id. However, if you clone a Linux machine you could also copy this file but then maybe it's the user's fault.

Windows

Maybe HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid.

macOS

Maybe ioreg -ad2 -c IOPlatformExpertDevice | plutil -extract IORegistryEntryChildren.0.IOPlatformUUID raw -
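The three candidate sources above could be wrapped in one lookup. The following is a rough sketch rather than the client's actual code: all function names are made up for illustration, and the macOS branch parses the `ioreg` plist output directly instead of piping it through `plutil`.

```python
import platform
import plistlib
import subprocess
from pathlib import Path
from typing import Optional


def read_linux_machine_id(path: str = "/etc/machine-id") -> Optional[str]:
    """Return the contents of /etc/machine-id, or None if missing or empty."""
    try:
        text = Path(path).read_text(encoding="ascii").strip()
    except (OSError, UnicodeDecodeError):
        return None
    return text or None  # an empty file means the ID is not initialised yet


def read_windows_machine_guid() -> Optional[str]:
    """Return HKLM\\SOFTWARE\\Microsoft\\Cryptography\\MachineGuid, or None."""
    try:
        import winreg  # only available on Windows

        access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY
        with winreg.OpenKey(
            winreg.HKEY_LOCAL_MACHINE, r"SOFTWARE\Microsoft\Cryptography", 0, access
        ) as key:
            value, _value_type = winreg.QueryValueEx(key, "MachineGuid")
        return str(value) or None
    except (ImportError, OSError):
        return None


def read_macos_platform_uuid() -> Optional[str]:
    """Return IOPlatformUUID from ioreg's plist output, or None."""
    try:
        output = subprocess.run(
            ["ioreg", "-ad2", "-c", "IOPlatformExpertDevice"],
            capture_output=True, check=True, timeout=10,
        ).stdout
        tree = plistlib.loads(output)
        return str(tree["IORegistryEntryChildren"][0]["IOPlatformUUID"])
    except Exception:  # missing tool, bad plist, or unexpected structure
        return None


def get_machine_id() -> Optional[str]:
    """Pick the reader for the current OS; None if no ID can be determined."""
    system = platform.system()
    if system == "Linux":
        return read_linux_machine_id()
    if system == "Windows":
        return read_windows_machine_guid()
    if system == "Darwin":
        return read_macos_platform_uuid()
    return None
```

Returning `None` rather than raising lets a caller decide what to do when no machine ID is available, for example on an image where `/etc/machine-id` has been emptied before first boot.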

marcosfrm commented 9 months ago

> Linux
>
> We could use /etc/machine-id. However, if you clone a Linux machine you could also copy this file but then maybe it's the user's fault.

Different machines with the same machine-id are in error:

https://man7.org/linux/man-pages/man5/machine-id.5.html

  For operating system images which are created once and used on
  multiple machines, for example for containers or in the cloud,
  /etc/machine-id should be either missing or an empty file in the
  generic file system image (the difference between the two options
  is described under "First Boot Semantics" below). An ID will be
  generated during boot and saved to this file if possible. [...]

Before booting into the cloned system, truncate /etc/machine-id (truncate -s 0 …) and a new one will be automatically generated next time.

hucker75 commented 9 months ago

I assume cloning software knows this and removes the file? Mind you it doesn't bother removing the Windows network name, so maybe not. Until you run both at once, Windows doesn't do anything about it. These should both be a basic part of the cloning software making the clone.

marcosfrm commented 9 months ago

> I assume cloning software knows this and removes the file?

I have no idea.

Running this once the cloned system has booted will probably work too:

sudo truncate -s 0 /etc/machine-id
sudo reboot

(I recommend truncating rather than removing it to preserve extended attributes like SELinux context)

hucker75 commented 9 months ago

Sudo :-) Don't we all log in as root? Makes things so much easier.

hucker75 commented 9 months ago

> Agreed.
>
> There has been a problem with clients disconnecting from the node. It seems to mostly be on Windows around zero hours UTC. The cause has not been tracked down yet. A client restart is sufficient to reconnect.

Just saw it happen at midnight UTC. Everything disconnected, then all but one reconnected. Some kinda reset is going on.

jackschmidt commented 9 months ago

-- never mind -- I think the client.db key is only used to link the machine to the account via the node server. No big deal if it is lost, I think.

jcoffland commented 9 months ago

> I think the new 8.3.6 code might have a surprise effect that causes a work unit to be lost/dumped.
>
> I'm assuming that turning in a work unit requires having access to the private key used to request it.
>
> If someone uses 8.3.6, gets a new work unit, and then uses that same client.db on another 8.3.6, then the old RSA key used to turn in the work unit will be deleted.

If you cause the client to change keys then it will have a different ID and will not be able to continue any preexisting WUs. This is intentional. Say for example, you copy a fah-client install to a new machine. The client will detect that it's on a new machine and generate a new key. Then it will discard any WUs that were from the old client ID. The original copy can continue to run as normal.
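The behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the client's real data model: `ClientState`, `WorkUnit`, `reconcile` and `new_client_id` are hypothetical names, and a random token stands in for generating a new RSA key and deriving an ID from it.

```python
import secrets
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkUnit:
    wu_id: str
    client_id: str  # ID of the client that requested this WU


@dataclass
class ClientState:
    machine_id: str  # machine ID recorded when the key was generated
    client_id: str
    work_units: List[WorkUnit] = field(default_factory=list)


def new_client_id() -> str:
    """Stand-in for generating a fresh RSA key and deriving an ID from it."""
    return secrets.token_urlsafe(32)


def reconcile(state: ClientState, current_machine_id: str) -> bool:
    """Regenerate the ID if the machine changed; return True when that happened."""
    if state.machine_id == current_machine_id:
        return False  # same machine: keep key, ID and work units
    state.machine_id = current_machine_id
    state.client_id = new_client_id()
    # WUs requested under the old ID cannot be continued, so drop them.
    state.work_units = [
        wu for wu in state.work_units if wu.client_id == state.client_id
    ]
    return True
```

In this sketch the original install is untouched: only the copy, which sees a different machine ID, gets a new ID and discards the inherited WUs.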

I suppose the downside is that you cannot move a client to a new machine and finish its WUs there, but this is just not a scenario we support. If you really wanted to do it, you could modify the machine-id in the client.db to match the new machine. This would prevent the client from generating a new key.

jcoffland commented 6 months ago

Should be fixed at least as of v8.3.16.