NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

FATAL ERROR: "172.17.0.2:51081" is in use (duplicate or overlapping run?) #140

Closed HeinrichTremblay closed 11 months ago

HeinrichTremblay commented 1 year ago

Error Message

E 14:12:39.182323 target:296 FATAL ERROR: t[XIbjcKDg]: "172.17.0.2:51081" is in use (duplicate or overlapping run?)
FATAL ERROR: t[XIbjcKDg]: "172.17.0.2:51081" is in use (duplicate or overlapping run?)

Context

I initially deployed aistore with the Docker image successfully, following the docs. The fatal error appeared after I restarted my machine and ran the Docker image again to start the cluster (since the container was no longer running). Here is the docker run command:

docker run -d \
  -p 51080:51080 \
  -v /mnt/disk0:/ais/disk0 \
  -v /mnt/disk1:/ais/disk1 \
  -v /mnt/disk2:/ais/disk2 \
  aistorage/cluster-minimal:latest
alex-aizman commented 1 year ago

You say: "restarted my machine." It'd be interesting to find out why exactly lsof reports that somebody's still listening on 172.17.0.2:51081 after restart.

compiaffe commented 1 year ago

I see the exact same problem.

I checked with lsof -sTCP:LISTEN -i tcp@localhost:51080 and lsof -sTCP:LISTEN -i tcp@localhost:51081 before starting; neither reports anything.

However, I can successfully start ais if I reformat the partition prior to starting docker.

HeinrichTremblay commented 1 year ago

I also checked with lsof -sTCP:LISTEN -i tcp@localhost:51080 and lsof -sTCP:LISTEN -i tcp@localhost:51081 and got no output.

I inspected the source code for the check that triggers the error message and found the checkRestarted function, which checks for persistent markers.

func (t *target) checkRestarted() (fatalErr, writeErr error) {
    if fs.MarkerExists(fname.NodeRestartedMarker) {
        // NOTE the risk: duplicate aisnode run - which'll fail shortly with "bind:
        // address already in use" but not before triggering (`NodeRestartedPrev` => GFN)
        // sequence and stealing nlog symlinks - that's why we go extra length
        if _lsof(t.si.PubNet.TCPEndpoint()) {
            fatalErr = fmt.Errorf("%s: %q is in use (duplicate or overlapping run?)",
                t, t.si.PubNet.TCPEndpoint())
            return
        }

        t.statsT.Inc(stats.RestartCount)
        fs.PersistMarker(fname.NodeRestartedPrev)
    }
    fatalErr, writeErr = fs.PersistMarker(fname.NodeRestartedMarker)
    return
}
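A rough shell paraphrase of that flow may help illustrate it (this is not aistore code; the marker file name used here is an assumption, only the .ais.markers directory name comes from the thread):

```shell
# check_restarted: if the "node restarted" marker survived from a previous
# run, report an unclean restart; in any case, (re)persist the marker.
# It is only removed again by a graceful shutdown.
check_restarted() {
  mpath="$1"                                  # one mountpath, e.g. /ais/disk0
  marker="$mpath/.ais.markers/node_restarted" # file name is an assumption
  if [ -f "$marker" ]; then
    # marker survived from a previous run => unclean restart (or overlap)
    echo "unclean restart detected"
  fi
  mkdir -p "$mpath/.ais.markers"
  touch "$marker"
}
```

The real function goes further: when the marker exists, it also probes the target's public TCP endpoint (the `_lsof` call) and exits fatally if something is already listening there, which is the error reported above.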

I tried deleting .ais.markers directly on the mounted disks, and now starting the cluster from the Docker image works again; running ais show cluster confirms that the cluster is up as expected.
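That manual recovery can be sketched as a small helper (the mountpaths match the docker run command from the original post; note that deleting the marker discards the unclean-restart signal it exists to convey):

```shell
# clear_markers: remove the .ais.markers directory from each given mountpath
# before restarting the container. Caution: this throws away the evidence of
# an unclean shutdown.
clear_markers() {
  for d in "$@"; do
    rm -rf "$d/.ais.markers"
  done
}

# Example, using the mountpaths from the docker run command above:
# clear_markers /mnt/disk0 /mnt/disk1 /mnt/disk2
```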

alex-aizman commented 1 year ago

Of course. But that's illegal: the whole point of this persistent marker, and the reason for its existence, is to let us know that the node restarted without being properly shut down.

compiaffe commented 1 year ago

The error message is a little confusing in that case. Shouldn't the system automatically try to recover from such a condition?

In any case, good to know how to manually recover.

alex-aizman commented 1 year ago

the keyword is "overlapping run". Maybe there's a better way to express the fact that another instance of the ais storage target is running (and listening on the same local port), and that immediate exit seems to be the best remedy.

compiaffe commented 12 months ago

Yes, the overlapping run is clear as such. However, the user doesn't explicitly spin up a second target, nor is one running on the host machine prior to starting the docker container. It is clearly the cluster-minimal container that tries to spin up multiple overlapping targets.

So the question is why the cluster-minimal container spins up multiple targets, causing the error message shown above.

The only difference between successful runs and the ones exhibiting this behaviour is mounting a volume that was not properly shut down.

alex-aizman commented 12 months ago

I just can't reproduce it. Here's what I've done:

# 1. run it first time
#  `/tmp/cluster-minimal` here is just an arbitrary place where the container can write 
$ docker run -d -p 51080:51080 -v /tmp/cluster-minimal:/ais/disk0 aistorage/cluster-minimal:latest
# 2. use it somehow, this new cluster
$ AIS_ENDPOINT=http://localhost:51080 aisloader -bucket=ais://nnn -cleanup=false -totalputsize=50M -duration=0 -minsize=1MB -maxsize=1MB -numworkers=8 -pctput=100 -quiet

$ AIS_ENDPOINT=http://localhost:51080 ais ls --summary
# 3. shutdown
$ AIS_ENDPOINT=http://localhost:51080 ais cluster shutdown
# 4. restart
$ docker run -d -p 51080:51080 -v /tmp/cluster-minimal:/ais/disk0 aistorage/cluster-minimal:latest
# 5. Finally, see that it sees ais://nnn bucket and generally works
export AIS_ENDPOINT=http://localhost:51080
$ ais show cluster
$ ais ls --summary

# and so on

This is with aistore v3.19

compiaffe commented 11 months ago

The difference is that I hadn't run ais cluster shutdown but instead either restarted the machine or did a docker stop or docker compose down (depending on usage).
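Following the maintainer's reproduction above, a graceful stop might be wrapped like this (the endpoint and the container name are assumptions): shutting the cluster down first clears the restart marker, whereas a bare docker stop leaves it on the mounted disks.

```shell
# graceful_stop: shut the AIS cluster down via its public endpoint before
# stopping the container, so no "restarted" marker is left behind.
# Both arguments are assumptions, e.g. http://localhost:51080 and "ais".
graceful_stop() {
  AIS_ENDPOINT="$1" ais cluster shutdown
  docker stop "$2"
}

# Usage (hypothetical):
# graceful_stop http://localhost:51080 ais
```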

alex-aizman commented 11 months ago

https://github.com/NVIDIA/aistore/issues/140#issuecomment-1638401884

closing