Azure / azurehpc

This repository provides easy automation scripts for building an HPC environment in Azure. It also includes examples to build end-to-end environments and run some of the key HPC benchmarks and applications.

Scaling out BeeGFS #121

Closed lmiroslaw closed 4 years ago

lmiroslaw commented 4 years ago

How can I add the new storage or metaserver to the cluster?

I have tried to follow the official documentation here, e.g. scaling out the compute/beegfssm VMSSs and restarting the services, but the nodes are still not recognized by the BeeGFS manager.

Am I missing something?
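(For reference, a minimal sketch of the registration steps a new node usually needs before the management daemon recognizes it, assuming the standard BeeGFS packages; the azurehpc scripts may already perform parts of this, and the paths and IDs below are just example values.)

# On a new metadata node: point the daemon at the management host (beegfsm) and start it
sudo /opt/beegfs/sbin/beegfs-setup-meta -p /data/beegfs/meta -s 7 -m beegfsm        # -s: example numeric node ID
sudo systemctl start beegfs-meta

# On a new storage node: register a storage target the same way
sudo /opt/beegfs/sbin/beegfs-setup-storage -p /data/beegfs/storage -s 7 -i 701 -m beegfsm   # -i: example target ID
sudo systemctl start beegfs-storage

# On the management node: new servers are only accepted if this setting allows it
grep sysAllowNewServers /etc/beegfs/beegfs-mgmtd.conf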

garvct commented 4 years ago

The following procedure worked for me.

[hpcadmin@beegfsm beegfs]$ beegfs-check-servers

Management

beegfsm [ID: 1]: reachable at 10.34.4.14:8008 (protocol: TCP)

Metadata

beegfa57e000000 [ID: 1]: reachable at 10.34.4.4:8005 (protocol: TCP)
beegfa57e000004 [ID: 2]: reachable at 10.34.4.8:8005 (protocol: TCP)
beegfa57e000003 [ID: 3]: reachable at 10.34.4.7:8005 (protocol: TCP)
beegfa57e000001 [ID: 4]: reachable at 10.34.4.5:8005 (protocol: TCP)
beegfa57e000006 [ID: 5]: reachable at 10.34.4.12:8005 (protocol: TCP)
beegfa57e000005 [ID: 6]: reachable at 10.34.4.6:8005 (protocol: TCP)

Storage

beegfa57e000001 [ID: 1]: reachable at 10.34.4.5:8003 (protocol: TCP)
beegfa57e000003 [ID: 2]: reachable at 10.34.4.7:8003 (protocol: TCP)
beegfa57e000004 [ID: 3]: reachable at 10.34.4.8:8003 (protocol: TCP)
beegfa57e000000 [ID: 4]: reachable at 10.34.4.4:8003 (protocol: TCP)
beegfa57e000006 [ID: 5]: reachable at 10.34.4.12:8003 (protocol: TCP)
beegfa57e000005 [ID: 6]: reachable at 10.34.4.6:8003 (protocol: TCP)

We can see that 2 extra storage and metadata servers have been added.
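(For reference, a couple of other ways to confirm the new nodes from the management side; these commands only read state and can be run at any time.)

# List the metadata and storage nodes the management daemon currently knows about
beegfs-ctl --listnodes --nodetype=meta
beegfs-ctl --listnodes --nodetype=storage

# Show free space per metadata/storage target; newly joined targets appear here too
beegfs-df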

lmiroslaw commented 4 years ago

It worked, thanks. However, it is strange that I don't see a performance improvement when doubling the size of the cluster. I am testing the performance by copying a 24 GB folder between two locations with time cp -R sim sim3. The folder contains ca. 120 directories, each with several files in the MB range (2.2 MB, 119 MB, 47 MB).

For the small and the bigger cluster I get practically the same result:

real 2m26.809s  user 0m0.461s  sys 0m29.615s
vs.
real 2m32.859s  user 0m0.440s  sys 0m28.253s

I/O pattern: 55k reads and 50k writes, together accounting for about 90% of the execution time.

I also tried changing the chunk size with beegfs-ctl --setpattern --chunksize=1m --numtargets=8 /beegfs/chunksize_1m_4t, testing chunk sizes of 1m, 64k and 4m with 8, 1 and 8 targets, respectively.

This did not affect the results much.
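(One thing worth noting for these tests: a stripe pattern set on a directory only applies to files created in it afterwards, so the data has to be copied in after beegfs-ctl --setpattern has run. A sketch of the three test directories, with placeholder names:)

# Create one directory per pattern and apply the stripe settings before copying data in
mkdir -p /beegfs/chunksize_1m_8t /beegfs/chunksize_64k_1t /beegfs/chunksize_4m_8t
beegfs-ctl --setpattern --chunksize=1m --numtargets=8 /beegfs/chunksize_1m_8t
beegfs-ctl --setpattern --chunksize=64k --numtargets=1 /beegfs/chunksize_64k_1t
beegfs-ctl --setpattern --chunksize=4m --numtargets=8 /beegfs/chunksize_4m_8t

# Confirm the pattern that new files in a directory will get
beegfs-ctl --getentryinfo /beegfs/chunksize_1m_8t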

garvct commented 4 years ago

Have you tried multiple cp's, maybe with each cp going to a different target? You may also need to determine whether the source data is striped across 4 storage targets or more, and whether reading or writing is what slows the performance.
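(For reference, a sketch of how both points could be checked; the paths are placeholders, and the tar pipe is just a crude way to time reads without writing back into BeeGFS.)

# Which targets is the existing source data striped across?
beegfs-ctl --getentryinfo /beegfs/sim/some_file --verbose

# Rough read-only timing: stream the tree out of BeeGFS and discard the data
# (piping through cat avoids GNU tar's /dev/null shortcut that would skip reading file contents)
time tar cf - -C /beegfs sim | cat > /dev/null

# Rough write-only timing: copy a tree that already sits on local disk into BeeGFS
time cp -R /mnt/local/sim /beegfs/sim_writetest   # /mnt/local is a placeholder local path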

garvct commented 4 years ago

Try to maximize the number of disks working on the I/O operation. beegfs-df can help you see which disks/targets are active.
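(For reference, a sketch of watching target activity while a copy runs; the --serverstats mode and its flags are assumed to be available in this beegfs-ctl version.)

# Refresh per-target capacity/usage every 2 seconds while the copy is running
watch -n 2 beegfs-df

# Live per-server I/O statistics from the storage servers
beegfs-ctl --serverstats --nodetype=storage --interval=2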

lmiroslaw commented 4 years ago

First feedback: this is my first attempt at parallelizing the cp operation:

# copy each processorN subdirectory in the background, in parallel
# (brace expansion does not work with a variable, so use seq)
for i in $(seq 0 $N)
do
  mkdir -p "$destination/processor$i"
  cp -r "$sourcedir/processor$i"/* "$destination/processor$i" &
done
wait # wait for all background cp processes to finish

With this code I was able to reduce the copying time from 1m41s to 58s. Now I will test the same code after doubling the size of the cluster.
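(A possible variation on the same idea, fanning out one cp per processor* subdirectory with xargs instead of hardcoding N; the directory names and parallelism level are placeholders.)

# Run up to 8 cp processes in parallel, one per processor* subdirectory
find "$sourcedir" -mindepth 1 -maxdepth 1 -type d -name 'processor*' -printf '%f\n' \
  | xargs -P 8 -I{} cp -r "$sourcedir/{}" "$destination/{}"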

garvct commented 4 years ago

Closed