Azure / azurehpc

This repository provides easy automation scripts for building an HPC environment in Azure. It also includes examples to build end-to-end environments and run some of the key HPC benchmarks and applications.

Scaling out BeeGFS #121

Closed lmiroslaw closed 4 years ago

lmiroslaw commented 4 years ago

How can I add the new storage or metaserver to the cluster?

I have tried to follow the official documentation here, e.g. scaling out the compute/beegfssm VMSSs and restarting the services, but the nodes are still not recognized by the BeeGFS manager.

Am I missing something?
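(For reference, a minimal sketch of the registration steps a new node usually needs before the management daemon recognizes it, assuming the standard BeeGFS packages; the azurehpc scripts may already perform parts of this, and the paths and IDs below are just example values.)

# On a new metadata node: point the daemon at the management host (beegfsm) and start it
sudo /opt/beegfs/sbin/beegfs-setup-meta -p /data/beegfs/meta -s 7 -m beegfsm        # -s: example numeric node ID
sudo systemctl start beegfs-meta

# On a new storage node: register a storage target the same way
sudo /opt/beegfs/sbin/beegfs-setup-storage -p /data/beegfs/storage -s 7 -i 701 -m beegfsm   # -i: example target ID
sudo systemctl start beegfs-storage

# On the management node: new servers are only accepted if this setting allows it
grep sysAllowNewServers /etc/beegfs/beegfs-mgmtd.conf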

garvct commented 4 years ago

The following procedure worked for me.

[hpcadmin@beegfsm beegfs]$ beegfs-check-servers

Management

beegfsm [ID: 1]: reachable at 10.34.4.14:8008 (protocol: TCP)

Metadata

beegfa57e000000 [ID: 1]: reachable at 10.34.4.4:8005 (protocol: TCP)
beegfa57e000004 [ID: 2]: reachable at 10.34.4.8:8005 (protocol: TCP)
beegfa57e000003 [ID: 3]: reachable at 10.34.4.7:8005 (protocol: TCP)
beegfa57e000001 [ID: 4]: reachable at 10.34.4.5:8005 (protocol: TCP)
beegfa57e000006 [ID: 5]: reachable at 10.34.4.12:8005 (protocol: TCP)
beegfa57e000005 [ID: 6]: reachable at 10.34.4.6:8005 (protocol: TCP)

Storage

beegfa57e000001 [ID: 1]: reachable at 10.34.4.5:8003 (protocol: TCP)
beegfa57e000003 [ID: 2]: reachable at 10.34.4.7:8003 (protocol: TCP)
beegfa57e000004 [ID: 3]: reachable at 10.34.4.8:8003 (protocol: TCP)
beegfa57e000000 [ID: 4]: reachable at 10.34.4.4:8003 (protocol: TCP)
beegfa57e000006 [ID: 5]: reachable at 10.34.4.12:8003 (protocol: TCP)
beegfa57e000005 [ID: 6]: reachable at 10.34.4.6:8003 (protocol: TCP)

We can see that 2 extra storage and metadata servers have been added.
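(For reference, a couple of other ways to confirm the new nodes from the management side; these commands only read state and can be run at any time.)

# List the metadata and storage nodes the management daemon currently knows about
beegfs-ctl --listnodes --nodetype=meta
beegfs-ctl --listnodes --nodetype=storage

# Show free space per metadata/storage target; newly joined targets appear here too
beegfs-df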

lmiroslaw commented 4 years ago

It worked, thanks. However, it is strange that I don't see a performance improvement when doubling the size of the cluster. I am testing the performance by copying a 24 GB folder between two locations with time cp -R sim sim3. The folder contains ca. 120 directories, each with several files in the MB range (2.2 MB, 119 MB, 47 MB).

For the small and the bigger cluster I get practically the same result:

real 2m26.809s  user 0m0.461s  sys 0m29.615s
vs.
real 2m32.859s  user 0m0.440s  sys 0m28.253s

I/O pattern: 55k reads and 50k writes, together accounting for about 90% of the execution time.

I also tried changing the chunk size with beegfs-ctl --setpattern --chunksize=1m --numtargets=8 /beegfs/chunksize_1m_4t, testing chunk sizes of 1m, 64k and 4m with 8, 1 and 8 targets, respectively.

This did not affect the results much.
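(One thing worth noting for these tests: a stripe pattern set on a directory only applies to files created in it afterwards, so the data has to be copied in after beegfs-ctl --setpattern has run. A sketch of the three test directories, with placeholder names:)

# Create one directory per pattern and apply the stripe settings before copying data in
mkdir -p /beegfs/chunksize_1m_8t /beegfs/chunksize_64k_1t /beegfs/chunksize_4m_8t
beegfs-ctl --setpattern --chunksize=1m --numtargets=8 /beegfs/chunksize_1m_8t
beegfs-ctl --setpattern --chunksize=64k --numtargets=1 /beegfs/chunksize_64k_1t
beegfs-ctl --setpattern --chunksize=4m --numtargets=8 /beegfs/chunksize_4m_8t

# Confirm the pattern that new files in a directory will get
beegfs-ctl --getentryinfo /beegfs/chunksize_1m_8t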

garvct commented 4 years ago

Have you tried multiple cp's, maybe with each cp going to a different target? You may also need to determine whether the source data is striped across 4 storage targets or more, and whether reading or writing is what slows the performance.
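(For reference, a sketch of how both points could be checked; the paths are placeholders, and the tar pipe is just a crude way to time reads without writing back into BeeGFS.)

# Which targets is the existing source data striped across?
beegfs-ctl --getentryinfo /beegfs/sim/some_file --verbose

# Rough read-only timing: stream the tree out of BeeGFS and discard the data
# (piping through cat avoids GNU tar's /dev/null shortcut that would skip reading file contents)
time tar cf - -C /beegfs sim | cat > /dev/null

# Rough write-only timing: copy a tree that already sits on local disk into BeeGFS
time cp -R /mnt/local/sim /beegfs/sim_writetest   # /mnt/local is a placeholder local path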

garvct commented 4 years ago

Try to maximize the number of disks working on the I/O operation. beegfs-df can help you see which disks/targets are active.
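(For reference, a sketch of watching target activity while a copy runs; the --serverstats mode and its flags are assumed to be available in this beegfs-ctl version.)

# Refresh per-target capacity/usage every 2 seconds while the copy is running
watch -n 2 beegfs-df

# Live per-server I/O statistics from the storage servers
beegfs-ctl --serverstats --nodetype=storage --interval=2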

lmiroslaw commented 4 years ago

First feedback: this is my first attempt at parallelizing the cp operation:

# copy each processorN subdirectory in the background, in parallel
# (brace expansion does not work with a variable, so use seq)
for i in $(seq 0 $N)
do
  mkdir -p "$destination/processor$i"
  cp -r "$sourcedir/processor$i"/* "$destination/processor$i" &
done
wait # wait for all background cp processes to finish

With this code I was able to reduce the copying time from 1m41s to 58s. Now I will test the same code after doubling the size of the cluster.
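(A possible variation on the same idea, fanning out one cp per processor* subdirectory with xargs instead of hardcoding N; the directory names and parallelism level are placeholders.)

# Run up to 8 cp processes in parallel, one per processor* subdirectory
find "$sourcedir" -mindepth 1 -maxdepth 1 -type d -name 'processor*' -printf '%f\n' \
  | xargs -P 8 -I{} cp -r "$sourcedir/{}" "$destination/{}"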

garvct commented 4 years ago

Closed