HISKP-LQCD / sLapH-projection-NG


Project one configuration in one go #29

Closed. martin-ueding closed this issue 4 years ago.

martin-ueding commented 4 years ago

Previously I had projected one moving frame and irrep at a time per configuration, generating one file each. This, however, leads to re-loading the HDF5 files around 150 times, once for each combination of moving frame and irrep. That puts a very high load on the file system, and consequently there is little progress in the actual computation. It therefore makes sense to merge the prescriptions and do the whole projection in one go.
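A rough sketch of the change in the driver logic (the script names and helpers are purely illustrative, not the actual interface of sLapH-projection-NG):

# Old scheme: one run per (moving frame, irrep) pair, so the correlator HDF5
# files for the configuration get re-read roughly 150 times.
for frame in "${frames[@]}"; do
    for irrep in "${irreps[@]}"; do
        run_projection "$cnfg" "$frame" "$irrep"    # hypothetical helper
    done
done

# New scheme: read the HDF5 files once and apply the merged prescription for
# all moving frames and irreps in a single run.
run_projection_all "$cnfg"                          # hypothetical helper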

martin-ueding commented 4 years ago

The correlator files for each configuration have the following size:

$ du -csm *_cnfg0240.h5
1       C2c_cnfg0240.h5
8       C4cC_cnfg0240.h5
8       C4cD_cnfg0240.h5
133     C6cC_cnfg0240.h5
192     C6cCD_cnfg0240.h5
67      C6cD_cnfg0240.h5
406     total

So that is 406 MB in total.

The largest prescriptions are 9 MB in size. I just merged all of them into one big prescription with a file size of 106 MB; R apparently needs 627 MB to keep it in memory. Projecting all moving frames and irreps at the same time might therefore take a few GB of memory. I will try to launch such a job.

martin-ueding commented 4 years ago

I have tried this, and 2000 MB of memory are not enough. So I rolled this back and instead copy the files to the local HDD at the beginning of the job script. This way there is less load on the network and more on the local disk, which should scale a bit better.
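A minimal sketch of that staging step, assuming the node-local scratch directory is reachable via $TMPDIR and using illustrative paths:

# Stage the per-configuration correlator files to the node-local disk once.
input_dir=/nfs/path/to/correlators                  # illustrative NFS location
local_dir="${TMPDIR:-/tmp}/corr_${SLURM_JOB_ID}"
mkdir -p "$local_dir"
cp "$input_dir"/*_cnfg0240.h5 "$local_dir/"

# ... run the projection against the local copies instead of the NFS mount ...

# Clean up the scratch space at the end of the job.
rm -rf "$local_dir"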

It turns out that there is still a lot of contention on the disk, as can be seen from the many jobs of mine that sit in uninterruptible disk sleep (state D):

[Screenshot: Bildschirmfoto_005]

They are fine on the network but on the disk there is tremendous activity:

[Screenshot: Bildschirmfoto_006]

Perhaps with a single job that would not be so bad, but with that many jobs there is a lot of latency.

So perhaps just doing one job per node would be a solution.
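For reference, one way to spot this from a shell is to list the processes in uninterruptible disk sleep and to watch the per-device I/O statistics (iostat is part of the sysstat package):

# Processes whose state starts with D are blocked in uninterruptible (disk) sleep.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'

# Extended device utilization, refreshed every 5 seconds.
iostat -x 5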

martin-ueding commented 4 years ago

I've reconfigured my Slurm installation to have a GRES called disk, and each of my jobs consumes one. This way only a single such job runs per node. Together with the caching to the local HDD this should work reasonably well.
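Roughly, this amounts to declaring a count-only GRES named disk, giving each node one unit of it, and having every job request that unit; the node names below are placeholders and the exact syntax may differ between Slurm versions.

# slurm.conf: declare the GRES type and give each node a single unit of it.
GresTypes=disk
NodeName=node[01-04] Gres=disk:1

# Job script: request the one disk unit, so only one such job runs per node.
#SBATCH --gres=disk:1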

On my laptop with its SSD the balance is quite a bit different …

martin-ueding commented 4 years ago

I'll stick with single jobs and caching to the local HDD. I also now skip configurations that have already been projected. By re-running the failed parts often enough I finally got a result. Something in this process seems to be unstable, and to me the NFS is the weakest link.
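The skip logic can be as simple as checking for the projected output before doing any work (the output file name here is hypothetical):

cnfg=0240
output="projected_cnfg${cnfg}.h5"            # hypothetical output name

if [[ -e "$output" ]]; then
    echo "Configuration ${cnfg} is already projected, skipping."
    exit 0
fi

# ... otherwise stage the files to the local disk and run the projection ...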