Closed — martin-ueding closed this issue 4 years ago
The correlator files for each configuration have the following size:
```
$ du -csm *_cnfg0240.h5
1    C2c_cnfg0240.h5
8    C4cC_cnfg0240.h5
8    C4cD_cnfg0240.h5
133  C6cC_cnfg0240.h5
192  C6cCD_cnfg0240.h5
67   C6cD_cnfg0240.h5
406  total
```
So that is 406 MB in total.
The largest prescription files are 9 MB in size. I just merged all of them into one big file of 106 MB. R apparently needs 627 MB to keep that in memory, so doing the projection of all moving frames and irreps at the same time might take a few GB of memory. I will try to launch such a job.
I have tried this, and 2000 MB of memory are not enough. So I rolled this back and tried to copy the files to the local HDD at the beginning of the job script. This way there is less load on the network and more on the local disk. This should scale a bit better.
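The staging step could look roughly like the following sketch. The directory names are placeholders (here backed by temporary directories so the example is self-contained); the real job script would copy from the NFS share to the node-local scratch path instead.

```shell
# Sketch: stage correlator files to node-local disk at job start,
# so the projection reads from the local HDD instead of NFS.
NETWORK_DIR=$(mktemp -d)   # placeholder for the NFS directory
SCRATCH=$(mktemp -d)       # placeholder for the node-local scratch

# Pretend two correlator files live on the network share.
touch "$NETWORK_DIR/C2c_cnfg0240.h5" "$NETWORK_DIR/C4cC_cnfg0240.h5"

# One bulk copy at job start: a few large sequential reads over the
# network instead of many scattered ones during the computation.
cp "$NETWORK_DIR"/*_cnfg0240.h5 "$SCRATCH/"

# The projection would then work entirely inside "$SCRATCH".
ls "$SCRATCH"
```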
It turns out that there still is a lot of contention on the disk, as can be seen by many of my jobs being in the uninterruptible disk-sleep state `D`. They are fine on the network, but on the disk there is tremendous activity.
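The stuck processes can be spotted with plain `ps`; anything in uninterruptible disk sleep shows up with state code `D` (nothing project-specific is assumed here):

```shell
# List all processes currently in uninterruptible disk sleep (state D).
# On an idle machine this prints nothing; under heavy I/O contention
# the blocked jobs appear here.
ps -eo pid,state,comm | awk '$2 == "D" { print }'
```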
Perhaps with a single job that would not be that bad, but with that many jobs there is a lot of latency.
So perhaps just doing one job per node would be a solution.
I've reconfigured my SLURM cluster to have a GRES called `disk`, and each of my jobs consumes one unit of it. This way there is just a single one of them per node. Together with the caching to the local HDD this should work somewhat well.
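The GRES setup might look like this sketch; the node names and counts are site-specific assumptions, and `disk` is just the name chosen here:

```
# slurm.conf: declare the custom GRES and give each node one unit of it
GresTypes=disk
NodeName=node[01-16] Gres=disk:1

# gres.conf on each node
Name=disk Count=1

# job script: since each node offers only one unit, requesting it
# serializes jobs per node
#SBATCH --gres=disk:1
```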
On my laptop with its SSD the balance is quite a bit different …
I'll stick with single jobs and caching to the local HDD. Also, I now skip the parts that have already been projected. By re-running the failed parts often enough I finally got a result. Something in this process is unstable, and to me the NFS seems to be the weakest link.
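Skipping finished work can be as simple as testing for the output file before launching the projection. A self-contained sketch (the file names and the `touch` standing in for the actual projection are made up):

```shell
# Sketch: only project configurations whose output does not exist yet,
# so re-running the job after a failure resumes where it left off.
workdir=$(mktemp -d)
cd "$workdir"

touch input_cnfg0240.h5 input_cnfg0241.h5   # fake inputs
touch projected_cnfg0240.h5                 # cnfg0240 is already done

for input in input_cnfg*.h5; do
    output="projected_${input#input_}"
    if [ -e "$output" ]; then
        echo "skipping $input"
        continue
    fi
    echo "projecting $input"
    touch "$output"   # stands in for the actual projection run
done
```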
Previously I had projected one frame and irrep at a time per configuration, generating one file each. This, however, leads to re-loading the HDF5 files around 150 times, once per combination of moving frame and irrep. That puts a very high load on the file system, so little progress is made in the actual computation. It therefore makes sense to merge the prescriptions and do everything in one full go.
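A rough estimate of the saving, assuming each of the ~150 passes had to read the full ~406 MB of correlator files per configuration (numbers quoted above):

```shell
# Rough NFS read traffic per configuration when re-loading each time.
reloads=150    # ≈ number of moving-frame/irrep combinations
size_mb=406    # total correlator file size per configuration, in MB
echo "$(( reloads * size_mb / 1024 )) GB"   # vs. a single ~406 MB read
```

That is on the order of 60 GB per configuration over NFS, compared to reading the data once.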