galsci / pysm

PySM 3: Sky emission simulations for Cosmic Microwave Background experiments
https://pysm3.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Nside=4096 is prohibitively expensive #30

Closed: keskitalo closed this issue 5 years ago

keskitalo commented 5 years ago

Trying to run the LAT LF simulation at Nside=4096. There are 222 detectors. I can only allocate two MPI processes per KNL node without exhausting memory during the PySM call. I split the communicator into 256 groups, each with one node and two processes. That leaves two rank communicators of 256 MPI processes, each trying to simulate 111 detectors. The job did not complete in two hours. The medium- and high-frequency tubes have 10X as many detectors.
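
For reference, a minimal sketch of the communicator layout described above (the process counts and the splitting logic are illustrative, not the actual TOAST code):

```python
# Illustrative sketch: 512 processes on 256 KNL nodes, 2 MPI processes per node.
from mpi4py import MPI

world = MPI.COMM_WORLD                    # 512 processes in this example
procs_per_node = 2

# 256 group communicators, one node (2 processes) each
group_comm = world.Split(color=world.rank // procs_per_node, key=world.rank)

# 2 rank communicators with 256 processes each, splitting the 222 detectors
rank_comm = world.Split(color=world.rank % procs_per_node, key=world.rank)
```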

zonca commented 5 years ago

I can run some tests to check if this is expected. Do you have a feeling for how this compares to my benchmark at https://github.com/healpy/pysm/issues/25#issuecomment-500657096?

zonca commented 5 years ago

how many threads are you using per MPI process?

keskitalo commented 5 years ago

That comment indicates 40 MPI tasks processing a channel every 60 seconds. Naively assuming perfect scaling, 256 MPI tasks would do the same in about 10 seconds, or 111 detectors in less than 20 minutes. The KNL cores are slower, of course, but we have 32 OpenMP threads for each MPI task.
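
A back-of-the-envelope version of that estimate (assuming perfect strong scaling, which certainly won't hold on KNL):

```python
# Naive strong-scaling estimate from the issue #25 benchmark (perfect scaling assumed)
ref_tasks, ref_seconds = 40, 60                    # 40 MPI tasks, one channel per 60 s
tasks, detectors = 256, 111

per_channel = ref_seconds * ref_tasks / tasks      # ~9.4 s per channel
total_minutes = detectors * per_channel / 60       # ~17 min for 111 detectors
print(per_channel, total_minutes)
```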

zonca commented 5 years ago

I started with a scaled-down test on Popeye (48-core Intel Skylake), basically 1/10 of the job:

13 nodes, 2 MPI processes per node, 24 OpenMP/Numba threads per process

So just spreading the maps over 26 MPI processes keeps memory consumption at ~10 GB. Is it possible that there are memory issues in other parts of the PySM operator, e.g. in the map redistribution?

keskitalo commented 5 years ago

Thanks Andrea. This is really helpful. I cannot say whether the memory consumption is driven by PySM or by other parts of OpSimPySM. What happens if you try to run more than 2 processes per node?

memreport gives you the instantaneous memory usage but does not track the high-water mark. Unless you also call it during the PySM call, it won't be very helpful.
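
(As an aside, one way to read a per-process peak without instrumenting memreport is the kernel's own accounting; a minimal sketch, not part of TOAST:)

```python
# Sketch: per-process peak resident set size from the kernel's accounting.
# On Linux, ru_maxrss is reported in kilobytes.
import resource

def peak_rss_gb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2

print(f"peak RSS so far: {peak_rss_gb():.2f} GB")
```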

zonca commented 5 years ago

had the same question:

If I run 4 processes per node, same total number of nodes:

- Initialize sky: 148 s
- Bandpass integration per channel (10 points like TOAST now): 80 s for the first channel (loading and caching CIB maps), 4 s for the others
- Smoothing: 7 s
- Memory (with memreport): 9.35 GB after initialization, 10.5 GB at the end
- Total time for 10 channels: 6 minutes

memreport is per node, right?

zonca commented 5 years ago

If I run 6 processes per node, 8 threads, same total number of nodes:

- Initialize sky: 137 s
- Bandpass integration per channel (10 points like TOAST now): 75 s for the first channel (loading and caching CIB maps), 2 s for the others
- Smoothing: 6 s
- Memory (with memreport): 7.7 GB after initialization, 11.6 GB at the end
- Total time for 10 channels: 5 minutes

keskitalo commented 5 years ago

Correct: memreport is by node. It is very impressive that you get the same walltime with 4 and 6 processes per node! Can you fit even more processes per node? Are those numbers the available memory on the node or the total allocated?

zonca commented 5 years ago

I'm using the "used" row before GC:

***** Completed
Walltime elapsed since the beginning 289.5 s
Memory usage
       total :  755.539 GB  <  755.539 +-    0.000 GB  <  755.539 GB
   available :  736.650 GB  <  740.300 +-    0.952 GB  <  740.941 GB                                                                         
     percent :    1.900 %   <    2.000 +-    0.128 %   <    2.500 %                                                                          
        used :    9.526 GB  <   10.134 +-    0.436 GB  <   11.656 GB                                                                         
        used :    9.526 GB  <    9.864 +-    0.349 GB  <   11.026 GB (after GC)                                                              
        free :  735.123 GB  <  744.262 +-    2.390 GB  <  744.942 GB                                                                         
      active :    3.085 GB  <    3.635 +-    0.278 GB  <    4.381 GB                                                                         
      active :    3.085 GB  <    3.333 +-    0.132 GB  <    3.574 GB (after GC)                                                              
    inactive :    0.661 GB  <    0.694 +-    2.023 GB  <    8.291 GB                                                                         
     buffers :    2.152 MB  <    2.152 +-    0.000 MB  <    2.152 MB                                                                         
      cached :    1.068 GB  <    1.145 +-    2.030 GB  <    8.758 GB                                                                         
      shared :  115.262 MB  <  126.734 +-    4.012 MB  <  134.121 MB                                                                         
        slab :    0.581 GB  <    0.698 +-    0.251 GB  <    1.622 GB  

I think I can; the problem is that Popeye has 768 GB of RAM per node, so there is no way I run out of memory...

keskitalo commented 5 years ago

Wow. That is a lot of memory. Without calling memreport inside PySM those numbers do not mean much.

zonca commented 5 years ago

what about seff?

2x24 job

Job ID: 49161
Cluster: slurm_cluster
User/Group: zonca/zonca
State: COMPLETED (exit code 0)
Nodes: 13
Cores per node: 48
CPU Utilized: 20:28:11
CPU Efficiency: 35.15% of 2-10:14:24 core-walltime
Job Wall-clock time: 00:05:36
Memory Utilized: 112.94 GB (estimated maximum)
Memory Efficiency: 1.16% of 9.52 TB (750.00 GB/node)

4x12 job

Job ID: 49301
Cluster: slurm_cluster
User/Group: zonca/zonca
State: COMPLETED (exit code 0)
Nodes: 13
Cores per node: 48
CPU Utilized: 18:10:32
CPU Efficiency: 31.58% of 2-09:32:48 core-walltime
Job Wall-clock time: 00:05:32
Memory Utilized: 192.46 GB (estimated maximum)
Memory Efficiency: 1.97% of 9.52 TB (750.00 GB/node)

6x8 job

Job ID: 49300
Cluster: slurm_cluster
User/Group: zonca/zonca
State: COMPLETED (exit code 0)
Nodes: 13
Cores per node: 48
CPU Utilized: 18:02:47
CPU Efficiency: 35.41% of 2-02:57:36 core-walltime
Job Wall-clock time: 00:04:54
Memory Utilized: 259.04 GB (estimated maximum)
Memory Efficiency: 2.66% of 9.52 TB (750.00 GB/node)

It seems reasonable that more MPI processes mean more memory.

keskitalo commented 5 years ago

Comparing 4x12 to 2x24, the memory consumption is 70% higher; 6x8 needs 130% more memory. Those are notable increases. If the maps were distributed evenly across the communicator, we would not expect any increase, would we?

Perhaps the map distribution in PySM is set up to broadcast the high-resolution maps across the communicator before each process extracts its part of the map? If that is the case, it drives up the memory high-water mark and can cause an unnecessary out-of-memory failure. Buffering the broadcast or using scatterv would avoid this.
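
A minimal sketch of that buffered-broadcast idea with mpi4py (the names `comm`, `full_map`, and `my_pixels` are illustrative, not the actual PySM/TOAST interfaces):

```python
# Sketch: broadcast the high-resolution map in ~100 MB pieces and let every
# rank keep only the pixels it owns, so the temporary overhead per process is
# one chunk instead of the whole map. Illustrative only.
import numpy as np
from mpi4py import MPI

def bcast_local_part(comm, full_map, my_pixels, chunk_bytes=100 * 1024**2):
    # full_map is a 1D array on rank 0 (None elsewhere); my_pixels holds the
    # global pixel indices owned by the calling rank.
    npix, dtype_str = comm.bcast(
        (full_map.size, full_map.dtype.str) if comm.rank == 0 else None, root=0
    )
    dtype = np.dtype(dtype_str)
    local = np.empty(my_pixels.size, dtype=dtype)
    step = max(1, chunk_bytes // dtype.itemsize)
    buf = np.empty(step, dtype=dtype)
    for start in range(0, npix, step):
        n = min(step, npix - start)
        if comm.rank == 0:
            buf[:n] = full_map[start:start + n]
        comm.Bcast(buf[:n], root=0)
        # keep only the pixels this rank owns from the current chunk
        sel = (my_pixels >= start) & (my_pixels < start + n)
        local[sel] = buf[my_pixels[sel] - start]
    return local
```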

This all seems compatible with my observation that at nside=4096 we cannot fit more than 2 MPI tasks on each KNL node. What happens when we go to nside=8192? TOAST includes a significant amount of Python glue that does not thread at all, so limiting the number of MPI tasks like this is problematic.

The actual execution times you get are much better than my timed-out run would have suggested. I wonder if OpSimPySM includes costly operations that are not covered by your test. We probably need a test that runs the entire operator, not just PySM.

zonca commented 5 years ago

yes, correct, I broadcast the full map. I looked into using scatterv, but each process owns rings that are not consecutive, and the maps are often 2D, so it would get very complicated.

Also, in the worst case of an IQU map at nside 4096, we pay a temporary increase of 2.3 GB of memory per MPI process. If you want to run 4 MPI processes per node, that is ~10 GB; or, comparing 2 MPI tasks with 4 per node, in the second case you temporarily pay an extra 4.6 GB per node. It doesn't seem too bad. I am more worried about other parts of the PySM operator.
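
A quick check of those numbers (assuming the maps are stored as float32; healpy is used only for the pixel count):

```python
# Back-of-the-envelope size of a full IQU map (float32 assumed)
import healpy as hp

for nside in (4096, 8192):
    gbytes = 3 * hp.nside2npix(nside) * 4 / 1e9   # I, Q, U at 4 bytes per pixel
    print(nside, round(gbytes, 1), "GB")
# 4096 -> ~2.4 GB (the ~2.3 GB quoted above), 8192 -> ~9.7 GB
```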

I'll try to install TOAST on Popeye and see if I can run a test of the full OpSimPySM.

keskitalo commented 5 years ago

So that is 40 GB of extra memory overhead for Nside=8192 with just 4 MPI processes per node? I don't think that is the best approach. If scatterv is not appropriate, then you can broadcast the map in 100 MB pieces.

zonca commented 5 years ago

Consider that this price is paid only at the beginning of the pipeline, so there are no timelines yet. Do you plan to have less than 40 GB of timelines per node at that point?

keskitalo commented 5 years ago

At this point the pointing has to be expanded already, so even if the signal doesn't exist yet, the pixel numbers and pointing weights are in memory. OpSimPySM may also be adding to an existing signal, so you cannot assume that memory is available as workspace.

Furthermore, we need to have 8 or 16 MPI tasks per node, not 2 or 4.

zonca commented 5 years ago

could we use MPI shared memory?

keskitalo commented 5 years ago

That would be a step in the right direction. It would still mean that nside=8192 requires about 10 GB of workspace, but at least that would not scale with the number of MPI tasks per node.

zonca commented 5 years ago

@keskitalo is the PySM call going out of memory in the initialization phase, during integration of the first channel, or later on?

keskitalo commented 5 years ago

I don't think the log gave that information. Program termination under OOM is immediate. I'll check once NERSC is back up.

zonca commented 5 years ago

See mpi shared memory example in TOAST: https://github.com/hpc4cmb/toast/blob/master/src/toast/mpi.py#L532
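
A minimal sketch of that idea with mpi4py: node-local shared memory gives one copy of the maps per node instead of one per process. The TOAST helper linked above wraps this more carefully; names and sizes here are illustrative.

```python
# Sketch: one copy of the IQU maps per node via MPI-3 shared memory.
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD
node = world.Split_type(MPI.COMM_TYPE_SHARED)     # ranks sharing this node
npix = 12 * 4096**2
itemsize = np.dtype(np.float32).itemsize

# Only rank 0 on each node allocates the buffer; the others attach to it.
win = MPI.Win.Allocate_shared(
    3 * npix * itemsize if node.rank == 0 else 0, itemsize, comm=node
)
buf, _ = win.Shared_query(0)
maps = np.ndarray((3, npix), dtype=np.float32, buffer=buf)

if node.rank == 0:
    maps[:] = 0.0        # rank 0 would fill in the input map here
node.Barrier()           # readers wait until the writer is done

# ... every rank can now read `maps` without a private copy ...

win.Free()               # release the window explicitly (see later in this thread)
node.Free()
```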

zonca commented 5 years ago

I think the ideal solution would be to have the input files in HDF5 and read them directly with an MPI-IO call into local buffers in each MPI process. @keskitalo would this be an option? Would this solution scale to the jobs you need for the current SO simulations?
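
Roughly what I have in mind, assuming h5py built against parallel HDF5 (the file name, dataset layout, and the `my_pixel_slice` helper are hypothetical):

```python
# Sketch: each rank reads only its own pixel range via MPI-IO.
# Requires h5py compiled with parallel HDF5; names below are hypothetical.
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
with h5py.File("template_nside4096.h5", "r", driver="mpio", comm=comm) as f:
    dset = f["map"]                         # e.g. shape (3, npix) for IQU
    my_slice = my_pixel_slice(comm.rank)    # hypothetical: this rank's pixel range
    local = dset[:, my_slice]               # reads only the requested range
```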

keskitalo commented 5 years ago

So every process would read the same file? I suppose it works until it doesn't. You should make sure to open the files in read-only mode, otherwise the access is not parallel. We usually prefer read-and-broadcast because it scales to very large concurrencies.

Are you planning on adding HDF5 support to healpy? It is a lot of boilerplate to handle otherwise.

zonca commented 5 years ago

Debugging the memory issue with this configuration (after having implemented the mpi_shared feature in #32):

The low-frequency LAT simulation has only 222 detectors. The nominal case runs with 32 nodes. I allocated 128 nodes for the realistic case, organized into groups of 16 nodes (8 MPI processes per node). This job fails with out-of-memory during OpSimPySM.

Submitted equivalent job on Popeye to understand memory usage.

zonca commented 5 years ago

ran on Cori the equivalent of 1 group: 16 nodes, 8 MPI processes per node; see memreport below:

Memory usage
       total :   94.250 GB  <   94.250 +-    0.000 GB  <   94.250 GB
   available :   71.759 GB  <   72.441 +-    0.188 GB  <   72.663 GB
   available :   72.415 GB  <   72.529 +-    0.058 GB  <   72.663 GB (after GC)
     percent :   22.900 %   <   23.100 +-    0.198 %   <   23.900 %
     percent :   22.900 %   <   23.000 +-    0.079 %   <   23.200 %  (after GC)
        used :    6.147 GB  <    6.356 +-    0.188 GB  <    7.027 GB
        used :    6.146 GB  <    6.270 +-    0.058 GB  <    6.378 GB (after GC)
        free :   72.016 GB  <   72.817 +-    0.198 GB  <   73.052 GB
        free :   72.672 GB  <   72.912 +-    0.079 GB  <   73.053 GB (after GC)
      active :    4.141 GB  <    4.233 +-    0.181 GB  <    4.889 GB
      active :    4.141 GB  <    4.155 +-    0.023 GB  <    4.242 GB (after GC)
    inactive :   13.991 GB  <   14.005 +-    0.046 GB  <   14.154 GB
     buffers :    9.590 MB  <   10.043 +-    7.348 MB  <   40.336 MB
      cached :   15.042 GB  <   15.051 +-    0.046 GB  <   15.200 GB
      shared :   14.621 GB  <   14.629 +-    0.006 GB  <   14.638 GB
        slab :    1.172 GB  <    1.181 +-    0.007 GB  <    1.198 GB

PySM is reading 20 maps at nside 4096 from disk, so distributed over 16 nodes that should be about 1 GB per node: (769 MB * 20) / 16 = 961.25 MB.

keskitalo commented 5 years ago

Are you calling memreport at the time of the highest memory requirement or after the operator finishes? Recall that memreport does not track the high-water mark, only the current memory.

zonca commented 5 years ago

In my tests on Popeye, memreport (called inside the operator) was pretty close to what I got with seff. @keskitalo what do you use to get the high-water mark on Cori?

keskitalo commented 5 years ago

I just call memreport where the memory consumption is expected to be the highest.

zonca commented 5 years ago

I asked NERSC; they don't have any suggestion on how to get the high-water mark properly. So I created 4 GB fake timelines for each process; now, even before PySM, we are at:

***** Created fake timelines of 4.00 GB per process
Walltime elapsed since the beginning 6.9 s
Memory usage 
       total :   94.250 GB  <   94.250 +-    0.000 GB  <   94.250 GB
   available :   57.072 GB  <   57.503 +-    0.166 GB  <   57.995 GB
   available :   57.072 GB  <   57.428 +-    0.119 GB  <   57.584 GB (after GC)
     percent :   38.500 %   <   39.000 +-    0.176 %   <   39.400 % 
     percent :   38.900 %   <   39.100 +-    0.117 %   <   39.400 %  (after GC)
        used :   35.075 GB  <   35.567 +-    0.168 GB  <   36.011 GB
        used :   35.486 GB  <   35.643 +-    0.120 GB  <   36.011 GB (after GC)
        free :   57.483 GB  <   57.911 +-    0.166 GB  <   58.402 GB
        free :   57.483 GB  <   57.839 +-    0.119 GB  <   57.991 GB (after GC)
      active :   32.367 GB  <   32.796 +-    0.113 GB  <   32.941 GB
      active :   32.777 GB  <   32.867 +-    0.028 GB  <   32.941 GB (after GC)
    inactive :  530.578 MB  <  540.186 +-   14.302 MB  <  589.172 MB
     buffers :    8.234 MB  <    8.707 +-    0.320 MB  <    9.312 MB
      cached :  766.215 MB  <  774.553 +-   14.428 MB  <  825.289 MB
      shared :  379.680 MB  <  388.158 +-   14.145 MB  <  436.152 MB
        slab :    1.137 GB  <    1.156 +-    0.012 GB  <    1.187 GB

at the end:

***** Ran smoothing channel 9
Walltime elapsed since the beginning 1428.4 s
Memory usage 
       total :   94.250 GB  <   94.250 +-    0.000 GB  <   94.250 GB
   available :   39.601 GB  <   40.327 +-    0.205 GB  <   40.539 GB
   available :   40.177 GB  <   40.445 +-    0.101 GB  <   40.544 GB (after GC)
     percent :   57.000 %   <   57.200 +-    0.219 %   <   58.000 % 
     percent :   57.000 %   <   57.100 +-    0.116 %   <   57.400 %  (after GC)
        used :   38.268 GB  <   38.470 +-    0.206 GB  <   39.200 GB
        used :   38.263 GB  <   38.353 +-    0.103 GB  <   38.634 GB (after GC)
        free :   39.989 GB  <   40.716 +-    0.205 GB  <   40.927 GB
        free :   40.566 GB  <   40.832 +-    0.100 GB  <   40.931 GB (after GC)
      active :   36.141 GB  <   36.246 +-    0.180 GB  <   36.906 GB
      active :   36.141 GB  <   36.174 +-    0.025 GB  <   36.246 GB (after GC)
    inactive :   13.996 GB  <   14.005 +-    0.014 GB  <   14.054 GB
     buffers :    9.504 MB  <    9.965 +-    0.309 MB  <   10.516 MB
      cached :   15.040 GB  <   15.049 +-    0.014 GB  <   15.097 GB
      shared :   14.621 GB  <   14.629 +-    0.014 GB  <   14.676 GB
        slab :    1.172 GB  <    1.191 +-    0.012 GB  <    1.222 GB

Now trying with 8GB per MPI process.

zonca commented 5 years ago

works fine also with 8GB timelines per MPI process:

***** Created fake timelines of 8.00 GB per process
Walltime elapsed since the beginning 11.5 s
Memory usage
       total :   94.250 GB  <   94.250 +-    0.000 GB  <   94.250 GB
   available :   25.311 GB  <   25.609 +-    0.225 GB  <   26.557 GB
   available :   25.311 GB  <   25.472 +-    0.108 GB  <   26.098 GB (after GC)
     percent :   71.800 %   <   72.800 +-    0.244 %   <   73.100 %
     percent :   72.300 %   <   73.000 +-    0.115 %   <   73.100 %  (after GC)
        used :   66.527 GB  <   67.475 +-    0.225 GB  <   67.763 GB
        used :   66.985 GB  <   67.611 +-    0.107 GB  <   67.763 GB (after GC)
        free :   25.717 GB  <   26.020 +-    0.226 GB  <   26.970 GB
        free :   25.717 GB  <   25.883 +-    0.109 GB  <   26.512 GB (after GC)
      active :   63.788 GB  <   64.733 +-    0.224 GB  <   64.968 GB
      active :   64.243 GB  <   64.884 +-    0.106 GB  <   64.968 GB (after GC)
    inactive :  531.605 MB  <  533.812 +-    1.932 MB  <  539.012 MB
     buffers :    8.695 MB  <    8.869 +-    0.197 MB  <    9.523 MB
      cached :  762.945 MB  <  765.391 +-    3.761 MB  <  780.059 MB
      shared :  379.426 MB  <  379.451 +-    1.207 MB  <  384.445 MB
        slab :    1.175 GB  <    1.179 +-    0.002 GB  <    1.183 GB

at the end:

***** Ran smoothing channel 9
Walltime elapsed since the beginning 1419.1 s
Memory usage
       total :   94.250 GB  <   94.250 +-    0.000 GB  <   94.250 GB
   available :    7.697 GB  <    8.354 +-    0.186 GB  <    8.496 GB
   available :    8.340 GB  <    8.446 +-    0.038 GB  <    8.496 GB (after GC)
     percent :   91.000 %   <   91.100 +-    0.200 %   <   91.800 %
     percent :   91.000 %   <   91.000 +-    0.061 %   <   91.200 %  (after GC)
        used :   70.315 GB  <   70.456 +-    0.186 GB  <   71.112 GB
        used :   70.315 GB  <   70.366 +-    0.036 GB  <   70.459 GB (after GC)
        free :    8.086 GB  <    8.744 +-    0.186 GB  <    8.884 GB
        free :    8.723 GB  <    8.836 +-    0.039 GB  <    8.884 GB (after GC)
      active :   68.167 GB  <   68.290 +-    0.185 GB  <   68.952 GB
      active :   68.167 GB  <   68.210 +-    0.030 GB  <   68.271 GB (after GC)
    inactive :   13.991 GB  <   14.000 +-    0.006 GB  <   14.017 GB
     buffers :    9.957 MB  <   10.131 +-    0.197 MB  <   10.785 MB
      cached :   15.038 GB  <   15.041 +-    0.004 GB  <   15.057 GB
      shared :   14.621 GB  <   14.621 +-    0.001 GB  <   14.626 GB
        slab :    1.210 GB  <    1.214 +-    0.002 GB  <    1.218 GB

zonca commented 5 years ago

10GB failed with OOM

zonca commented 5 years ago

Looking again at this, I realized that the MPI shared memory shows up under shared and is not released by Python's garbage collection. So I am now calling win.Free() explicitly, and memory usage went down. Now even 10 GB fake timelines per process work fine:

***** Ran smoothing channel 9
Walltime elapsed since the beginning 1449.1 s
Memory usage
       total :   94.250 GB  <   94.250 +-    0.000 GB  <   94.250 GB
   available :    5.609 GB  <    6.718 +-    0.273 GB  <    6.950 GB
   available :    6.005 GB  <    6.819 +-    0.214 GB  <    6.957 GB (after GC)
     percent :   92.600 %   <   92.900 +-    0.285 %   <   94.000 %
     percent :   92.600 %   <   92.800 +-    0.228 %   <   93.600 %  (after GC)
        used :   86.117 GB  <   86.348 +-    0.269 GB  <   87.435 GB
        used :   86.110 GB  <   86.245 +-    0.208 GB  <   87.039 GB (after GC)
        free :    6.010 GB  <    7.123 +-    0.274 GB  <    7.356 GB
        free :    6.406 GB  <    7.224 +-    0.215 GB  <    7.362 GB (after GC)
      active :   83.297 GB  <   83.411 +-    0.190 GB  <   84.059 GB
    inactive :  544.371 MB  <  555.404 +-    5.817 MB  <  571.734 MB
     buffers :    9.344 MB  <   10.277 +-    0.418 MB  <   10.945 MB
      cached :  781.066 MB  <  787.771 +-    7.591 MB  <  814.453 MB
      shared :  382.914 MB  <  388.900 +-    5.399 MB  <  406.613 MB
        slab :    1.123 GB  <    1.134 +-    0.012 GB  <    1.164 GB

See how the shared memory is reduced now. I'm closing this issue; reopen if needed.