bsc-pm / dlb

DLB (Dynamic Load Balancing) library is a tool, transparent to the user, that will dynamically react to the application imbalance modifying the number of resources at any given time.
https://pm.bsc.es/dlb
GNU Lesser General Public License v3.0

DLB_TALP_Attach() creates the shared-memory segment if it does not exist yet #7

Open kingshuk00 opened 1 year ago

kingshuk00 commented 1 year ago

Calling DLB_TALP_Attach() from an external process calls shmem_cpuinfo_ext__init() and shmem_procinfo_ext__init(). Both call open_shmem() -> shmem_init() -> shm_open() + ftruncate(), which creates the segment if it does not exist yet. Perhaps an existence check similar to the one in DLB_DROM_PreInit() (which calls shmem_procinfo_ext__preinit()) would be helpful. Alternatively, its existence could be checked in /dev/shm.
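To illustrate the kind of existence check I have in mind, here is a minimal sketch using only POSIX calls. It is not DLB code: the helper name shm_segment_exists() is hypothetical, and the caller has to supply the segment name, since how DLB names its segments is not covered here. The idea is simply that shm_open() without O_CREAT fails with ENOENT instead of creating the object.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper, not part of DLB.
 * Returns true if a POSIX shared-memory object called `name` already exists.
 * `name` must start with '/', as required by shm_open(). */
bool shm_segment_exists(const char *name) {
    /* No O_CREAT: a missing object makes shm_open() fail with ENOENT
     * rather than creating a new, empty segment. */
    int fd = shm_open(name, O_RDONLY, 0);
    if (fd == -1) {
        /* EACCES means the object exists but we may not open it.
         * Any other error is treated as "does not exist". */
        return errno == EACCES;
    }
    close(fd);
    return true;
}
```

On older glibc versions this may need to be linked with -lrt.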

kingshuk00 commented 12 months ago

I created a commit on my fork (link). Could you please let me know if this is not the intended behaviour?

vlopezh commented 12 months ago

I'm not sure. The thing with DLB_TALP_Attach and other mechanisms for attaching from a third-party process is that they are asynchronous.

Imagine you start just a monitor program first. If no other DLB program is running, the monitor program would exit with an error because there is no shared memory to attach to.

With the suggested change, one needs to start an application that uses TALP before a third-party program may attach to it. Whereas now, the third-party program may start, and sit idle waiting for TALP processes to start and monitor.

Is calling DLB_TALP_Attach() and creating an empty shared memory causing any problem?

kingshuk00 commented 12 months ago

My monitoring code looks like:

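#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <dlb_errors.h>
#include <dlb_talp.h>

int main(void) {
    int nprocs = 0, nelems = 0, error;
    double mpi_time, useful_time;

    DLB_TALP_Attach();
    DLB_TALP_GetNumCPUs(&nprocs);
    int *pids = malloc(sizeof(int) * nprocs);
    if (pids == NULL) return EXIT_FAILURE;

    /* Poll until at least one TALP process has registered. */
    DLB_TALP_GetPidList(pids, &nelems, nprocs);
    while (0 == nelems) {
        usleep(500000);
        DLB_TALP_GetPidList(pids, &nelems, nprocs);
    }

    /* Report the times of the first process for as long as it is alive. */
    while (!kill(pids[0], 0)) {
        error = DLB_TALP_GetTimes(pids[0], &mpi_time, &useful_time);
        if (error != DLB_SUCCESS) break;

        printf("%d, mpi time: %g; useful time: %g\n", pids[0], mpi_time, useful_time);
        usleep(500000);
    }

    DLB_TALP_Detach();
    free(pids);
    return EXIT_SUCCESS;
}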
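A side note on the kill(pids[0], 0) test in the loop: signal 0 only performs the existence and permission checks, so it works as a liveness probe, but it also fails with EPERM when the process exists and belongs to another user. A small sketch of a stricter check, assuming a hypothetical helper name process_exists() that is not part of DLB:

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical helper, not part of DLB.
 * Returns true if a process with this pid currently exists. */
bool process_exists(pid_t pid) {
    /* Signal 0 delivers nothing; kill() only runs its checks. */
    if (kill(pid, 0) == 0) {
        return true;
    }
    /* EPERM: the process exists but we are not allowed to signal it.
     * ESRCH (or anything else): treat it as gone. */
    return errno == EPERM;
}
```

In the monitor above this would replace the loop condition with while (process_exists(pids[0])).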

If I run the following command:

$ export DLB_ARGS="--talp --talp-external-profiler --verbose=shmem"
$ mpirun -np 3 env LD_PRELOAD=<dlb-install-dir>/lib/libdlb_mpi.so ./executable executable-options

kingshuk00 commented 12 months ago

By the way, thank you for explaining the intended behaviour. This is helpful, and I understand now that creating a shared-memory segment is not a problem in itself as long as it works as intended.

vlopezh commented 12 months ago

Oh, I see. There's a bug when DLB_TALP_Attach creates the shared memory and other processes expect a certain value which is not set. DLB warns about it and I think TALP is never enabled in this shared memory, that's why the external process doesn't see any other process:

DLB could not initialize the shared memory due to incompatible options among processes, likely ones sharing CPUs and others not. Please, if you believe this is a bug contact us at pm-tools@bsc.es

If you need it to work right now, I can think of a workaround:

diff --git a/src/LB_comm/shmem_procinfo.c b/src/LB_comm/shmem_procinfo.c
index 04ab8e4..7bb9a59 100644
--- a/src/LB_comm/shmem_procinfo.c
+++ b/src/LB_comm/shmem_procinfo.c
@@ -244,7 +244,8 @@ static int shmem_procinfo__init_(pid_t pid, pid_t preinit_pid, const cpu_set_t *
             if (shdata->allow_cpu_sharing != allow_cpu_sharing) {
                 // For now we require all processes registering the procinfo
                 // to have the same value in 'allow_cpu_sharing'
-                error = DLB_ERR_NOCOMP;
+                // error = DLB_ERR_NOCOMP;
+                shdata->allow_cpu_sharing = allow_cpu_sharing;
             }
         }

In any case, I will try to upload a proper fix in the coming days. Thanks.

kingshuk00 commented 11 months ago

Thanks, Victor, for the intermediate fix. I can confirm that it works. However, a DLB monitoring region with a very specific name ("MPI Region") has to be created explicitly, as in the following line, for the external monitoring program to fetch meaningful MPI and useful times:

dlb_monitor_t *mon = DLB_MonitoringRegionRegister("MPI Region");

I figured this out by examining DLB_TALP_Attach(). Otherwise, calling DLB_TALP_Attach() registers a region called "MPI Region" in TALP, but not as a monitor, and hence it is not updated from talp_[into/out_of]_sync_call() (nregions is not updated in DLB_talp.c).

Could you tell me whether this is a related issue? Otherwise, I shall create another issue.

vlopezh commented 11 months ago

Right, I've done some tests with an external profiler doing DLB_TALP_Attach() and obtaining metrics from the region is not working as it should.

Thanks for pointing it out. I will fix all of these things as part of this issue, so there is no need to create another one for now.

vlopezh commented 11 months ago

I think it should be fixed now, but let us know if you find anything else. You can also undo the workaround in LB_comm/shmem_procinfo.c once you update your main branch.

We've also implemented a function to do DLB_TALP_GetPidList + DLB_TALP_GetTimes at once, should you find it useful. A small pseudo-code example of a profiler would be:

Using DLB_TALP_GetPidList + DLB_TALP_GetTimes :

DLB_TALP_Attach();
while(...) {
    int pidlist[MAX_PROCS], nelems;
    DLB_TALP_GetPidList(pidlist, &nelems, MAX_PROCS);
    for (int n = 0; n < nelems; ++n) {
        int pid = pidlist[n];
        double mpi_time, useful_time;
        if (DLB_TALP_GetTimes(pid, &mpi_time, &useful_time) == DLB_SUCCESS) {
            printf("Found pid: %d, mpi_time: %f s, useful_time: %f s\n",
                    pid, mpi_time, useful_time);
        }
    }
}
DLB_TALP_Detach();

Using DLB_TALP_GetNodeTimes :

DLB_TALP_Attach();
while(...) {
    dlb_node_times_t node_times[MAX_PROCS];
    int nelems;
    DLB_TALP_GetNodeTimes(DLB_MPI_REGION, node_times, &nelems, MAX_PROCS);
    for (int n = 0; n < nelems; ++n) {
        printf("Found pid: %d, mpi_time: %"PRId64" ns, useful_time: %"PRId64" ns\n",
                node_times[n].pid,
                node_times[n].mpi_time,
                node_times[n].useful_time);
    }
}
DLB_TALP_Detach();

You could also call DLB_TALP_QueryPOPNodeMetrics to obtain synthesized node metrics. Also, let us know if these features cover your use case. Thanks.
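Note that the two examples print different units: seconds as double in the first, nanoseconds as 64-bit integers in the second. As a rough sketch of what a profiler might do with the nanosecond values, the following converts them to seconds and computes the share of measured time spent in useful computation. The struct is a hypothetical local stand-in with the three fields printed above (real code would use dlb_node_times_t), and the ratio is only an illustration, not the definition used by DLB_TALP_QueryPOPNodeMetrics.

```c
#include <stdint.h>

/* Hypothetical stand-in for the three fields printed in the example above;
 * real code would use dlb_node_times_t from the DLB headers. */
typedef struct {
    int pid;
    int64_t mpi_time;    /* nanoseconds */
    int64_t useful_time; /* nanoseconds */
} node_times_sample_t;

/* Convert a nanosecond count to seconds. */
double ns_to_seconds(int64_t ns) {
    return (double)ns / 1e9;
}

/* Share of the measured time (MPI + useful) spent in useful computation.
 * Returns 0.0 when nothing has been measured yet. */
double useful_fraction(const node_times_sample_t *t) {
    int64_t total = t->mpi_time + t->useful_time;
    if (total <= 0) {
        return 0.0;
    }
    return (double)t->useful_time / (double)total;
}
```

For instance, a process with 0.25 s of MPI time and 0.75 s of useful time gives a useful fraction of 0.75.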