LLNL / apollo

Apollo: Online Machine Learning for Performance Portability

Hang when running with HPCToolkit+Apollo #10

Closed DavidPoliakoff closed 3 years ago

DavidPoliakoff commented 3 years ago

On an x86+V100 system, using CUDA 11.0 and the develop branch of Apollo (currently 917cd5e), Apollo breaks when used as a Kokkos tool if HPCToolkit (a spack install of it) is in use. The app was ExaMiniMD, but the problem should reproduce with just about anything. There are five configurations:

1) ./the_application
2) KOKKOS_PROFILE_LIBRARY=/path/to/libapollo-tuner.so ./the_application
3) hpcrun ./the_application
4) KOKKOS_PROFILE_LIBRARY=/some/other/kokkos-tool.so hpcrun ./the_application
5) KOKKOS_PROFILE_LIBRARY=/path/to/libapollo-tuner.so hpcrun ./the_application

And every one of them works except number 5. Just bizarre.
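To tell a hang apart from a crash across the five configurations, each one can be run under a time limit. This is a minimal sketch, assuming GNU coreutils `timeout` is available; `classify_run` is a hypothetical helper name, not part of Apollo, Kokkos, or HPCToolkit.

```shell
#!/bin/sh
# classify_run SECONDS command [args...]
# Hypothetical helper: runs the command under a time limit and prints
# "ok", "hang", "signal N", or "exit N" depending on how it ended.
# Assumes GNU coreutils `timeout`, which exits with 124 when the limit
# is reached. A command that itself exits with 124 would be reported
# as a hang, so treat that label as a hint, not proof.
classify_run() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -eq 124 ]; then
        echo "hang"
    elif [ "$status" -gt 128 ]; then
        echo "signal $((status - 128))"
    else
        echo "exit $status"
    fi
}
```

For configuration 5, a call might look like `classify_run 60 env KOKKOS_PROFILE_LIBRARY=/path/to/libapollo-tuner.so hpcrun ./the_application`. Passing the variable through `env` keeps it scoped to that one run. The 60-second limit is an arbitrary placeholder and should exceed the application's normal runtime.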

mwkrentel commented 3 years ago

@DavidPoliakoff You haven't given us much to go on. Does it segfault, deadlock, what?

I'm wondering if there might be an overlap in the dependencies. Do you know the dependencies (e.g., the spack spec) for libapollo-tuner and ExaMiniMD?
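One rough way to look for such an overlap is to compare what `ldd` resolves for each binary and flag any library name that maps to two different files. The sketch below assumes glibc-style `ldd` output (`name => /path (address)`) and paths without spaces; `soname_paths` and `conflicting_sonames` are hypothetical helper names introduced here for illustration.

```shell
#!/bin/sh
# soname_paths: reads `ldd` output on stdin and prints sorted
# "soname path" pairs, skipping entries that have no resolved path
# (for example the vdso line).
soname_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $1, $3 }' | LC_ALL=C sort -u
}

# conflicting_sonames FILE_A FILE_B
# Each file holds soname_paths output. Prints every soname that
# resolves to a different path in the two files.
conflicting_sonames() {
    LC_ALL=C join "$1" "$2" |
        awk '$2 != $3 { print $1 ": " $2 " vs " $3 }'
}
```

A possible use: run `ldd /path/to/libapollo-tuner.so | soname_paths > tuner.deps` and `ldd ./the_application | soname_paths > app.deps`, then `conflicting_sonames tuner.deps app.deps`. Empty output means the two agree on every library they share. It does not rule out conflicts with libraries that are only loaded later at run time.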

jmellorcrummey commented 3 years ago

There are a few known issues in our master branch that I think we have addressed in our develop branch.

Perhaps you can take our develop branch for a spin with your application? The develop branch uses LD_AUDIT to track operations on dynamic libraries. At present, that can cause elevated overhead because glibc doesn't fill in the GOT when there might be a PLT auditor. We have a fix for x86 (a second auditor that fills in the GOT), but it is not yet in the develop branch, and we don't have the patch for Power or ARM yet either.

Issues at the intersection of applications, dynamic libraries, threads, monitoring, and forking are the worst!

DavidPoliakoff commented 3 years ago

@jmellorcrummey

"if someone opens dynamic libraries during the program execution rather than at startup, a quirk in the master branch can result in deadlock"

Well, KOKKOS_PROFILE_LIBRARY dlopens, for sure.

I just tried develop, and this works now. It's like you guys do this stuff professionally or something ;)

Thanks all, closing