[alpaka] Rewrite the initialisation of platforms and devices

fwyzard commented 1 year ago

Move the platform and device objects for the host and the accelerators to function static variables, initialised on the first call to the corresponding accessor functions.

Change

cms::alpakatools::initialise<Platform>(verbose);

to

ALPAKA_ACCELERATOR_NAMESPACE::initialise(verbose);

The underlying platform() and devices() are still templated on the Platform, so they are initialised only once even if there are multiple back-ends that share the same platform (e.g. serial and tbb).
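As a rough illustration of the function-static pattern described above, here is a minimal sketch. All names in it (`HostPlatform`, `GpuPlatform`, `PlatformState`, `init_count`, and the free `initialise<TPlatform>()`) are hypothetical stand-ins and not the actual alpaka or PR code; the only point is that a function static inside a template is constructed once per template argument, on first use.

```cpp
#include <cassert>
#include <string>

// Hypothetical platform tag types (assumptions, not real alpaka types).
struct HostPlatform { static constexpr const char* name = "host"; };
struct GpuPlatform { static constexpr const char* name = "gpu"; };

// Records how many times the platform object for TPlatform was constructed.
template <typename TPlatform>
int& init_count() {
  static int count = 0;
  return count;
}

// Hypothetical wrapper standing in for a real platform object.
template <typename TPlatform>
struct PlatformState {
  std::string name;
  PlatformState() : name(TPlatform::name) { ++init_count<TPlatform>(); }
};

// Function static: constructed on the first call only (thread-safe since C++11).
template <typename TPlatform>
PlatformState<TPlatform> const& platform() {
  static const PlatformState<TPlatform> instance;
  return instance;
}

// Hypothetical per-back-end entry point; several back-ends may pass the same platform.
template <typename TPlatform>
std::string initialise() {
  auto const& p = platform<TPlatform>();
  assert(!p.name.empty());  // sanity check for this sketch only
  return p.name;
}
```

Calling `initialise<HostPlatform>()` from two different back-ends constructs the host platform object only once, and a platform that is never requested is never constructed.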

Split alpakaDevices.h into alpaka/devices.h and alpaka/devices.cc, and host.h and host.cc.

Split alpakaConfig.h into config.h and common.h. The latter contains only the definitions that are independent of any back-end, in the alpaka_common namespace.

Finally, rename

alpakaWorkDiv.h to workdivision.h,
alpakaMemory.h to memory.h.

fwyzard commented 1 year ago

Here are some attempts at playing with templates.

I'm using g++ 11.4 and clang++ 18 (trunk).

The client code

`test.cc`

#include <iostream>
#include <string>

#include "template.h"

int main(void) {
  std::cout << get_value<int>() << std::endl;
  std::cout << get_value<int>() << std::endl;
  std::cout << get_value<int>() << std::endl;

  std::cout << std::endl;

  std::cout << get_value<unsigned short>() << std::endl;
  std::cout << get_value<unsigned short>() << std::endl;

  return 0;
}

The usual approach

`template.h`

#include <atomic>

template <typename T>
T get_value() {
  static std::atomic<T> value = 0;
  return value++;
}

This works, of course.

`extern` templates

This is the approach with explicit instantiation declarations (extern template) in the header and explicit instantiation definitions in the source file, as I understood it.

`template.h`

template <typename T>
T get_value();

extern template
int get_value<int>();

extern template
unsigned short get_value<unsigned short>();

`template.cc`

#include <atomic>

#include "template.h"

template <typename T>
T get_value() {
  static std::atomic<T> value = 0;
  return value++;
}

template
int get_value<int>();

template
unsigned short get_value<unsigned short>();

This works, resulting in

$ nm -A -C *.o | grep get_value
template.cc.o: 0000000000000000 W int get_value<int>()
template.cc.o: 0000000000000000 W unsigned short get_value<unsigned short>()
template.cc.o: 0000000000000000 u get_value<int>()::value
template.cc.o: 0000000000000000 u get_value<unsigned short>()::value
test.cc.o:                      U int get_value<int>()
test.cc.o:                      U unsigned short get_value<unsigned short>()

Removing the declarations

Removing the extern declarations also works:

`template.h`

template <typename T>
T get_value();

`template.cc`

#include <atomic>

#include "template.h"

template <typename T>
T get_value() {
  static std::atomic<T> value = 0;
  return value++;
}

template
int get_value<int>();

template
unsigned short get_value<unsigned short>();

and results in the same symbols:

$ nm -A -C *.o | grep get_value
template.cc.o: 0000000000000000 W int get_value<int>()
template.cc.o: 0000000000000000 W unsigned short get_value<unsigned short>()
template.cc.o: 0000000000000000 u get_value<int>()::value
template.cc.o: 0000000000000000 u get_value<unsigned short>()::value
test.cc.o:                      U int get_value<int>()
test.cc.o:                      U unsigned short get_value<unsigned short>()

Only one instantiation ?

I've tried using only one instantiation, to see if that would fail at compile or link time:

`template.h`

template <typename T>
T get_value();

extern template
int get_value<int>();

`template.cc`

#include <atomic>

#include "template.h"

template <typename T>
T get_value() {
  static std::atomic<T> value = 0;
  return value++;
}

template
int get_value<int>();

This compiles but fails at link time, because test.cc also calls get_value<unsigned short>(), which is never instantiated:

g++ -std=c++17 -O3 -g -Wall -fPIC -march=native -mtune=native  -c -MMD template.cc -o template.cc.o
g++ -std=c++17 -O3 -g -Wall -fPIC -march=native -mtune=native  -c -MMD test.cc -o test.cc.o
g++ -std=c++17 -O3 -g -Wall -fPIC -march=native -mtune=native template.cc.o test.cc.o    -o test
/usr/bin/ld: test.cc.o: in function `main':
/home/fwyzard/test/extern_template/v2_bad/test.cc:13: undefined reference to `unsigned short get_value<unsigned short>()'
/usr/bin/ld: /home/fwyzard/test/extern_template/v2_bad/test.cc:14: undefined reference to `unsigned short get_value<unsigned short>()'
collect2: error: ld returned 1 exit status

Specialisation vs instantiation ?

Finally, the approach from this PR:

`template.h`

#ifndef template_h
#define template_h

template <typename T>
T get_value();

#endif  // template_h

`template.cc`

#include <atomic>

#include "template.h"

template <>
int get_value<int>() {
  static std::atomic<int> value = 0;
  return value++;
}

template <>
unsigned short get_value<unsigned short>() {
  static std::atomic<unsigned short> value = 0;
  return value++;
}

This works and produces slightly different symbols:

$ nm -A -C *.o | grep get_value
template.cc.o: 0000000000000000 T int get_value<int>()
template.cc.o: 0000000000000020 T unsigned short get_value<unsigned short>()
template.cc.o: 0000000000000004 b get_value<int>()::value
template.cc.o: 0000000000000000 b get_value<unsigned short>()::value
test.cc.o:                      U int get_value<int>()
test.cc.o:                      U unsigned short get_value<unsigned short>()

Now the template specialisations are marked as T (code) instead of W (weak), and the function static variables are marked as b (BSS local) instead of u (unique global symbol).

If we use explicit template instantiations (with or without the extern declarations), it is as if we were using inline functions: the linker has to resolve the duplicates at link time, and GCC uses the special u symbol to instruct the dynamic linker to keep a single instance of the static variables (clang++ marks them as V, for weak objects).

If we use the explicit template specialisation it's like we are using normal functions: they should be defined in a single translation unit, and the static variables are marked as local.
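To make the distinction concrete, here is a minimal single-file sketch that puts the two forms side by side. The function names `next_instantiated` and `next_specialised` are hypothetical and chosen only for this illustration; the behaviour at run time is the same, and only the way the symbols are emitted differs.

```cpp
#include <atomic>

// Form 1: the primary template has a definition, followed by an
// explicit instantiation definition for int.
template <typename T>
T next_instantiated() {
  static std::atomic<T> value{0};
  return value++;
}

template int next_instantiated<int>();  // explicit instantiation

// Form 2: the primary template is only declared, and int gets an
// explicit specialisation, which behaves like an ordinary function.
template <typename T>
T next_specialised();

template <>
int next_specialised<int>() {  // explicit specialisation
  static std::atomic<int> value{0};
  return value++;
}
```

Both functions count up from zero independently of each other. In a multi-file layout, the C++ standard normally expects an explicit specialisation to be declared (for example `template <> int next_specialised<int>();` in the header) before its first use in every translation unit, so the second form likely needs that extra declaration to be strictly conforming.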

Keeping in mind the potential use cases in CMSSW as well, which approach do we prefer for libraries?

fwyzard commented 1 year ago

Here are the results of some follow up checks.

When loading shared libraries with RTLD_LAZY | RTLD_GLOBAL (as we do in CMSSW) all approaches work correctly.

When loading shared libraries with RTLD_LAZY | RTLD_LOCAL:

"The usual approach", "extern templates", and "Removing the declarations" work as expected;
"Specialisation vs instantiation ?" results in separate instances of the static variable.

So, right now my preference would be to use the "Removing the declarations" approach.

makortel commented 1 year ago

After spending some time with Chris trying to understand various aspects of the topic, we agree with the "Removing the declarations" approach.

fwyzard commented 1 year ago

Thanks, I've updated the platform() and devices() part accordingly.

I think I found a better solution for the initialise() functions.

makortel commented 1 year ago

Looks good to me (I also like the new way to handle initialise()).
