envelope-project / laik


MPI_Backend Change to support Lookup Rank by Hostname #165

Closed: twittidai closed this 4 years ago

twittidai commented 6 years ago

A global data structure is created at initialisation time in order to track the rank-to-hostname mapping globally. This data structure can be used for the proposed backend function "get_failed_ranks_by_hostname".
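For illustration, a minimal sketch of what such a table could look like in the MPI backend, assuming the hostnames are gathered with MPI_Allgather at initialisation; the names and layout below are assumptions, not the actual code of this PR:

```c
/* Illustrative sketch only: at MPI backend initialisation, every rank
 * publishes its hostname so that each process ends up with the full
 * rank-to-hostname mapping. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int  rank;
    char hostname[MPI_MAX_PROCESSOR_NAME];
} RankLocation;

static RankLocation *rank_locations = NULL; /* one entry per rank, on every process */

static void init_rank_locations(MPI_Comm comm)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME] = {0};

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Get_processor_name(name, &len);

    /* gather every rank's hostname on all processes */
    char *all = malloc((size_t) size * MPI_MAX_PROCESSOR_NAME);
    MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  all, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, comm);

    rank_locations = malloc((size_t) size * sizeof(RankLocation));
    for (int i = 0; i < size; i++) {
        rank_locations[i].rank = i;
        memcpy(rank_locations[i].hostname,
               all + (size_t) i * MPI_MAX_PROCESSOR_NAME,
               MPI_MAX_PROCESSOR_NAME);
    }
    free(all);
}
```

A function like the proposed "get_failed_ranks_by_hostname" could then simply filter this table by hostname.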

weidendo commented 6 years ago

Is this really MPI-specific? The processor_name already gets passed and can be queried as "location" in the backend-independent LAIK instance struct. And the (not yet working) key-value store is meant to be used to make locally available information globally known, i.e. exactly for cases like this.

I do not like extending the backend interface to enable quick hacks that work around not-yet-existing generic functionality for very specific cases, because it will be difficult to get rid of such hacks later. I would really suggest instead implementing just the part of the KV store functionality that is needed to make this work, i.e. doing the MPI broadcast in the sync() backend method for a new KV entry type named e.g. LAIK_KV_MergedLocalData. Further, we need to think about garbage collecting such entries when a group gets deleted because it is replaced by a shrunk group. Or how is this expected to work with this PR?
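As a hedged sketch of that idea: in the backend's sync(), every process contributes a variable-length local entry and all processes receive the concatenated result. The entry type LAIK_KV_MergedLocalData is only named in this discussion, so the function name and plumbing below are assumptions.

```c
/* Sketch only: merge one variable-length local entry from every process so
 * that locally available information becomes globally known. The caller owns
 * the returned buffer; entries are concatenated, each '\0'-terminated. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static char* sync_merge_local_data(MPI_Comm comm, const char *local, int *total_len)
{
    int size, len = (int) strlen(local) + 1;
    MPI_Comm_size(comm, &size);

    /* first exchange the entry lengths ... */
    int *lens = malloc(size * sizeof(int));
    MPI_Allgather(&len, 1, MPI_INT, lens, 1, MPI_INT, comm);

    int *displs = malloc(size * sizeof(int));
    *total_len = 0;
    for (int i = 0; i < size; i++) {
        displs[i] = *total_len;
        *total_len += lens[i];
    }

    /* ... then the entries themselves */
    char *merged = malloc(*total_len);
    MPI_Allgatherv(local, len, MPI_CHAR,
                   merged, lens, displs, MPI_CHAR, comm);

    free(lens);
    free(displs);
    return merged;
}
```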

Anyway, before going down that path at all: how is this expected to be used? Instead of making this list globally available, you could also ask every process whether it matches a hostname. I think there must be a mechanism to broadcast external requests to all processes anyway, in sync with application progress.

twittidai commented 6 years ago

There is a discussion to be had here: either LAIK is responsible for the location where these threads are working, or the backend is. For me, I think both may work. If this information stays with the backend, and the backend may change the physical location (e.g. after a transparent VM migration), there must be a mechanism for LAIK to get this information.

But you are also right, this information could be provided by the KV store. However, given the MPI Sessions backend, there is no need to get this information via sync() at all, as you may ask the MPI Session at any time. This is also going to be incorporated into FLUX as well as PMIx. Therefore I suggest having "mapIdbyHostname()" as a backend feature that is backend-specific.

This is expected to be used in an SPMD approach for (shrinking) repartitioning. Given a hostname, each process creates the "removeList" and triggers repartitioning. Syncing and guaranteeing the repartitioning trigger point is the job of the application, e.g. by calling "allow_repartitioning" only when "iteration % 5 == 0" or similar.
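A purely illustrative sketch of that pattern; none of the names below are LAIK API. Because every process evaluates the same rank-to-hostname mapping, all of them compute the identical remove list and can trigger the same repartitioning.

```c
#include <string.h>

/* Build the identical "removeList" on every process from the shared
 * rank-to-hostname mapping (illustrative only). Returns the number of
 * process indices written to remove_list. */
int build_remove_list(int num_procs, const char *hostnames[],
                      const char *failed_host, int *remove_list)
{
    int n = 0;
    for (int i = 0; i < num_procs; i++)
        if (strcmp(hostnames[i], failed_host) == 0)
            remove_list[n++] = i;
    return n;
}

/* The application decides when repartitioning is allowed, e.g. only every
 * 5th iteration, so all processes reach the trigger point together. */
int allow_repartitioning(int iteration)
{
    return (iteration % 5) == 0;
}
```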

weidendo commented 6 years ago

If VMs are involved, the MPI backend (using MPI_Get_processor_name) only sees virtual node names, which are not of much use. The real info (which physical host an MPI process within a VM is running on) has to come from the outside and needs to overwrite the hostname mapping. I would expect this external info to come in via e.g. MQTT. I do not see anything backend-specific here...?

Anyway, I agree it is good to have the mapping duplicated in every process (at least as a first step), so that everybody can calculate a shrunk group to switch to. On shrinking, there would be a broadcast to everybody about which host to retreat from. For both I think we can use the KV store.

I'll try to come up with a patch for that - should not be much different than what you have.

I leave this PR open as reminder.

twittidai commented 6 years ago

Hi,

> If VMs are involved, the MPI backend (using MPI_Get_processor_name) only sees virtual node names, which are not of much use.

I think this is still useful, as the information from outside always has to return a name that is known to LAIK, which means an agent must get the VM names in this case. This was also decided at the extended meeting in Aachen last time.

> Anyway, I agree it is good to have the mapping duplicated in every process (at least as a first step), so that everybody can calculate a shrunk group to switch to. On shrinking, there would be a broadcast to everybody about which host to retreat from. For both I think we can use the KV store.

+1. Sure. Then I will get an MQTT agent running this way.

Thanks.

weidendo commented 4 years ago

The functionality is now provided via KVS sync of "locations", which works with both the MPI and the TCP backend as part of the sync() backend interface. For this:

(1) call laik_sync_location(instance) - this may always be done in laik_init() in the future
(2) get the location string for process i in group g (which has the format "hostname:PID") with laik_group_location(g, i)

You most probably want to traverse the list of processes in the current world to get the list of processes running on a given host. Note that the current world may change during runtime, and it is better to ask for it with laik_world(instance) before use.
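A minimal sketch of that usage, assuming the signatures exactly as described above (laik_sync_location taking the instance, laik_group_location returning a "hostname:PID" string); error handling is omitted:

```c
#include <stdio.h>
#include <string.h>
#include <laik.h>

int main(int argc, char **argv)
{
    Laik_Instance *inst = laik_init(&argc, &argv);

    /* (1) make every process's location globally known via KVS sync */
    laik_sync_location(inst);

    /* ask for the current world right before use, as it may change at runtime */
    Laik_Group *world = laik_world(inst);

    const char *host = "node01"; /* example hostname we look for */
    size_t hlen = strlen(host);

    for (int i = 0; i < laik_size(world); i++) {
        /* (2) the location string has the format "hostname:PID" */
        const char *loc = laik_group_location(world, i);
        if (loc && strncmp(loc, host, hlen) == 0 && loc[hlen] == ':')
            printf("process %d runs on %s\n", i, host);
    }

    laik_finalize(inst);
    return 0;
}
```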

I close this now.