DifferentiableUniverseInitiative / mesh

Mesh TensorFlow: Model Parallelism Made Easier
Apache License 2.0

adapt mtf.random to hvd #5

Closed · b-remy closed this 3 years ago

b-remy commented 3 years ago

To solve issue #3, I adapted the `random` function in `mesh_tensorflow/hvd_simd_mesh_impl.py`.

Since GPUs have seeds enabled (while TPUs do not; see the comment in `mesh_tensorflow/simd_mesh_impl.py`), I made sure that when a seed is specified, it is split across the mesh slices so that we do not end up with the same tensor on every slice.
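Concretely, the idea is something like the following minimal sketch (the helper name and the exact way the process rank is mixed into the seed are illustrative, not the code in this PR):

```python
import horovod.tensorflow as hvd
import tensorflow as tf

def slice_random_uniform(slice_shape, seed=None):
  # Illustrative helper (assumes hvd.init() has been called): give each
  # Horovod process its own seed so that the slices of a split tensor are
  # not identical across processes.
  if seed is not None:
    seed = seed + hvd.rank()  # mix the process index into the base seed
  return tf.random.uniform(slice_shape, seed=seed)
```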

Examples:

- for a 1x2 mesh, creating a 1x2 tensor (each process prints its own slice):

  Final result [[23.90374]]
  Final result [[10.086262]]


So it seems to work well now.

However, I could not think of a case where the tensor would not be distributed and where we would need to broadcast it. See `random` in `mesh_tensorflow/placement_mesh_impl.py`:

> seeds are necessary to make sure that slices that should have the same values actually do have the same values.
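Just to make that comment concrete, a toy illustration in plain TensorFlow (not the mesh code itself): replicated slices only stay identical if they are drawn with the same seed.

```python
import tensorflow as tf

base_seed = 1234
# Two slices that are supposed to be replicas stay identical only if they
# are drawn with the same seed...
a = tf.random.stateless_uniform([4], seed=[base_seed, 0])
b = tf.random.stateless_uniform([4], seed=[base_seed, 0])  # identical to a
# ...while a different seed gives different values.
c = tf.random.stateless_uniform([4], seed=[base_seed, 1])  # differs from a
```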

EiffL commented 3 years ago

This looks good :-) Here is my question: what happens if you have something like

import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

batch_dim = mtf.Dimension("batch", 2)
nx_dim = mtf.Dimension("nx", 8)

# nx is not mapped to any mesh axis, so this tensor is not split
a = mtf.random_uniform(mesh, shape=[nx_dim])

mesh_shape = [("row", 2)]
layout_rules = [("batch", "row")]
EiffL commented 3 years ago

`nx_dim` is not distributed, so we expect each process to end up with the same tensor.
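One way to check that across ranks, as a rough sketch (the allgather comparison below is just an illustration, not part of the mesh implementation):

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Stand-in for the [8] slice of `a` held by this process; in a real test
# this would be the lowered result coming out of the mesh implementation.
local_slice = tf.constant([0.0] * 8)

# Gather every process's copy and check that all copies agree, which is
# what we expect when nx is not split across the mesh.
gathered = hvd.allgather(tf.expand_dims(local_slice, 0))  # [num_procs, 8]
all_identical = tf.reduce_all(tf.equal(gathered, gathered[0]))
```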

b-remy commented 3 years ago

Yeah, it does not output the same tensor on both processes:

Final result [0.10086262 0.9701668  0.8487642  0.04828131 0.04852307 0.77747464
 0.844468   0.41707492]
Final result [0.2390374  0.92039955 0.05051243 0.49574447 0.8355223  0.02647042
 0.08811307 0.4566604 ]

The question I am asking myself is: how can we get the same seed along the axes that are not distributed, and different seeds for the distributed axes?
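One possible direction (only a sketch of the idea, not tested against the actual SimdMeshImpl/hvd code): derive the per-slice seed from the process coordinates along the mesh axes the tensor is actually split over, and ignore the axes it is replicated over, so replicated slices share a seed and split slices do not.

```python
def slice_seed(base_seed, process_coordinates, split_mesh_axes):
  """Hypothetical helper combining the base seed with the coordinates of
  this process along the mesh axes that actually split the tensor.

  process_coordinates: this process's coordinate along each mesh axis.
  split_mesh_axes: indices of the mesh axes the tensor is split over.
  Processes that only differ along non-split axes get the same seed (their
  replicated slices stay identical); processes that differ along a split
  axis get different seeds (their slices differ).
  """
  seed = base_seed
  for axis in split_mesh_axes:
    seed = seed * 1000003 + process_coordinates[axis]
  return seed

# e.g. with mesh_shape [("row", 2)] and a tensor that is not split at all:
# split_mesh_axes = [], so both processes get base_seed and identical values.
```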