DifferentiableUniverseInitiative / mesh

Mesh TensorFlow: Model Parallelism Made Easier
Apache License 2.0

adapt mtf.random to hvd #5

Closed · b-remy closed this 3 years ago

b-remy commented 3 years ago

To solve issue #3, I adapted the `random` function in `mesh_tensorflow/hvd_simd_mesh_impl.py`.

Since GPUs have seeds enabled (while TPUs do not; see the comment in `mesh_tensorflow/simd_mesh_impl.py`), I made sure that when a seed is specified, it is split across the mesh slices so that we do not end up with the same tensor on every slice.
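Concretely, the idea is something like the following minimal sketch (the helper name and the exact way the process rank is mixed into the seed are illustrative, not the code in this PR):

```python
import horovod.tensorflow as hvd
import tensorflow as tf

def slice_random_uniform(slice_shape, seed=None):
  # Illustrative helper (assumes hvd.init() has been called): give each
  # Horovod process its own seed so that the slices of a split tensor are
  # not identical across processes.
  if seed is not None:
    seed = seed + hvd.rank()  # mix the process index into the base seed
  return tf.random.uniform(slice_shape, seed=seed)
```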

Examples:

- for a 1x2 mesh, creating a 1x2 tensor (each process prints its own slice):

  Final result [[23.90374]]
  Final result [[10.086262]]


So it seems to work well now.

However, I could not think of a case where the tensor would not be distributed and where we would need to broadcast it. See `random` in `mesh_tensorflow/placement_mesh_impl.py`:

> seeds are necessary to make sure that slices that should have the same values actually do have the same values.
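Just to make that comment concrete, a toy illustration in plain TensorFlow (not the mesh code itself): replicated slices only stay identical if they are drawn with the same seed.

```python
import tensorflow as tf

base_seed = 1234
# Two slices that are supposed to be replicas stay identical only if they
# are drawn with the same seed...
a = tf.random.stateless_uniform([4], seed=[base_seed, 0])
b = tf.random.stateless_uniform([4], seed=[base_seed, 0])  # identical to a
# ...while a different seed gives different values.
c = tf.random.stateless_uniform([4], seed=[base_seed, 1])  # differs from a
```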

EiffL commented 3 years ago

This looks good :-) Here is my question: what happens if you have something like

import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

batch_dim = mtf.Dimension("batch", 2)
nx_dim = mtf.Dimension("nx", 8)

# nx is not mapped to any mesh axis, so this tensor is not split
a = mtf.random_uniform(mesh, shape=[nx_dim])

mesh_shape = [("row", 2)]
layout_rules = [("batch", "row")]
EiffL commented 3 years ago

`nx_dim` is not distributed, so we expect each process to end up with the same tensor.
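One way to check that across ranks, as a rough sketch (the allgather comparison below is just an illustration, not part of the mesh implementation):

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Stand-in for the [8] slice of `a` held by this process; in a real test
# this would be the lowered result coming out of the mesh implementation.
local_slice = tf.constant([0.0] * 8)

# Gather every process's copy and check that all copies agree, which is
# what we expect when nx is not split across the mesh.
gathered = hvd.allgather(tf.expand_dims(local_slice, 0))  # [num_procs, 8]
all_identical = tf.reduce_all(tf.equal(gathered, gathered[0]))
```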

b-remy commented 3 years ago

Yeah, it does not output the same tensor on both processes:

Final result [0.10086262 0.9701668  0.8487642  0.04828131 0.04852307 0.77747464
 0.844468   0.41707492]
Final result [0.2390374  0.92039955 0.05051243 0.49574447 0.8355223  0.02647042
 0.08811307 0.4566604 ]

The question I am asking myself is: how can we get the same seed along the axes that are not distributed, and different seeds for the distributed axes?
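One possible direction (only a sketch of the idea, not tested against the actual SimdMeshImpl/hvd code): derive the per-slice seed from the process coordinates along the mesh axes the tensor is actually split over, and ignore the axes it is replicated over, so replicated slices share a seed and split slices do not.

```python
def slice_seed(base_seed, process_coordinates, split_mesh_axes):
  """Hypothetical helper combining the base seed with the coordinates of
  this process along the mesh axes that actually split the tensor.

  process_coordinates: this process's coordinate along each mesh axis.
  split_mesh_axes: indices of the mesh axes the tensor is split over.
  Processes that only differ along non-split axes get the same seed (their
  replicated slices stay identical); processes that differ along a split
  axis get different seeds (their slices differ).
  """
  seed = base_seed
  for axis in split_mesh_axes:
    seed = seed * 1000003 + process_coordinates[axis]
  return seed

# e.g. with mesh_shape [("row", 2)] and a tensor that is not split at all:
# split_mesh_axes = [], so both processes get base_seed and identical values.
```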