Open soumyac1999 opened 1 year ago
@soumyac1999 I am seeing an input tensor shape of [16, 15, 8, 10]
and an output tensor shape of [16, 15, 1, 10]
for the ReduceSum operator in FlexFlow. Do you think the tensor shapes are correct?
The input shape looks correct. But with keepdims=False (https://github.com/flexflow/FlexFlow/blob/master/python/flexflow/keras/backend/backend_functions.py#L39), I would expect the output shape to be [16, 15, 10]
That's right. The reduced dim will be eliminated and the output shape will be [16, 15, 10]
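For reference, NumPy follows the same keepdims convention, so the two shapes discussed above can be checked quickly. This is a NumPy-only sketch using the shapes from this thread; it does not exercise the FlexFlow ReduceSum operator itself.

```python
import numpy as np

# Shape taken from the thread; the values themselves are arbitrary.
x = np.zeros((16, 15, 8, 10), dtype=np.float32)

# keepdims=True leaves the reduced axis in place with size 1.
kept = x.sum(axis=2, keepdims=True)
# keepdims=False removes the reduced axis entirely.
dropped = x.sum(axis=2, keepdims=False)

print(kept.shape)     # (16, 15, 1, 10)
print(dropped.shape)  # (16, 15, 10)
```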
@soumyac1999 I think the misalignment is caused by floating point errors of different ReduceSum implementations. The maximum error is 9.536743e-07, which should be acceptable for training, as DNN training is generally robust to floating point errors. Have you experienced any accuracy issues caused by this?
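To make the floating point argument concrete, here is a minimal NumPy sketch that sums the same float32 data along axis 2 in two different orders. The random data, the seed, and the helper name `sum_axis2` are assumptions for illustration only; this is not how cuDNN or Keras order their reductions. It also prints the spacing between adjacent float32 values at magnitude 8, which is 2**-20 (about 9.536743e-07), the same size as the reported maximum error. That would be consistent with a one-ulp difference if the affected output lies in [8, 16), though the thread does not confirm the magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, illustration only
x = rng.standard_normal((16, 15, 8, 10)).astype(np.float32)


def sum_axis2(x, order):
    """Accumulate slices along axis 2 in float32, in the given order."""
    acc = np.zeros((x.shape[0], x.shape[1], x.shape[3]), dtype=np.float32)
    for k in order:
        acc += x[:, :, k, :]
    return acc


forward = sum_axis2(x, range(8))
backward = sum_axis2(x, reversed(range(8)))
reference = x.astype(np.float64).sum(axis=2)  # higher-precision reference

max_abs_diff = np.abs(forward - backward).max()
print("forward vs backward:", max_abs_diff)
print("forward vs float64 :", np.abs(forward - reference).max())

# Spacing between adjacent float32 values at magnitude 8 (one ulp): 2**-20.
ulp_at_8 = np.spacing(np.float32(8.0))
print("ulp at 8.0:", ulp_at_8)
```

Both orderings are valid float32 sums; any difference between them comes only from rounding at each accumulation step.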
@jiazhihao, for at least some inputs, the differences are much larger, which makes me think it might not be just floating point errors. For example, with the attached inputs, the max relative error is 0.0038.
Regarding accuracy issues due to this, the activations are blowing up to large positive/negative values which is making learning difficult. However, I haven't ruled out other sources of error.
Code:
r"""
To get flexflow outputs:
    flexflow_python sum_bug_align.py -ll:py 1 -ll:gpu 1 -ll:fsize 15000 \
        -ll:zsize 4000 --only-data-parallel -b 16
To check with keras/torch:
    python sum_bug_align.py
"""
import numpy as np


def top_level_task():
    from flexflow import keras
    import flexflow.keras.optimizers

    intermediate_tensors = []
    inp_x = keras.layers.Input(shape=(512*16*256,))
    x = inp_x
    x = keras.layers.Reshape((512, 16, 256))(x)
    intermediate_tensors.append(x)
    x = keras.backend.sum(x, axis=2)
    intermediate_tensors.append(x)

    model = flexflow.keras.models.Model(inp_x, x)
    opt = flexflow.keras.optimizers.SGD(learning_rate=0.01)
    model.compile(optimizer=opt, loss='mean_squared_error',
                  metrics=['mean_squared_error'])

    bs = model.ffconfig.batch_size
    assert bs == 16
    x = np.load('sum_data.npy').reshape(16, 512*16*256)
    y = np.random.randn(bs, 512, 256).astype(np.float32)
    parameters = [l.get_weights(model.ffmodel)
                  for l in model.layers if hasattr(l, 'get_weights')]
    model.fit(x=x, y=y, epochs=1)
    tensor_vals = [t.ffhandle.get_tensor(model.ffmodel)
                   for t in intermediate_tensors]
    return bs, parameters, x, y, tensor_vals


def match(i, x, y):
    # x - ff, y - keras
    y = y.numpy()
    print('Matching', i, np.abs(x-y).max(), np.abs((x-y)/(x+1e-20)).max(),
          100*np.sum(x!=y)/x.size)
    # idx = np.where(np.abs((x - y)/(x+1e-20)) > 1e-2)
    # print(idx)
    # print(x[idx].ravel()[:10])
    # print(y[idx].ravel()[:10])


def keras_task(bs, parameters, inps, y, tensor_vals):
    import tensorflow.keras as keras

    idx = 0
    x = inps
    x = keras.layers.Reshape((512, 16, 256))(x)
    match('reshape', tensor_vals[idx], x)
    prev_x = x
    x = tensor_vals[idx]
    idx += 1
    x = keras.backend.sum(x, axis=2)
    match('sum', tensor_vals[idx], x)


if __name__ == '__main__':
    import pickle
    import sys

    if len(sys.argv) > 1:
        t = top_level_task()
        pickle.dump(t, open('dump_sum.pkl', 'wb'))
    else:
        t = pickle.load(open('dump_sum.pkl', 'rb'))
        keras_task(*t)
Inputs: https://drive.google.com/file/d/1c59Su2R5NiXdZYZ7HjTBIG0Ut28TpHoI/view?usp=share_link
> the max relative error is 0.0038.
If the maximum relative error is 0.38%, I would say this is likely caused by floating point errors, since our implementation directly uses cuDNN: https://github.com/flexflow/FlexFlow/blob/master/src/ops/reduce.cu#L53-L71. Any configuration error would result in much larger relative errors.
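One way a relative error of this size can arise from a tiny absolute error is when the summed value is close to zero, which can happen when positive and negative terms cancel. The sketch below computes the same three quantities that `match()` prints, using made-up values: the helper name `error_metrics`, the sample outputs, and the one-ulp perturbation of 2**-20 are all assumptions for illustration. It does not show that the attached inputs behave this way, only that an absolute error near 9.5e-07 on an output near 2.5e-4 already yields roughly 0.38%.

```python
import numpy as np


def error_metrics(x, y, eps=1e-20):
    """Max absolute error, max relative error (w.r.t. x), and % of differing elements."""
    diff = np.abs(x - y)
    max_abs = diff.max()
    max_rel = (diff / np.abs(x + eps)).max()
    pct_diff = 100.0 * np.mean(x != y)
    return max_abs, max_rel, pct_diff


# Hypothetical outputs: two of ordinary magnitude and one that nearly cancelled to zero.
x = np.array([8.0, 1.0, 2.5e-4], dtype=np.float32)
# Perturb every element by the same small absolute amount (2**-20, about 9.54e-07).
y = x + np.float32(2.0 ** -20)

max_abs, max_rel, pct_diff = error_metrics(x, y)
print("max abs error:", max_abs)
print("max rel error:", max_rel)  # dominated by the near-zero element, roughly 0.0038
print("% differing  :", pct_diff)
```

Reporting the absolute error alongside the relative error, or masking out near-zero outputs before dividing, helps separate this effect from a genuine mismatch.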
> the activations are blowing up to large positive/negative values which is making learning difficult
Do you want to find a time next week to work on this together?
Got it, thanks. I'll spend some more time checking whether I can find any patterns, and will get back to you if not.
@soumyac1999 Did we end up concluding that this was just floating point error?
Example:
Sample output: Max absolute error: 9.536743e-07 Max relative error: 8.060183e-05 Element differing: 59.875%