
Bug in reduce_sum #723

Open · soumyac1999 opened this issue 1 year ago

soumyac1999 commented 1 year ago

Example:

r"""
To get flexflow outputs:
flexflow_python sum_bug_align.py  -ll:py 1 -ll:gpu 1 -ll:fsize 15000 \
     -ll:zsize 4000 --only-data-parallel -b 16

To check with keras:
python sum_bug_align.py

"""

import numpy as np

def top_level_task():
    from flexflow import keras
    import flexflow.keras.optimizers

    intermediate_tensors = []

    inp_x = keras.layers.Input(shape=(15*8*10,))

    x = inp_x  # shape: (b, 15*8*10)
    x = keras.layers.Reshape((15, 8, 10))(x); intermediate_tensors.append(x)
    x = keras.backend.sum(x, axis=2); intermediate_tensors.append(x)

    model = flexflow.keras.models.Model(inp_x, x)

    opt = flexflow.keras.optimizers.SGD(learning_rate=0.01)
    model.compile(optimizer=opt, loss='mean_squared_error',
                  metrics=['mean_squared_error'])

    bs = model.ffconfig.batch_size
    x = np.random.randn(bs, 15*8*10).astype(np.float32)
    y = np.random.randn(bs, 15, 10).astype(np.float32)
    parameters = [l.get_weights(model.ffmodel)
                  for l in model.layers if hasattr(l, 'get_weights')]

    model.fit(x=x, y=y, epochs=1)

    tensor_vals = [t.ffhandle.get_tensor(model.ffmodel)
                   for t in intermediate_tensors]

    return bs, parameters, x, y, tensor_vals

def match(i, x, y):
    # x: FlexFlow output, y: keras reference
    y = y.numpy()
    # print max absolute error, max relative error (the 1e-20 guards against
    # division by zero), and the percentage of elements that differ bitwise
    print('Matching', i, np.abs(x-y).max(), np.abs((x-y)/(x+1e-20)).max(),
          100*np.sum(x!=y)/x.size)
    # idx = np.where(np.abs((x - y)/(x+1e-20)) > 1e-2)
    # print(idx)
    # print(x[idx].ravel()[:10])
    # print(y[idx].ravel()[:10])

def keras_task(bs, parameters, inps, y, tensor_vals):
    import tensorflow.keras as keras

    idx = 0
    x = inps  # shape: (b, 15*8*10)
    # after each op: compare against FlexFlow, then continue from FlexFlow's
    # tensor so discrepancies do not compound across layers
    x = keras.layers.Reshape((15, 8, 10))(x)
    match('reshape', tensor_vals[idx], x)
    x = tensor_vals[idx]; idx += 1
    x = keras.backend.sum(x, axis=2)
    match('sum', tensor_vals[idx], x)
    x = tensor_vals[idx]; idx += 1

if __name__ == '__main__':
    import pickle
    import sys
    if len(sys.argv) > 1:
        t = top_level_task()
        pickle.dump(t, open('dump_sum.pkl', 'wb'))
    else:
        t = pickle.load(open('dump_sum.pkl', 'rb'))
        keras_task(*t)

Sample output:

Max absolute error: 9.536743e-07
Max relative error: 8.060183e-05
Elements differing: 59.875%
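A large fraction of bitwise-differing elements combined with a tiny maximum error is exactly what two float32 reductions that merely accumulate in different orders would produce. A minimal numpy sketch of that effect (an editorial illustration, independent of FlexFlow's kernel):

import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(10_000).astype(np.float32)

seq = np.float32(0.0)
for e in v:                       # strict left-to-right float32 accumulation
    seq += e
vec = v.sum()                     # numpy's sum typically uses pairwise summation
ref = v.astype(np.float64).sum()  # higher-precision reference

print(seq == vec)                      # typically False: fp32 addition is not associative
print(abs(seq - ref), abs(vec - ref)) # both errors are tiny, yet the sums differ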

jiazhihao commented 1 year ago

@soumyac1999 I am seeing an input tensor shape of [16, 15, 8, 10] and an output tensor shape of [16, 15, 1, 10] for the ReduceSum operator in FlexFlow. Do you think the tensor shapes are correct?

soumyac1999 commented 1 year ago

The input shape looks correct. But with keepdims=False (https://github.com/flexflow/FlexFlow/blob/master/python/flexflow/keras/backend/backend_functions.py#L39), I would expect the output shape to be [16, 15, 10].

jiazhihao commented 1 year ago

That's right. The reduced dim will be eliminated and the output shape will be [16, 15, 10]
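For reference, numpy's reductions follow the same keepdims semantics, so the expected shapes can be checked directly (a minimal sketch using this repro's shapes):

import numpy as np

x = np.zeros((16, 15, 8, 10), dtype=np.float32)
print(x.sum(axis=2, keepdims=True).shape)   # (16, 15, 1, 10)
print(x.sum(axis=2).shape)                  # (16, 15, 10); keepdims defaults to False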

jiazhihao commented 1 year ago

@soumyac1999 I think the misalignment is caused by floating point errors of different ReduceSum implementations --- the maximum error is 9.536743e-07, which should be acceptable for training as DNN training is generally robust to floating point errors. Have you experienced any accuracy issue caused by this?
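For scale, a back-of-envelope check (editorial, using the standard worst-case bound for sequential summation) shows the observed 9.536743e-07 is well within what float32 rounding alone can produce when reducing 8 elements:

# worst-case bound for sequentially summing n float32 terms:
#   |err| <= (n - 1) * u * sum(|x_i|),  u = 2**-24 (float32 unit roundoff)
n, u = 8, 2.0 ** -24        # the reduced axis in this repro has 8 elements
mean_abs = 0.8              # approx E|x| for standard normal inputs (~0.798)
print((n - 1) * u * n * mean_abs)   # ~2.7e-06, comfortably above 9.536743e-07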

soumyac1999 commented 1 year ago

@jiazhihao, for at least some inputs the differences are much larger, which makes me think it might not be just floating point errors. For example, with the attached inputs the max relative error is 0.0038.

Regarding accuracy issues due to this: the activations are blowing up to large positive/negative values, which is making learning difficult. However, I haven't ruled out other sources of error.
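One caveat when reading these relative errors: match divides by x + 1e-20, which inflates the ratio wherever the FlexFlow value is near zero. A more robust variant (a sketch, not from the original scripts) scales by the larger magnitude instead:

import numpy as np

def rel_error(a, b, eps=1e-12):
    # scale by the larger of the two magnitudes so near-zero entries
    # do not blow up the ratio
    denom = np.maximum(np.maximum(np.abs(a), np.abs(b)), eps)
    return np.abs(a - b) / denom

# usage: rel_error(ff_out, keras_out).max() in place of np.abs((x-y)/(x+1e-20)).max()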

Code:

r"""
To get flexflow outputs:
flexflow_python sum_bug_align.py  -ll:py 1 -ll:gpu 1 -ll:fsize 15000 \
     -ll:zsize 4000 --only-data-parallel -b 16

To check with keras/torch:
python sum_bug_align.py

"""

import numpy as np

def top_level_task():
    from flexflow import keras
    import flexflow.keras.optimizers

    intermediate_tensors = []

    inp_x = keras.layers.Input(shape=(512*16*256,))

    x = inp_x
    x = keras.layers.Reshape((512, 16, 256))(x); intermediate_tensors.append(x)
    x = keras.backend.sum(x, axis=2); intermediate_tensors.append(x)

    model = flexflow.keras.models.Model(inp_x, x)

    opt = flexflow.keras.optimizers.SGD(learning_rate=0.01)
    model.compile(optimizer=opt, loss='mean_squared_error',
                  metrics=['mean_squared_error'])

    bs = model.ffconfig.batch_size
    assert bs == 16
    x = np.load('sum_data.npy').reshape(16, 512*16*256)
    y = np.random.randn(bs, 512, 256).astype(np.float32)
    parameters = [l.get_weights(model.ffmodel)
                  for l in model.layers if hasattr(l, 'get_weights')]

    model.fit(x=x, y=y, epochs=1)

    tensor_vals = [t.ffhandle.get_tensor(model.ffmodel)
                   for t in intermediate_tensors]

    return bs, parameters, x, y, tensor_vals

def match(i, x, y):
    # x: FlexFlow output, y: keras reference
    y = y.numpy()
    # print max absolute error, max relative error (the 1e-20 guards against
    # division by zero), and the percentage of elements that differ bitwise
    print('Matching', i, np.abs(x-y).max(), np.abs((x-y)/(x+1e-20)).max(),
          100*np.sum(x!=y)/x.size)
    # idx = np.where(np.abs((x - y)/(x+1e-20)) > 1e-2)
    # print(idx)
    # print(x[idx].ravel()[:10])
    # print(y[idx].ravel()[:10])

def keras_task(bs, parameters, inps, y, tensor_vals):
    import tensorflow.keras as keras

    idx = 0
    x = inps
    # after each op: compare against FlexFlow, then continue from FlexFlow's
    # tensor so discrepancies do not compound across layers
    x = keras.layers.Reshape((512, 16, 256))(x)
    match('reshape', tensor_vals[idx], x)
    x = tensor_vals[idx]; idx += 1
    x = keras.backend.sum(x, axis=2)
    match('sum', tensor_vals[idx], x)

if __name__ == '__main__':
    import pickle
    import sys
    if len(sys.argv) > 1:
        t = top_level_task()
        pickle.dump(t, open('dump_sum.pkl', 'wb'))
    else:
        t = pickle.load(open('dump_sum.pkl', 'rb'))
        keras_task(*t)

Inputs: https://drive.google.com/file/d/1c59Su2R5NiXdZYZ7HjTBIG0Ut28TpHoI/view?usp=share_link

jiazhihao commented 1 year ago

> the max relative error is 0.0038.

If the maximum relative error is 0.38%, I would say this is likely caused by floating point errors, since our implementation directly uses cuDNN: https://github.com/flexflow/FlexFlow/blob/master/src/ops/reduce.cu#L53-L71. Any configuration error would result in much larger relative errors.

> the activations are blowing up to large positive/negative values which is making learning difficult

Do you want to find a time next week to work on this together?
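One concrete way to separate the two hypotheses (an editorial sketch, assuming sum_data.npy holds float32 data as the repro implies, and a hypothetical ff_sum.npy dumped from the FlexFlow run) is to compare both reductions against a float64 reference: pure rounding stays near the float32 error scale, while a misconfigured reduction would be off by orders of magnitude.

import numpy as np

x = np.load('sum_data.npy').reshape(16, 512, 16, 256)  # the attached repro input
ref = x.astype(np.float64).sum(axis=2)                 # float64 reference result

f32 = x.sum(axis=2)                                    # a float32 reduction, for scale
print('float32 rounding scale:', np.abs(f32 - ref).max())

# dump FlexFlow's sum output (e.g. np.save('ff_sum.npy', tensor_vals[1]) inside
# top_level_task; the filename is illustrative), then compare the same way:
# ff = np.load('ff_sum.npy')
# print('FlexFlow vs reference:', np.abs(ff - ref).max())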

soumyac1999 commented 1 year ago

Got it, thanks. I'll spend some more time checking whether I can find any patterns, and will get back to you if not.

lockshaw commented 1 year ago

@soumyac1999 Did we end up concluding that this was just floating point error?