databricks / megablocks


RuntimeError: Triton Error [CUDA]: invalid argument #88

Open noob-ctrl opened 5 months ago

noob-ctrl commented 5 months ago

I followed the setup steps.

When I run moe_46m_8gpu.sh to test, it reports the following error:

RuntimeError: Triton Error [CUDA]: invalid argument

My environment: (screenshot attached)

How can I solve this problem?

tgale96 commented 5 months ago

Hi! Could you share a longer version of the error?

Fwiw, the MoE scripts are a bit out of date. I recommend using dMoE, which should work and will be better anyways :)

noob-ctrl commented 5 months ago

@tgale96 OK, dMoE works. Does megablocks only support data parallelism and expert parallelism? Does that mean it's impossible to train a larger model?

tgale96 commented 5 months ago

We support data, expert and pipeline parallelism in our Megatron integration. We have users training some pretty large models :)

noob-ctrl commented 5 months ago

@tgale96 When I tried to merge the model weights, I encountered the following problem (screenshot attached):

My script is:

python tools/checkpoint_util.py \
        --model-type GPT \
        --load-dir /gpudisk1/openmoe/data \
        --save-dir /gpudisk1/openmoe/data/he_checkpoint \
        --target-tensor-parallel-size 1 \
        --target-pipeline-parallel-size 1
tgale96 commented 5 months ago

Can you elaborate on what you're trying to do? That script is from Megatron-LM, presumably?

noob-ctrl commented 5 months ago

@tgale96 After I run the dmoe_46m_8gpu.sh script, the saved model is in the following format, with a model_optim_rng.pt in each folder:

(screenshot of the checkpoint directory layout)

I want to merge these weights into a single model_optim_rng.pt.

Otherwise, when I want to use the model for inference, it seems that loading it still requires eight GPUs?

tgale96 commented 5 months ago

Thanks! I haven't used that script myself, but the error seems to be that the global megatron arguments object isn't initialized. Could you provide a longer form of the error that you're seeing so that I can see where those args are being accessed?

noob-ctrl commented 5 months ago

@tgale96

(screenshot of the full traceback)

tgale96 commented 5 months ago

Ok, it seems like this script is trying to use APIs that require initialize_megatron to have been called. I'd recommend inserting that at the beginning of the script.
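
E.g., something like this at the top of the tool's main entry point, as a minimal sketch; the exact import path and call signature of initialize_megatron vary across Megatron-LM versions, and the conversion tool may need its own extra_args_provider:

    # Hypothetical snippet for tools/checkpoint_util.py; adjust for your Megatron-LM version.
    from megatron import get_args
    from megatron.initialize import initialize_megatron

    def main():
        # Parses the command line and populates the global arguments object that
        # downstream Megatron utilities read via get_args(); without this call,
        # those APIs fail because the global args were never set up.
        initialize_megatron()
        args = get_args()
        # ... existing checkpoint merging / conversion logic ...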

jambo6 commented 3 months ago

I am also getting this error: RuntimeError: Triton Error [CUDA]: invalid argument

I'm trying to run

_binned_copy[(num_experts, expert_capacity)](
        x,
        out,
        num_experts,
        expert_capacity,
        indices,
        weights,
        bins,
        NUM_COLUMNS=x.shape[1],
        A_TO_B=True,
        TOP_K=top_k,
        SCALE=weights is not None,
    )

with num_experts=1. It's failing for certain values of expert_capacity.

For expert_capacity 32768 it's OK; for 65536 or above it returns the error.

Any idea @tgale96?

jambo6 commented 3 months ago

Full error:

  File "/home/hffxres/triorepo/quant-codes-git/ModelCodes/Futures/RNNModel/main/models/lrumega/common/mega/kernels2.py", line 374, in binned_gather
    _binned_copy[(num_experts, expert_capacity)](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py", line 97, in run
    timings = {config: self._bench(*args, config=config, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py", line 97, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py", line 80, in _bench
    return do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
  File "/usr/local/lib/python3.10/dist-packages/triton/testing.py", line 44, in do_bench
    fn()
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py", line 78, in kernel_call
    self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
  File "<string>", line 44, in _binned_copy
RuntimeError: Triton Error [CUDA]: invalid argument
Process 0 died - Going to kill everyone
tgale96 commented 3 months ago

Can you file a separate bug and share a repro? I'm happy to take a look.

jambo6 commented 2 months ago

Reproducible from binned_gather_test.py if you make expert_capacity exceed 65535.

e.g. just run

    @parameterized.parameters(*_BINNED_GATHER_TESTS)
    def testBinnedGather(self, sl, hs, ne, top_k):
        # NOTE: Capacity factor == 1.
        ec = (sl * top_k) // ne
        ec = 65536  # force expert_capacity past 65535 to trigger the error

It will not fail for 65535, for example.

Planning to look at some point; I'm not very familiar with GPU programming.

jambo6 commented 2 months ago

Learnt some Triton -- the above is easily fixed by reducing the second dimension of the launch grid and computing multiple tokens in one program.
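
For reference, a minimal sketch of that idea using a hypothetical, stripped-down copy kernel (the real _binned_copy also handles indices, bins, weights and TOP_K, and this assumes NUM_COLUMNS is a power of two):

    import triton
    import triton.language as tl

    @triton.jit
    def _copy_looped(x, out, expert_capacity,
                     TOKENS_PER_PROGRAM: tl.constexpr, NUM_COLUMNS: tl.constexpr):
        expert = tl.program_id(0)
        block = tl.program_id(1)
        offs = tl.arange(0, NUM_COLUMNS)
        # Each program copies TOKENS_PER_PROGRAM rows, so the second grid
        # dimension shrinks by that factor and stays under CUDA's 65535 limit.
        for i in range(TOKENS_PER_PROGRAM):
            token = block * TOKENS_PER_PROGRAM + i
            row = expert * expert_capacity + token
            # The (offs < NUM_COLUMNS) term just broadcasts the scalar token
            # bound over the column offsets.
            mask = (offs < NUM_COLUMNS) & (token < expert_capacity)
            vals = tl.load(x + row * NUM_COLUMNS + offs, mask=mask, other=0.0)
            tl.store(out + row * NUM_COLUMNS + offs, vals, mask=mask)

    # Launched with a reduced second grid dimension, e.g.:
    # grid = (num_experts, triton.cdiv(expert_capacity, TOKENS_PER_PROGRAM))
    # _copy_looped[grid](x, out, expert_capacity,
    #                    TOKENS_PER_PROGRAM=8, NUM_COLUMNS=x.shape[1])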

tgale96 commented 2 months ago

Thanks, James. Ya, the error comes from the CUDA maximum grid dimension sizes: the second and third grid dimensions are capped at 65535, while the first (x) dimension can be up to 2**31 - 1. So another fix (which doesn't change the mapping of work to threads/threadblocks) would be to fold the second dim into the first one.

If you want to submit a fix for this, it's a nice issue to get started with megablocks! But I can also take a look when I get some free cycles :)
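
For anyone who hits this before a fix lands, a minimal sketch of that folded-grid variant (again a hypothetical, stripped-down copy kernel rather than the real _binned_copy, with NUM_COLUMNS assumed to be a power of two):

    import triton
    import triton.language as tl

    @triton.jit
    def _copy_folded(x, out, expert_capacity, NUM_COLUMNS: tl.constexpr):
        # One-dimensional grid of num_experts * expert_capacity programs;
        # gridDim.x allows up to 2**31 - 1 blocks, so large capacities are fine.
        pid = tl.program_id(0)
        expert = pid // expert_capacity  # recover both indices without changing
        token = pid % expert_capacity    # the mapping of work to programs
        offs = tl.arange(0, NUM_COLUMNS)
        # Here row == pid, but the real kernel needs expert and token separately
        # for the indices/bins lookups.
        row = expert * expert_capacity + token
        vals = tl.load(x + row * NUM_COLUMNS + offs)
        tl.store(out + row * NUM_COLUMNS + offs, vals)

    # Launched as:
    # _copy_folded[(num_experts * expert_capacity,)](
    #     x, out, expert_capacity, NUM_COLUMNS=x.shape[1])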