noob-ctrl opened 5 months ago
Hi! Could you share a longer version of the error?
Fwiw, the MoE scripts are a bit out of date. I recommend using dMoE, which should work and will be better anyways :)
@tgale96 OK, dMoE works. Does megablocks only support data parallelism and expert parallelism? Does that mean it's impossible to train larger models?
We support data, expert and pipeline parallelism in our Megatron integration. We have users training some pretty large models :)
@tgale96 When I tried to merge the model weights, I encountered the following problem:
My script is:
python tools/checkpoint_util.py \
    --model-type GPT \
    --load-dir /gpudisk1/openmoe/data \
    --save-dir /gpudisk1/openmoe/data/he_checkpoint \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1
Can you elaborate on what you're trying to do? That script is from Megatron-LM, presumably?
@tgale96 After I run the dmoe_46m_8gpu.sh script, the saved model is in the following format, with a model_optim_rng.pt in each folder:
I want to merge these weights into a single model_optim_rng.pt.
Otherwise, when I want to use the model for inference, it seems that loading it still requires eight GPUs?
Thanks! I haven't used that script myself, but the error seems to be that the global megatron arguments object isn't initialized. Could you provide a longer form of the error that you're seeing so that I can see where those args are being accessed?
@tgale96
Ok, it seems like this script is trying to use APIs that require initialize_megatron to have been called. I'd recommend inserting that at the beginning of the script.
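For reference, a minimal sketch of that insertion, assuming a recent Megatron-LM layout where initialize_megatron lives in megatron.initialize (the import path may differ in your version):

```python
# Hypothetical patch at the top of tools/checkpoint_util.py.
from megatron.initialize import initialize_megatron

# Populate the global megatron args object before any APIs that read it;
# ignore_unknown_args lets the script keep its own command-line flags.
initialize_megatron(ignore_unknown_args=True)
```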
I am also getting this error
RuntimeError: Triton Error [CUDA]: invalid argument
I'm trying to run
_binned_copy[(num_experts, expert_capacity)](
    x,
    out,
    num_experts,
    expert_capacity,
    indices,
    weights,
    bins,
    NUM_COLUMNS=x.shape[1],
    A_TO_B=True,
    TOP_K=top_k,
    SCALE=weights is not None,
)
with num_experts=1. It's failing for certain values of expert_capacity: it's fine for 32768, but for 65536 or above it returns the error.
Any idea @tgale96?
Full error:
File "/home/hffxres/triorepo/quant-codes-git/ModelCodes/Futures/RNNModel/main/models/lrumega/common/mega/kernels2.py", line 374, in binned_gather
_binned_copy[(num_experts, expert_capacity)](
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py", line 97, in run
timings = {config: self._bench(*args, config=config, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py", line 97, in <dictcomp>
timings = {config: self._bench(*args, config=config, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py", line 80, in _bench
return do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
File "/usr/local/lib/python3.10/dist-packages/triton/testing.py", line 44, in do_bench
fn()
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py", line 78, in kernel_call
self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
File "<string>", line 44, in _binned_copy
RuntimeError: Triton Error [CUDA]: invalid argument
Process 0 died - Going to kill everyone
Can you file a separate bug and share a repro? I'm happy to take a look.
Reproducible from binned_gather_test.py if you make expert_capacity exceed 65535, e.g. just run
@parameterized.parameters(*_BINNED_GATHER_TESTS)
def testBinnedGather(self, sl, hs, ne, top_k):
    # NOTE: Capacity factor == 1.
    ec = (sl * top_k) // ne
    ec = 65536  # override so expert_capacity exceeds the 65535 limit
It will not fail for 65535, for example.
Planning to look at some point; I'm not very familiar with GPU programming.
Learnt some Triton -- the above is easily fixed by reducing the size of the launch grid in the second dimension and computing multiple tokens in one program.
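To make that concrete, a hedged launch-side sketch: cap the second grid dimension at the CUDA limit and have each program copy several tokens. MAX_GRID_Y and TOKENS_PER_PROGRAM are hypothetical names, and the kernel body would need a matching loop over tokens (not shown):

```python
import math

MAX_GRID_Y = 65535  # CUDA limit on the 2nd and 3rd grid dimensions

# Each program now handles a contiguous chunk of tokens instead of one.
tokens_per_program = math.ceil(expert_capacity / MAX_GRID_Y)
grid_y = math.ceil(expert_capacity / tokens_per_program)
_binned_copy[(num_experts, grid_y)](
    x, out, num_experts, expert_capacity, indices, weights, bins,
    NUM_COLUMNS=x.shape[1], A_TO_B=True, TOP_K=top_k,
    SCALE=weights is not None,
    TOKENS_PER_PROGRAM=tokens_per_program,  # hypothetical new constexpr
)
```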
Thanks, James. Ya, the error is from CUDA maximum grid dimension sizes. The first (x) dimension can be up to 2**31 - 1, so another fix (which doesn't change the mapping of work to threads/threadblocks) would be to fold the second dim into the first one.
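A minimal sketch of that folding as a standalone Triton kernel (not the real megablocks kernel; NUM_COLUMNS must be a power of two for tl.arange):

```python
import triton
import triton.language as tl

@triton.jit
def _copy_1d_grid(x, out, expert_capacity, NUM_COLUMNS: tl.constexpr):
    pid = tl.program_id(0)            # x dim allows up to 2**31 - 1 programs
    expert = pid // expert_capacity   # was tl.program_id(0) in the 2D grid
    token = pid % expert_capacity     # was tl.program_id(1)
    offsets = tl.arange(0, NUM_COLUMNS)
    base = (expert * expert_capacity + token) * NUM_COLUMNS
    tl.store(out + base + offsets, tl.load(x + base + offsets))

# Launch with a single grid dimension:
# _copy_1d_grid[(num_experts * expert_capacity,)](
#     x, out, expert_capacity, NUM_COLUMNS=x.shape[1])
```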
If you want to submit a fix for this, it's a nice issue to get started with megablocks! But I can also take a look when I get some free cycles :)
I followed the next step:
When I run the moe_46m_8gpu.sh script to test, it reported the following error. My environment:
![image](https://github.com/stanford-futuredata/megablocks/assets/63763578/7c9f5c5f-e698-4e83-843c-c06dab7bb9a7)
How can I solve this problem?