AlibabaResearch / DAMO-ConvAI

DAMO-ConvAI: The official repository which contains the codebase for Alibaba DAMO Conversational AI.
MIT License

gpu memory usage increment of the model graphix-3b #42

Closed kanseaveg closed 1 year ago

kanseaveg commented 1 year ago

Hello, does the graphix-3b model your team released have a GPU memory accumulation bug? I have tried many times and keep hitting this problem. Could you look into it? Thank you.


By the way, to quote a sentence from your paper: not all research centers have A100 GPUs, especially in the age of AIGC. LOL.

huybery commented 1 year ago

It is not quite clear why this is a bug: as the model continues training, it is normal for GPU memory usage to keep growing. For the T5-3B model, an A100 is a hard requirement, as any paper working at this scale makes clear.

kanseaveg commented 1 year ago

I simply changed the pre-trained model in graphix-3b to t5-small instead of t5-large.

This lets me run graphix-3b on a consumer-grade GPU, but there still seems to be a problem of GPU memory accumulation during training.

I have set batch_size to 1, and my GPU has 24 GB of memory.

I found that even with t5-small as the pre-trained model, your code still shows the GPU memory growth problem. I do not know where the growth occurs; it may be that memory keeps accumulating while the graph is being computed.

The parameters I set:

{
    "run_name": "g5-small-db-id",
    "model_name_or_path": "/app/data_all_in/t5-small",
    "dataset": "spider",
    "source_prefix": "",
    "schema_serialization_type": "peteshaw",
    "schema_serialization_randomized": false,
    "schema_serialization_with_db_id": true,
    "schema_serialization_with_db_content": true,
    "normalize_query": true,
    "target_with_db_id": true,
    "output_dir": "/train_db_id",
    "cache_dir": "/transformers_cache",
    "do_train": true,
    "do_eval": true,
    "fp16": false,
    "num_train_epochs": 200,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "label_smoothing_factor": 0.0,
    "learning_rate": 5e-5,
    "adafactor": true,
    "adam_eps": 1e-6,
    "warmup_ratio": 0.0,
    "warmup_steps": 0,
    "seed": 1,
    "report_to": ["wandb"],
    "logging_strategy": "steps",
    "logging_first_step": true,
    "logging_steps": 4,
    "load_best_model_at_end": true,
    "metric_for_best_model": "exact_match",
    "greater_is_better": true,
    "save_total_limit": 50,
    "save_steps": 500,
    "evaluation_strategy": "steps",
    "eval_steps": 500,
    "predict_with_generate": true,
    "num_beams": 1,
    "num_beam_groups": 1,
    "use_picard": false,
    "overwrite_output_dir": true,
    "input_max_length": 1024,
    "generation_max_length": 128
}

The key error in the logs is:

dgl._ffi.base.DGLError: Caught DGLError in replica 1 on device 1.
dgl._ffi.base.DGLError: [22:38:02] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:97: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: out of memory

The full error log:

File "seq2seq/run_seq2seq_train.py", line 290, in <module>
    main()
  File "seq2seq/run_seq2seq_train.py", line 235, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 1400, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 1984, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 2016, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
dgl._ffi.base.DGLError: Caught DGLError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/seq2seq/models/graphix/rgat.py", line 104, in forward
    graph_batch = graph_batch,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/seq2seq/models/modeling_t5.py", line 1671, in forward
    relation_emb=relation_emb # TODO: Jinyang
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/seq2seq/models/modeling_t5.py", line 1103, in forward
    relation_emb=self.relation_emb
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/seq2seq/models/modeling_t5.py", line 771, in forward
    hidden_states = self.rgat_layer(hidden_states, graph_batch, relation_emb)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/seq2seq/models/modeling_t5.py", line 343, in forward
    graph_rep = self.graph_caption(hs_norm, graph_batch, relation_emb)
  File "/app/seq2seq/models/modeling_t5.py", line 360, in graph_caption
    graph['edges'], relation_emb)
  File "/app/seq2seq/models/modeling_t5.py", line 371, in graph_caption_one
    struct_rep, edge_feats = self.rgat(node_feats, edge_feats, graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/seq2seq/models/graphix/rgat_tuning.py", line 38, in forward
    out_x = self.propagate_attention(g)
  File "/app/seq2seq/models/graphix/rgat_tuning.py", line 50, in propagate_attention
    g.update_all(src_sum_edge_mul_edge('v', 'e', 'score', 'v'), fn.sum('v', 'wv'))
  File "/opt/conda/lib/python3.7/site-packages/dgl/heterograph.py", line 4895, in update_all
    ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
  File "/opt/conda/lib/python3.7/site-packages/dgl/core.py", line 369, in message_passing
    ndata = invoke_gspmm(g, fn.copy_e(msg, msg), rfunc, edata=msgdata)
  File "/opt/conda/lib/python3.7/site-packages/dgl/core.py", line 332, in invoke_gspmm
    z = op(graph, x)
  File "/opt/conda/lib/python3.7/site-packages/dgl/ops/spmm.py", line 191, in func
    return gspmm(g, 'copy_rhs', reduce_op, None, x)
  File "/opt/conda/lib/python3.7/site-packages/dgl/ops/spmm.py", line 77, in gspmm
    lhs_data, rhs_data)
  File "/opt/conda/lib/python3.7/site-packages/dgl/backend/pytorch/sparse.py", line 757, in gspmm
    return GSpMM.apply(gidx, op, reduce_op, lhs_data, rhs_data)
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/amp/autocast_mode.py", line 219, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/dgl/backend/pytorch/sparse.py", line 126, in forward
    out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
  File "/opt/conda/lib/python3.7/site-packages/dgl/sparse.py", line 233, in _gspmm
    arg_e_nd)
  File "dgl/_ffi/_cython/./function.pxi", line 293, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 239, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [22:38:02] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:97: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: out of memory
Stack trace:
  [bt] (0) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f164f2d6eaf]
  [bt] (1) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::AllocDataSpace(DLContext, unsigned long, unsigned long, DLDataType)+0x108) [0x7f164f7ad528]
  [bt] (2) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dgl::runtime::WorkspacePool::AllocWorkspace(DLContext, unsigned long)+0x154) [0x7f164f6338f4]
  [bt] (3) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(std::pair<dgl::runtime::NDArray, dgl::runtime::NDArray> dgl::aten::impl::Sort<(DLDeviceType)2, int>(dgl::runtime::NDArray, int)+0x137) [0x7f164f7dd877]
  [bt] (4) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dgl::aten::Sort(dgl::runtime::NDArray, int)+0x3fa) [0x7f164f2bb0aa]
  [bt] (5) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(void dgl::aten::impl::COOSort_<(DLDeviceType)2, int>(dgl::aten::COOMatrix*, bool)+0x5b) [0x7f164f7dfa6b]
  [bt] (6) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dgl::aten::COOSort_(dgl::aten::COOMatrix*, bool)+0x374) [0x7f164f2b6454]
  [bt] (7) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dgl::aten::COOSort(dgl::aten::COOMatrix, bool)+0x48c) [0x7f164f3007bc]
  [bt] (8) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dgl::aten::CSRMatrix dgl::aten::impl::COOToCSR<(DLDeviceType)2, int>(dgl::aten::COOMatrix)+0xcc) [0x7f164f7ddc2c]

What is wrong with my reproduction?
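
One thing I notice in the traceback is that the error is raised inside a torch.nn.DataParallel replica ("replica 1 on device 1"), so every visible GPU holds its own copy of the model plus its own graph batch. As a sanity check I plan to pin the run to a single GPU; this is just a sketch, and it has to run before CUDA is initialized (e.g. at the very top of run_seq2seq_train.py):

import os

# Expose only GPU 0 to this process so the Trainer does not wrap the model
# in DataParallel; this must happen before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"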

Later, I modified a piece of the forward code in rgat.py, but I found that the memory growth problem is still there:


    def forward(self, input_ids, attention_mask, labels, **kwargs):
        graph_batch = self.graph_factory(kwargs)
        # self.relation_init_prompt(self.rel2id)
        loss = self.pretrain_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            use_cache=False,
            labels=labels,
            graph_batch = graph_batch,
            # relation_embedding = self.relation_embedding
        ).loss

        if torch.isnan(loss).sum() != 0: pdb.set_trace()

        # drop the graph batch and release cached GPU memory after each step
        del graph_batch
        torch.cuda.empty_cache()
        return {'loss': loss}
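
To narrow down where the growth actually happens, I am also thinking of logging allocated GPU memory around the graph construction and the forward pass. This is only a rough sketch of my own (log_mem and debug_step are hypothetical helpers, not part of your repo; the call shape follows the forward above):

import torch

def log_mem(tag):
    # print currently allocated and peak allocated GPU memory in MiB
    alloc = torch.cuda.memory_allocated() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"{tag}: allocated={alloc:.1f} MiB, peak={peak:.1f} MiB")

def debug_step(model, inputs):
    # mirror the forward above, but with memory checkpoints around each stage
    log_mem("before graph_factory")
    graph_batch = model.graph_factory(inputs)
    log_mem("after graph_factory")
    loss = model.pretrain_model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        labels=inputs["labels"],
        use_cache=False,
        graph_batch=graph_batch,
    ).loss
    log_mem("after forward")
    return loss

If the "after graph_factory" numbers keep climbing across steps while the forward pass adds a roughly constant amount on top, that would point at the graph construction rather than the T5 part.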

How should I solve this problem? Is there a bug in your code, or is something wrong with how I am running it?

Under normal circumstances, four consumer-grade GPUs should be enough to reproduce your code with t5-small as the pre-trained model.

In addition, I referred to the PICARD code and tried running it, also with t5-small as the pre-trained model: it does not have the GPU memory growth problem, and its GPU memory usage stays consistently low.


I have tried several configurations; except for PICARD, running your code basically always shows the GPU memory growth problem.

Not all research centers have an A100, especially given how scarce GPU memory resources are. Can you help me figure out where the problem is? Thank you!

huybery commented 1 year ago

From the material you provided, it seems the extra memory comes from Graphix building the graphs; some large tables in Spider can make the graphs quite large. As an academic project, we do not have the time and energy to adapt the code to smaller resource budgets. If you can help us adapt it, we would appreciate it if you could submit a PR. Thanks for your understanding.
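
If someone does attempt that adaptation, one possible direction (only a rough, untested sketch; rgat_with_cpu_graph is a made-up helper, and it assumes the batched graph is a DGL graph with a .to(device) method, which may not exactly match our graph_batch structure) would be to keep the graph on the CPU and move it to the GPU only around the RGAT call in modeling_t5.py:

import torch

def rgat_with_cpu_graph(rgat_layer, hidden_states, graph_batch_cpu, relation_emb):
    # move the batched graph to the GPU only for this layer call
    graph_batch = graph_batch_cpu.to(hidden_states.device)
    out = rgat_layer(hidden_states, graph_batch, relation_emb)
    # drop the GPU copy immediately and let the caching allocator release it
    del graph_batch
    torch.cuda.empty_cache()
    return out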