HarryShomer / LPFormer

Implementation of the KDD'24 paper "LPFormer: An Adaptive Graph Transformer for Link Prediction"
https://arxiv.org/abs/2310.11009

multiple gpus #1

Open boxorange opened 2 months ago

boxorange commented 2 months ago

I ran into an OOM error while training on a large dataset. I have 4 GPUs, and I wonder if there's a way to use multiple GPUs?

HarryShomer commented 2 months ago

Hi,

Currently, multiple GPUs aren't supported. To be honest, I've never used multiple GPUs with torch, so I'm unsure how easy it would be to implement. I'd recommend taking a look here if you want to give it a try yourself.

Otherwise, to remediate the OOM error, I'd recommend decreasing either the hidden dimension (--dim argument), the batch size (--batch-size argument), or both.

Another option is to increase the threshold used when approximating the PPR matrix. This can be done by increasing the value of the --eps argument. A larger --eps makes the resulting PPR matrix sparser, which reduces GPU memory usage.
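For example, a command along these lines (just a sketch; the values are arbitrary and any dataset-specific arguments you're already passing are omitted):

  python src/run.py --dim 128 --batch-size 8192 --eps 5e-5

The right values depend on your GPU memory, so it may take a bit of trial and error.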

Regards, Harry

boxorange commented 2 months ago

Hi,

Thanks for the prompt response. Your suggestions were very helpful in running the model. I realized that the OOM issue was actually caused by the test batch size, which was set to 32,768 by default. I reduced it to around 5,000, and the issue was resolved. To further speed up training and reduce memory usage, I tried using the PPR matrix with a larger eps as you suggested, such as 5e-5 or 2.5e-3, but I encountered the following error. Do you have any ideas on how to resolve it?

Traceback (most recent call last):
  File "/home/ac.gpark/LPFormer/src/run.py", line 325, in <module>
    main()
  File "/home/ac.gpark/LPFormer/src/run.py", line 321, in main
    run_model(args)
  File "/home/ac.gpark/LPFormer/src/run.py", line 240, in run_model
    train_data(cmd_args, args, data, device, verbose = not cmd_args.non_verbose)
  File "/home/ac.gpark/LPFormer/src/train/train_model.py", line 190, in train_data
    best_valid = train_loop(args, train_args, data, device, loggers, seed, run_save_name, verbose)
  File "/home/ac.gpark/LPFormer/src/train/train_model.py", line 111, in train_loop
    loss = train_epoch(model, score_func, data, optimizer, args, device)
  File "/home/ac.gpark/LPFormer/src/train/train_model.py", line 59, in train_epoch
    h = model(edges, adj_prop=masked_adjt, adj_mask=masked_adj)
  File "/home/ac.gpark/anaconda3/envs/ness/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ac.gpark/anaconda3/envs/ness/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ac.gpark/LPFormer/src/models/link_transformer.py", line 120, in forward
    pairwise_feats, att_weights = self.calc_pairwise(batch, X_node, test_set, adj_mask=adj_mask, return_weights=return_weights)
  File "/home/ac.gpark/LPFormer/src/models/link_transformer.py", line 175, in calc_pairwise
    cn_info, onehop_info, non1hop_info = self.compute_node_mask(batch, test_set, adj_mask)
  File "/home/ac.gpark/LPFormer/src/models/link_transformer.py", line 265, in compute_node_mask
    pair_ix, node_type, src_ppr, tgt_ppr = self.get_ppr_vals(batch, pair_adj, test_set)
  File "/home/ac.gpark/LPFormer/src/models/link_transformer.py", line 329, in get_ppr_vals
    pair_diff_adj = pair_diff_adj[src_ppr != 0]
IndexError: The shape of the mask [8763338] at index 0 does not match the shape of the indexed tensor [8758217] at index 0

FYI, here are my data stats.

Many thanks in advance.

HarryShomer commented 2 months ago

Hmm that's strange.

Is the dataset publicly available? If so, I can take a closer look. Otherwise, I can try to replicate this error on other datasets when I have the chance.

Also, does it run correctly for lower values of epsilon? If so, I'd just use that. Since the number of nodes in your graph is relatively low, storing the PPR matrix shouldn't be a big issue and should have a marginal effect on runtime. I'd instead increase the PPR thresholds (--thresh-1hop and --thresh-non1hop), which should help. Given the density of your graph, setting both to 1e-2 should be ok.
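For example, something along the lines of (sketch only; keep whatever eps value works for you and leave your other arguments as they are):

  python src/run.py --thresh-1hop 1e-2 --thresh-non1hop 1e-2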

boxorange commented 2 months ago

Yes, it runs fine with epsilon set to 1e-7. Thanks for the suggestions. When I set those arguments to 1e-2, it reduced the training time a little bit.

I'm using the dataset called SEMNET: https://github.com/MarioKrenn6240/SEMNET. I'm willing to share the SEMNET data that I am using; just let me know how best to share it.

Could the error be caused by the fact that I created negative edges with HeaRT's heart_negatives/create_heart_negatives.py? It looks like you used the datasets from the HeaRT package, and I couldn't find code to generate negative edges in your repository. So, I used heart_negatives/create_heart_negatives.py to create test_neg.txt and valid_neg.txt, and I wonder if this might be causing the error. Do you know how the HeaRT authors produced the test_neg.txt and valid_neg.txt files? Or do you have better suggestions for generating negative edges for my data to run on LPFormer? I'm using replicate_existing.sh, not replicate_heart.sh.

HarryShomer commented 2 months ago

I'm willing to share the SEMNET data that I am using; just let me know how best to share it.

You can zip the data and share it via Google Drive, OneDrive, or whatever works for you.

Could the error be caused by the fact that I created negative edges with HeaRT

For the HeaRT evaluation, I used the same code to generate the negatives, which can be found here: https://github.com/Juanhui28/HeaRT. For the other original/old evaluation, I used the negatives supplied by the dataset.

Also, tbh, I don't see how this could be causing the issue: based on the traceback you shared, the error is thrown during training on the positive samples.

Or do you have better suggestions for generating negative edges for my data to run on LPFormer?

This is kind of up to you. The original/existing eval setting uses randomly sampled negatives for evaluation; if that fits your task, then it's fine.

I'm using replicate_existing.sh, not replicate_heart.sh

Fwiw, if you're using negatives generated by HeaRT, you should add the --heart flag to your command (see replicate_heart.sh).
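For example (sketch only; the rest of your arguments stay as they are):

  python src/run.py --heart --thresh-1hop 1e-2 --thresh-non1hop 1e-2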

As an aside, I'm going to be a little busy over the next 2-3 weeks. I'm finishing a paper for a deadline and then moving for a summer internship, so I probably won't be able to spend time on this until mid-June.

HarryShomer commented 1 week ago

Hi @boxorange,

Sorry I completely forgot about this. I've been busy with an internship.

I haven't been able to replicate the error on any of the existing datasets used in the paper.

If you can share your dataset (and any modifications to the code needed to load it), I'd be happy to take a look.

Regards, Harry