LechengKong / OneForAll

A foundational graph learning framework that solves cross-domain/cross-task classification problems using one model.
MIT License

Split of Arxiv #9

Closed CurryTang closed 5 months ago

CurryTang commented 5 months ago

Hi! Thanks for sharing the elegant codebase. I have a question regarding the dataset split of ogbn-arxiv. I notice that the split is produced by the ArxivSplitter, which is a 10-fold split where 80% of the data is used as the training set; this differs from the original OGB split. However, it seems the GCN performance in Table 3 is taken from the OGB leaderboard. I wonder whether the split affects the overall performance of OFA.
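
For concreteness, my reading of the re-split is something like the sketch below (illustrative only, not the actual ArxivSplitter code; I'm assuming the two non-training folds are used for validation and test):

import torch

def ten_fold_split(num_nodes, fold=0, seed=0):
    # Shuffle all node indices and cut them into 10 folds;
    # 8 folds (80%) train, and the remaining two folds are assumed
    # here to serve as validation and test.
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_nodes, generator=g)
    folds = torch.chunk(perm, 10)
    valid = folds[fold]
    test = folds[(fold + 1) % 10]
    train = torch.cat(
        [f for i, f in enumerate(folds) if i not in {fold, (fold + 1) % 10}]
    )
    return {"train": train, "valid": valid, "test": test}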

LechengKong commented 5 months ago

Hi @CurryTang, thanks for helping us improve our paper. We will update the code and results accordingly in our revision. Meanwhile, we conducted an experiment using GCN with our split, and the results seem to improve a bit.

CurryTang commented 5 months ago

Hi @LechengKong, I have a follow-up question regarding the evaluation of OFA-ind-st. Some errors are thrown when I try to evaluate link prediction with the following command: python3 run_cdm.py task_names pubmed_link d_multiple 1 d_min_ratio 1 lr 0.001 num_layers 3 num_epochs 2 dropout 0.15.

For Cora with the following setting,

cora_link: &cora_link
  <<: *E2E-link
  eval_set_constructs:
    - stage: train
      split_name: train
      dataset: cora_link
    - stage: valid
      split_name: valid
      dataset: cora_link_eval
    - stage: test
      split_name: test
      dataset: cora_link_eval
    - stage: test
      split_name: train
      dataset: cora_link

It throws ValueError: No samples to concatenate.

For Pubmed, it throws IndexError: list index out of range when conducting the validation.

LechengKong commented 5 months ago

Can you share the file/line number that gives this error?

CurryTang commented 5 months ago

For Pubmed, it should be

Traceback (most recent call last):
  File "/mnt/home/chenzh85/graphlang/PyGFM/MyOFA/run_cdm.py", line 219, in <module>
    main(params)
  File "/mnt/home/chenzh85/graphlang/PyGFM/MyOFA/run_cdm.py", line 167, in main
    val_res, test_res = lightning_fit(
  File "/mnt/home/chenzh85/graphlang/PyGFM/MyOFA/gp/lightning/training.py", line 66, in lightning_fit
    trainer.validate(model, datamodule=data_module, verbose=False)[0]
IndexError: list index out of range

For Cora, it should be

File "/mnt/home/chenzh85/graphlang/PyGFM/MyOFA/gp/lightning/metric.py", line 149, in eval_epoch
    return evlter.compute()
  File "/mnt/home/chenzh85/.local/lib/python3.10/site-packages/torchmetrics/metric.py", line 615, in wrapped_func
    value = _squeeze_if_scalar(compute(*args, **kwargs))
  File "/mnt/home/chenzh85/.local/lib/python3.10/site-packages/torchmetrics/classification/auroc.py", line 122, in compute
    state = (dim_zero_cat(self.preds), dim_zero_cat(self.target)) if self.thresholds is None else self.confmat
  File "/mnt/home/chenzh85/.local/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 34, in dim_zero_cat
    raise ValueError("No samples to concatenate")
ValueError: No samples to concatenate

I haven't changed the source code regarding the data loader and evaluation.

CurryTang commented 5 months ago

One side note: when evaluating link prediction, it throws the following warning

/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:383: ModelCheckpoint(monitor='valid_pubmed_link_eval/auc') could not find the monitored key in the returned metrics: ['train_eval/loss', 'epoch', 'step']. HINT: Did you call log('valid_pubmed_link_eval/auc', value) in the LightningModule?

This warning does not appear when evaluating node classification.
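
If it helps, my understanding is that this warning appears when the key passed to ModelCheckpoint(monitor=...) is never logged during validation. A minimal sketch of the matching Lightning expects (hypothetical module name, placeholder metric value):

from lightning.pytorch import LightningModule
from lightning.pytorch.callbacks import ModelCheckpoint
import torch

# The key passed to monitor= must exactly match a key logged during validation.
checkpoint = ModelCheckpoint(monitor="valid_pubmed_link_eval/auc", mode="max")

class LinkModule(LightningModule):  # hypothetical module for illustration
    def validation_step(self, batch, batch_idx):
        auc = torch.tensor(0.5)  # placeholder for the real AUC computation
        # If this log call is never reached (e.g. the validation dataloader is
        # empty or validation is skipped), the monitored key never appears in
        # the logged metrics and Lightning emits the warning above.
        self.log("valid_pubmed_link_eval/auc", auc)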

LechengKong commented 5 months ago

Hi @CurryTang, I wasn't able to reproduce the error using your command on pubmed_link. I also used your Cora setup and ran the same command (replacing pubmed_link with cora_link), and I couldn't get that error either. However, we did have evaluation issues in earlier versions; can you check that you are on the most recent commit and try again? Thanks.

CurryTang commented 5 months ago

Thanks for your response! I re-ran the code on a freshly cloned version and it works well. I will check my codebase to find out what the problem was.

CurryTang commented 5 months ago

Hi @LechengKong! I have a follow-up question regarding the setting of d_multiple and d_min_ratio. Could you share some insights or rules of thumb for setting these values? I have tried using the ratios given in e2e_all_config.yaml, setting them according to the number of training samples, and also according to the logarithm of the number of training samples. Could you share how you tune these values to reduce the search space? Many thanks!

LechengKong commented 5 months ago

Hi @CurryTang, that's a very good question! When we submitted the paper, we did not use an automated tool to tune these parameters, because there were simply too many parameters across the datasets. However, we do have some insights on tuning them. d_multiple is the initial frequency at which a dataset appears in one epoch. We have a mechanism that monitors the validation score; if the score does not improve for several epochs, we halve the number of samples drawn from that dataset in one epoch. d_min_ratio controls the minimum number of samples in one epoch.
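
Roughly, the logic behaves like the sketch below (illustrative names and a made-up patience value, not the actual code):

def per_epoch_samples(base_size, d_multiple, d_min_ratio,
                      epochs_since_improvement, patience=3):
    # Initial per-epoch sample count set by d_multiple.
    n = base_size * d_multiple
    # Halve the count every `patience` epochs without validation improvement.
    n = n / (2 ** (epochs_since_improvement // patience))
    # d_min_ratio puts a floor on how few samples can be drawn per epoch.
    return max(int(n), int(base_size * d_min_ratio))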

So the rule of thumb for tuning these parameters requires you to look into the training process. Some datasets need more training time/epochs, and by the time those datasets become well-trained, the rest of the datasets might already be severely overfitted. So you want to control d_multiple so that, roughly, all datasets become well-trained at the same time. For example, suppose the arxiv dataset needs 10 epochs of 1000 data points each to be well-trained (best validation score achieved), while Cora needs only 2 epochs of 200 data points each. If you train 1000 arxiv + 200 Cora samples per epoch for 10 epochs, Cora will be overfitted due to the 8 epochs of extra training. You can set d_multiple to alleviate this: set d_multiple for arxiv to 1 and for Cora to 0.2, and after 10 epochs the model sees 1 x 1000 x 10 = 10,000 arxiv samples and 0.2 x 200 x 10 = 400 Cora samples, which is roughly the number of samples required to well-train both datasets.
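
As a quick sanity check of that arithmetic (illustrative numbers from the example above):

joint_epochs = 10
datasets = {
    # epoch_size: samples in one full pass; d_multiple: per-epoch frequency
    "arxiv": {"epoch_size": 1000, "d_multiple": 1.0},
    "cora":  {"epoch_size": 200,  "d_multiple": 0.2},
}
for name, cfg in datasets.items():
    total = cfg["epoch_size"] * cfg["d_multiple"] * joint_epochs
    print(name, int(total))  # arxiv 10000, cora 400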

This becomes more complicated when you train all datasets together, because it is not apparent how many total training samples each dataset requires; after all, both positive and negative transfer happen. What we did was start with d_multiple set to 1 for all datasets and record when each dataset's validation accuracy peaked. We take the number of samples trained at that moment as the optimal number of samples, and adjust d_multiple so that, over the same number of training epochs, each dataset hits its optimal number of samples at roughly the same time. I am pretty sure there are more elegant curriculum learning methods that can do this properly, but the current approach seems to work out.
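
A sketch of that adjustment (illustrative names, not the repo's code): record how many samples each dataset had seen when its validation score peaked in the all-1 run, then rescale d_multiple so every dataset reaches that count over the same number of joint epochs.

def adjust_d_multiple(optimal_samples, epoch_sizes, total_epochs):
    # optimal_samples[name]: samples seen when validation peaked in the all-1 run
    # epoch_sizes[name]: samples one full pass over that dataset yields
    return {name: optimal_samples[name] / (epoch_sizes[name] * total_epochs)
            for name in optimal_samples}

# Example with the numbers from the previous comment:
print(adjust_d_multiple({"arxiv": 10000, "cora": 400},
                        {"arxiv": 1000, "cora": 200},
                        total_epochs=10))
# {'arxiv': 1.0, 'cora': 0.2}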

CurryTang commented 5 months ago

Thanks for your detailed reply!