THUDM / CogQA

Source code and dataset for ACL 2019 paper "Cognitive Graph for Multi-Hop Reading Comprehension at Scale"
MIT License

Self-cycle in gold-only cognitive graph for comparison question #9

Closed: zycdev closed this issue 5 years ago

zycdev commented 5 years ago

Hi,

I found that the following snippet may create a self-cycle in the gold-only cognitive graph. https://github.com/THUDM/CogQA/blob/217f0f12819c86413d315abf9d818da05c41cb9d/process_train.py#L91-L93

For example, after running process_train.py, I got a JSON object like this (note that each supporting fact carries an edge pointing from the entity back to itself):

{
  "supporting_facts": [
    [
      "Arthur's Magazine",
      0,
      [
        [
          "Arthur's Magazine",
          "Arthur's Magazine",
          0,
          17
        ]
      ]
    ],
    [
      "First for Women",
      0,
      [
        [
          "First for Women",
          "First for Women",
          0,
          15
        ]
      ]
    ]
  ],
  "level": "medium",
  "question": "Which magazine was started first Arthur's Magazine or First for Women?",
  "context": ["..."],
  "answer": "Arthur's Magazine",
  "_id": "5a7a06935542990198eaf050",
  "type": "comparison",
  "Q_edge": [
    [
      "First for Women",
      "First for Women",
      54,
      69
    ],
    [
      "Arthur's Magazine",
      "Arthur's Magazine",
      33,
      50
    ]
  ]
}

However, I think it should look like the one shown in your examples:

{
  "supporting_facts": [
    [
      "Arthur's Magazine",
      0,
      []
    ],
    [
      "First for Women",
      0,
      []
    ]
  ],
  "level": "medium",
  "question": "Which magazine was started first Arthur's Magazine or First for Women?",
  "context": ["..."],
  "answer": "Arthur's Magazine",
  "_id": "5a7a06935542990198eaf050",
  "type": "comparison",
  "Q_edge": [
    [
      "Arthur's Magazine",
      "Arthur's Magazine",
      33,
      50
    ],
    [
      "First for Women",
      "First for Women",
      54,
      69
    ]
  ]
}
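
For reference, here is a minimal post-processing sketch that would strip such self-edges, assuming the structure shown above ([title, sent_idx, [[entity, mention, start, end], ...]]); the file name is just a placeholder for the output of process_train.py:

import json

def strip_self_edges(example):
    # Drop edges whose target entity equals the supporting fact's own
    # title, i.e. edges pointing from a paragraph node back to itself.
    for fact in example["supporting_facts"]:
        title = fact[0]
        fact[2] = [e for e in fact[2] if e[0] != title]
    return example

# Placeholder path; adjust to the actual output of process_train.py.
with open("hotpot_train_v1.1_refined.json") as f:
    data = [strip_self_edges(example) for example in json.load(f)]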

Could you explain what this snippet is for? By the way, my reproduced result on the dev set (with 2 K80 GPUs) is about 10% lower than the result in the paper; do you think this snippet could be a reason for the low result?

Thank you!

Sleepychord commented 5 years ago

Hi, thank you for pointing this out. I made some minor modifications to the pre-processing scripts after generating the examples, but I do not think this is the main reason (maybe I am wrong). In my experiments, learning_rate, batch_size, early-stop strategies (if you add one), and some other parameters can affect the results by up to 10%. Maybe you can try deleting the linear_warm_up in task #2 (I realized that after finishing the paper)?
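
For context, linear warm-up ramps the learning rate from 0 to its base value over the first steps, so removing it just means using the base rate from step 0. A generic sketch with PyTorch's LambdaLR, not our exact code:

import torch
from torch.optim.lr_scheduler import LambdaLR

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model parameters
optimizer = torch.optim.Adam(params, lr=1e-4)
warmup_steps = 1000  # illustrative value

# Linear warm-up: lr grows from 0 to 1e-4 over warmup_steps, then stays flat.
scheduler = LambdaLR(optimizer, lambda step: min(1.0, step / warmup_steps))

for step in range(5):   # stand-in training loop
    optimizer.step()
    scheduler.step()    # advance the warm-up schedule
# Deleting the warm-up simply means skipping the scheduler,
# so the optimizer uses its base lr from the very first step.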

zycdev commented 5 years ago

Hi @Sleepychord , thank you very much for your reply! I retrained BERT for 1 epoch and then BERT & GNN for 1 epoch with the hyperparameters shown in the paper, but I still can't reproduce the paper's result on the dev set.

My training commands:

export CUDA_VISIBLE_DEVICES=0,1,2,3  # 4 K80 (12 GB memory) GPUs
python train.py --batch-size=10 --lr1=1e-4
python train.py --load=True --mode='bundle' --batch-size=10 --lr1=4e-5 --lr2=1e-4  # haven't deleted the linear_warm_up yet

and my evaluation results on the dev set:

{'em': 0.2598244429439568,
 'f1': 0.35564370767865855,
 'prec': 0.37582762612134724,
 'recall': 0.35888658012669966,
 'sp_em': 0.07562457798784605,
 'sp_f1': 0.3665706092242228,
 'sp_prec': 0.4997955049676863,
 'sp_recall': 0.3207705540014783,
 'joint_em': 0.03349088453747468,
 'joint_f1': 0.19135653981707093,
 'joint_prec': 0.2720478977096129,
 'joint_recall': 0.17037639264369026}

Could you provide more details about the hyperparameters and training strategy behind your best experimental result? I am looking forward to your advice.

Thanks!

qibinc commented 5 years ago

Hi @zycdev ,

I'm not sure what problem you encountered, but I successfully got reasonable results with the scripts you provided. I also made an improved version of CogQA here, which is much faster and far less resource-demanding for task 2, with slightly better results. You can try that out.

Hope this helps!

Sleepychord commented 5 years ago

@zycdev , I think tuning the learning_rate in task #2 is effective. Thanks to @qibinc for the improvement; maybe you can follow it.

zycdev commented 5 years ago

@qibinc @Sleepychord Thank you very much for your work, I am glad to try the new version!

qibinc commented 5 years ago

Hi @zycdev ,

Here is an example for running the new version:

For task 1, run:

CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --batch-size 16 --expname test --weight-decay 0.01

For task 2, run:

CUDA_VISIBLE_DEVICES=0 python train.py --load --load-path saved/bert-base-uncased-test.bin --mode '#2' --lr1 2e-5 --gradient-accumulation-steps 8 --expname test --tune

(that's right, now we only need one GPU for task 2)
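
(The --gradient-accumulation-steps 8 flag is presumably what makes a single GPU enough: gradients from several small micro-batches are accumulated before each optimizer step, emulating a larger effective batch in less memory. A generic sketch of the idea, not the exact training loop:)

import torch

accumulation_steps = 8  # matches --gradient-accumulation-steps above
weight = torch.nn.Parameter(torch.randn(10, 10))  # stand-in for model parameters
optimizer = torch.optim.SGD([weight], lr=0.01)

for step in range(32):                      # stand-in training loop
    x = torch.randn(4, 10)                  # one small micro-batch
    loss = (x @ weight).pow(2).mean()
    (loss / accumulation_steps).backward()  # scale so summed grads match one big batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                    # one update per 8 micro-batches
        optimizer.zero_grad()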

For inference, run:

CUDA_VISIBLE_DEVICES=0 python infer.py --data-file data/hotpot_dev_fullwiki_v1_merge.json --model-file saved/bert-base-uncased-test.bin

Evaluation:

python scripts/hotpot_evaluate_v1.py data/hotpot_dev_fullwiki_v1_merge_pred.json data/hotpot_dev_fullwiki_v1_merge.json
zycdev commented 5 years ago

Hi @qibinc , I am grateful for the guide; that is exactly what I wanted to ask for :-D

ditingdapeng commented 3 years ago

CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --batch-size 16 --expname test --weight-decay 0.01

Hi, I see that the data your code uses is "hotpot_train_v1.1_refined.json". Does "refined" mean some change was made to the data?

Sleepychord commented 3 years ago

@ditingdapeng "refined" means pre-processed: each QA pair gets two extra fields representing the nodes of the true cognitive graph extracted by algorithms such as fuzzy matching; nothing else about the data itself is changed.

ditingdapeng commented 3 years ago

What method was used for this refinement? Is it the same as the processing described in the original paper? Thanks a lot for your reply!
