ChestnutWYN / ACL2021-Novel-Slot-Detection


reproduce your results #1

Open XiJinping01 opened 3 years ago

XiJinping01 commented 3 years ago

I read your code and tried to reproduce the results you reported in the paper. Here are the changes I made.

In `main.py`, in the last 3 lines, I changed `parse_token` to `parse_line` to run the SpanF1 evaluation:

    # Metrics —— Token
    # test_pred_tokens = parse_token(test_y_ns)
    # test_true_tokens = parse_token(test_outputs["true_labels"])
    test_pred_tokens = parse_line(test_y_ns)
    test_true_tokens = parse_line(test_outputs["true_labels"])
    token_metric(test_true_tokens,test_pred_tokens)
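
For context, span-level F1 scores whole (label, start, end) spans as units, while token-level F1 scores each tag independently. Below is a minimal illustrative sketch of that idea, assuming BIO-style tags; `bio_to_spans` and `span_f1` are my own stand-ins, not the repo's `parse_line`/`token_metric` code:

```python
def bio_to_spans(tags):
    """Group a BIO tag sequence into a set of (label, start, end) spans."""
    spans, label, start = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        if tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        ):
            if label is not None:
                spans.append((label, start, i - 1))
                label = None
            if tag != "O":  # B- always opens; a dangling I- also opens a span
                label, start = tag[2:], i
    return set(spans)

def span_f1(true_seqs, pred_seqs):
    """Micro span-level precision/recall/F1 over parallel tag sequences."""
    tp = fp = fn = 0
    for t, p in zip(true_seqs, pred_seqs):
        ts, ps = bio_to_spans(t), bio_to_spans(p)
        tp += len(ts & ps)  # spans matching in label AND boundaries
        fp += len(ps - ts)
        fn += len(ts - ps)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Note that under span scoring a single wrong boundary token invalidates the whole span, which is why span F1 is usually lower than token F1 on the same predictions.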

my script is

```
--mode test \
--dataset SnipsNSD5% \
--threshold 8.0 \
--output_dir ./output_both \
--batch_size 256 \
--cuda 0
```

the result I got

```
{
    "precision-overall": 0.7729279058361942,
    "recall-overall": 0.8814317673378076,
    "f1-overall": 0.8236216357459654,
    "precision-nsd": 0.17073170731707313,
    "recall-nsd": 0.4064516129032255,
    "f1-nsd": 0.2404580152671338,
    "precision-ind": 0.9059880239520958,
    "recall-ind": 0.9265156154317208,
    "f1-ind": 0.9161368452921087
}
```

Are `f1-nsd` and `f1-ind` SpanF1? `f1-nsd` seems much lower than reported.

Thank you!
ChestnutWYN commented 3 years ago

I think you can try adjusting the threshold. Setting it to around 8.3 gave better performance in my experiments.
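
To see why a small threshold shift moves the nsd numbers this way, here is a hedged sketch of distance-threshold novel-slot flagging; the function name `flag_novel`, the `distances` input, and the `d > threshold` direction are my assumptions, not the repo's exact code:

```python
def flag_novel(pred_labels, distances, threshold):
    """Relabel tokens as novel ("NS") when they sit far from all known slots.

    pred_labels: per-token in-domain predictions.
    distances:   per-token minimum distance to any in-domain class.
    Raising the threshold flags fewer tokens as novel, which typically
    trades nsd-recall for nsd-precision (as in the runs in this thread:
    moving 8.0 -> 8.3 raised precision-nsd 0.171 -> 0.262 while
    recall-nsd fell 0.406 -> 0.374).
    """
    return [
        "NS" if d > threshold else label
        for label, d in zip(pred_labels, distances)
    ]
```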

 


XiJinping01 commented 3 years ago

By setting the threshold to 8.3 I got

```
{
    "precision-overall": 0.8368421052631579,
    "recall-overall": 0.889261744966443,
    "f1-overall": 0.8622559652927917,
    "precision-nsd": 0.262443438914027,
    "recall-nsd": 0.3741935483870965,
    "f1-nsd": 0.3085106382978237,
    "precision-ind": 0.912447885646218,
    "recall-ind": 0.9381506429883649,
    "f1-ind": 0.9251207729468099
}
```
That's closer to the results reported. Thanks!