XuezheMax / NeuroNLP2

Deep neural models for core NLP tasks (Pytorch version)
GNU General Public License v3.0

Trying to achieve the same results as the "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" paper #13

Closed: ayrtondenner closed this issue 6 years ago

ayrtondenner commented 6 years ago

Hello

I am trying to reproduce the results of the "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" paper, but my results don't match what the paper reports after 50 epochs. I've also read issue XuezheMax/NeuroNLP2#8.

Because I'm using Windows, I took the hyper-parameters from the .sh script and wrote them directly into the NERCRF.py code.

[image: https://user-images.githubusercontent.com/13112588/39445920-03eb5880-4c93-11e8-90e2-cb73ad5f355e.png]

After 50 epochs, using the 100-dimensional GloVe embeddings and the CoNLL-2003 corpus (which I downloaded from this repository: https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003), I've only managed an 84.76% F1 score on my dev data and an 80.32% F1 score on my test data. Are the hyper-parameters right? Did you use eng.testa for dev data and eng.testb for test data, or did you use different files? Should I pay attention to anything else?

Thanks.

XuezheMax commented 6 years ago

Hi,

The hyper-parameters seem reasonable, but the results are surprisingly low. I used the standard train/dev/test split of CoNLL-2003. I am not familiar with PyTorch on Windows, but I guess you need to use PyTorch 0.4, right? In that case, please switch to the 'pytorch4.0' branch.

ayrtondenner commented 6 years ago

Hello. I'm actually using PyTorch 0.3.1.post2. Should I update it to 0.4? Could a different version produce a different performance outcome as well? Seems weird...

XuezheMax commented 6 years ago

No, I just wanted to make sure that you used the correct version, because there are some major changes from PyTorch 0.3 to 0.4 which may cause some weird issues.
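
If it helps, a quick way to confirm which build is actually active in an environment (these are standard PyTorch attributes):

    import torch
    print(torch.__version__)          # e.g. '0.3.1.post2' or '0.4.0'
    print(torch.cuda.is_available())  # confirms whether the CUDA build is usable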

pvcastro commented 6 years ago

Hi @XuezheMax, I'm also running the run_ner_crf script and I'm having trouble reaching the results reported in your paper. I'm getting results similar to the ones @ayrtondenner got. I'm using your pytorch0.4 branch with the following settings:

What could be wrong?

Thanks!

XuezheMax commented 6 years ago

Hi, I am not sure what the problem is. One possible cause is the tagging schema. If you are using the original CoNLL-2003 data, you need to convert it to the standard BIO schema, or to the more advanced BIOES schema (a marginal improvement).
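
For illustration, with the first tokens of eng.train: the original IOB1 annotation tags an entity-initial token with I- (B- appears only when two same-type entities are adjacent), whereas BIO always opens an entity with B-:

    # IOB1 (original):  EU/I-ORG  rejects/O  German/I-MISC
    # BIO (converted):  EU/B-ORG  rejects/O  German/B-MISC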

pvcastro commented 6 years ago

I see. I noticed that the annotation scheme is really messed up. The LSTM-CRF from Lample fixes this in memory, while the training file stays the same, which is why it doesn't matter for his code. Do you know where I could get this CoNLL-2003 corpus annotated the proper way, in either the BIO or BIOES scheme?

XuezheMax commented 6 years ago

Here is the code I used to convert it to BIO:

def transform(ifile, ofile):
    """Convert CoNLL-2003 IOB1 tags to the standard BIO (IOB2) schema."""
    with open(ifile, 'r') as reader, open(ofile, 'w') as writer:
        prev = 'O'
        for line in reader:
            line = line.strip()
            if len(line) == 0:
                # Sentence boundary: reset the previous label and copy the blank line.
                prev = 'O'
                writer.write('\n')
                continue

            tokens = line.split()
            label = tokens[-1]
            # An I- label opens a new entity when it follows 'O' or an
            # entity of a different type, so rewrite it as B-.
            if label != 'O' and label != prev:
                if prev == 'O' or label[2:] != prev[2:]:
                    label = 'B-' + label[2:]
            writer.write(" ".join(tokens[:-1]) + " " + label)
            writer.write('\n')
            prev = tokens[-1]
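
If you want the BIOES variant instead, a minimal sketch along the same lines (my untested adaptation, assuming BIO input such as the output of transform above, not code from this repo):

    def bio_to_bioes(tags):
        """Convert one sentence's BIO tags to BIOES: single-token spans
        become S-, and the last token of a multi-token span becomes E-."""
        bioes = []
        for i, tag in enumerate(tags):
            if tag == 'O':
                bioes.append(tag)
                continue
            # Does the next tag continue the current entity?
            nxt = tags[i + 1] if i + 1 < len(tags) else 'O'
            continues = (nxt == 'I-' + tag[2:])
            if tag.startswith('B-'):
                bioes.append(tag if continues else 'S-' + tag[2:])
            else:  # I- tag
                bioes.append(tag if continues else 'E-' + tag[2:])
        return bioes

For example, ['B-PER', 'I-PER', 'O', 'B-LOC'] becomes ['B-PER', 'E-PER', 'O', 'S-LOC'].
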
pvcastro commented 6 years ago

Great, thanks @XuezheMax !

pvcastro commented 6 years ago

Strangely, it doesn't seem to have made any difference :thinking: I don't suppose the starting numbers are relevant for determining where each document or sentence finishes, are they? Can you confirm that the exact parameters in run_ner_crf.sh should be enough to reach a 90% F1 score on the test set? Some of them differ from what you report in your paper, but maybe the difference doesn't matter.

XuezheMax commented 6 years ago

Yes, I am sure that using the exact parameters in run_ner_crf.sh should give around a 91% F1 score on the test set.

XuezheMax commented 6 years ago

Would you please paste your log here so I can check for possible issues? Again, make sure to remove the alphabets folder in data/ so that new vocabulary files are created.
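
For instance (the path below is taken from the "Creating Alphabets: data/alphabets/ner_crf/" line in the logs; adjust if yours differs):

    # Delete the cached vocabulary files so they are rebuilt from the new data.
    import os
    import shutil

    alphabet_dir = 'data/alphabets/ner_crf'
    if os.path.isdir(alphabet_dir):
        shutil.rmtree(alphabet_dir)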

pvcastro commented 6 years ago

Yes, I did remove the alphabets folder :+1:

I'm running a new training now with the latest adjustments. I also fixed another place in the code that was referring to the word token with the wrong index (after removing the starting numbers). Here's the log so far:

/home/pedro/virtualenv/pytorch/bin/python /home/pedro/pycharm-community-2017.3.2/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 37531 --file /home/pedro/repositorios/NeuroNLP2/examples/NERCRF.py --cuda --mode LSTM --num_epochs 200 --batch_size 16 --hidden_size 256 --num_layers 1 --char_dim 30 --num_filters 30 --tag_space 128 --learning_rate 0.01 --decay_rate 0.05 --schedule 1 --gamma 0.0 --dropout std --p_in 0.33 --p_rnn 0.33 0.5 --p_out 0.5 --unk_replace 0.0 --bigram --embedding glove --embedding_dict /media/discoD/embeddings/English/Glove/glove.6B/glove.6B.100d.gz --train data/conll2003/english/eng.train.bios --dev data/conll2003/english/eng.testa.bios --test data/conll2003/english/eng.testb.bios
Connected to pydev debugger (build 181.4203.547)
pydev debugger: process 4141 is connecting

loading embedding: glove from /media/discoD/embeddings/English/Glove/glove.6B/glove.6B.100d.gz
2018-06-06 15:49:10,504 - NERCRF - INFO - Creating Alphabets
2018-06-06 15:49:10,504 - Create Alphabets - INFO - Creating Alphabets: data/alphabets/ner_crf/
2018-06-06 15:49:11,628 - Create Alphabets - INFO - Total Vocabulary Size: 20102
2018-06-06 15:49:11,628 - Create Alphabets - INFO - Total Singleton Size: 9178
2018-06-06 15:49:11,630 - Create Alphabets - INFO - Total Vocabulary Size (w.o rare words): 19046
2018-06-06 15:49:12,295 - Create Alphabets - INFO - Word Alphabet Size (Singleton): 23598 (8122)
2018-06-06 15:49:12,296 - Create Alphabets - INFO - Character Alphabet Size: 86
2018-06-06 15:49:12,296 - Create Alphabets - INFO - POS Alphabet Size: 47
2018-06-06 15:49:12,296 - Create Alphabets - INFO - Chunk Alphabet Size: 19
2018-06-06 15:49:12,296 - Create Alphabets - INFO - NER Alphabet Size: 10
2018-06-06 15:49:12,296 - NERCRF - INFO - Word Alphabet Size: 23598
2018-06-06 15:49:12,296 - NERCRF - INFO - Character Alphabet Size: 86
2018-06-06 15:49:12,296 - NERCRF - INFO - POS Alphabet Size: 47
2018-06-06 15:49:12,296 - NERCRF - INFO - Chunk Alphabet Size: 19
2018-06-06 15:49:12,296 - NERCRF - INFO - NER Alphabet Size: 10
2018-06-06 15:49:12,296 - NERCRF - INFO - Reading Data
Reading data from data/conll2003/english/eng.train.bios
reading data: 10000
Total number of data: 14987
Reading data from data/conll2003/english/eng.testa.bios
Total number of data: 3466
Reading data from data/conll2003/english/eng.testb.bios
Total number of data: 3684
oov: 339
2018-06-06 15:53:01,370 - NERCRF - INFO - constructing network...
/home/pedro/virtualenv/pytorch/lib/python3.6/site-packages/torch/nn/modules/rnn.py:38: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.5 and num_layers=1 "num_layers={}".format(dropout, num_layers))
2018-06-06 15:53:01,387 - NERCRF - INFO - Network: LSTM, num_layer=1, hidden=256, filter=30, tag_space=128, crf=bigram
2018-06-06 15:53:01,387 - NERCRF - INFO - training: l2: 0.000000, (#training data: 14987, batch: 16, unk replace: 0.00)
2018-06-06 15:53:01,387 - NERCRF - INFO - dropout(in, out, rnn): (0.33, 0.50, (0.33, 0.5))
Epoch 1 (LSTM(std), learning rate=0.0100, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 11.1595, time left (estimated): 15.22s train: 200/937 loss: 7.2109, time left (estimated): 12.09s train: 300/937 loss: 5.8057, time left (estimated): 10.10s train: 400/937 loss: 5.0669, time left (estimated): 8.42s train: 500/937 loss: 4.5988, time left (estimated): 6.86s train: 600/937 loss: 4.2958, time left (estimated): 5.30s train: 700/937 loss: 4.0640, time left (estimated): 3.72s train: 800/937 loss: 3.8781, time left (estimated): 2.16s train: 900/937 loss: 3.7093, time left (estimated): 0.59s train: 937 loss: 3.6504, time: 14.58s
dev acc: 97.02%, precision: 79.24%, recall: 75.75%, F1: 77.45% best dev acc: 97.02%, precision: 79.24%, recall: 75.75%, F1: 77.45% (epoch: 1) best test acc: 96.35%, precision: 74.47%, recall: 71.87%, F1: 73.15% (epoch: 1)
Epoch 2 (LSTM(std), learning rate=0.0095, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 2.3227, time left (estimated): 12.82s train: 200/937 loss: 2.4067, time left (estimated): 11.64s train: 300/937 loss: 2.4593, time left (estimated): 10.47s train: 400/937 loss: 2.4737, time left (estimated): 8.83s train: 500/937 loss: 2.4559, time left (estimated): 7.14s train: 600/937 loss: 2.4435, time left (estimated): 5.52s train: 700/937 loss: 2.4438, time left (estimated): 3.89s train: 800/937 loss: 2.4204, time left (estimated): 2.26s train: 900/937 loss: 2.3705, time left (estimated): 0.61s train: 937 loss: 2.3726, time: 15.26s
dev acc: 97.55%, precision: 80.98%, recall: 79.55%, F1: 80.26% best dev acc: 97.55%, precision: 80.98%, recall: 79.55%, F1: 80.26% (epoch: 2) best test acc: 96.58%, precision: 75.61%, recall: 74.53%, F1: 75.07% (epoch: 2)
Epoch 3 (LSTM(std), learning rate=0.0091, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 2.1304, time left (estimated): 13.11s train: 200/937 loss: 2.1364, time left (estimated): 11.71s train: 300/937 loss: 2.2066, time left (estimated): 10.44s train: 400/937 loss: 2.1977, time left (estimated): 8.77s train: 500/937 loss: 2.1580, time left (estimated): 7.15s train: 600/937 loss: 2.1675, time left (estimated): 5.62s train: 700/937 loss: 2.1589, time left (estimated): 3.94s train: 800/937 loss: 2.1703, time left (estimated): 2.29s train: 900/937 loss: 2.1547, time left (estimated): 0.62s train: 937 loss: 2.1668, time: 15.58s
dev acc: 97.69%, precision: 81.49%, recall: 79.97%, F1: 80.72% best dev acc: 97.69%, precision: 81.49%, recall: 79.97%, F1: 80.72% (epoch: 3) best test acc: 96.99%, precision: 77.07%, recall: 75.80%, F1: 76.43% (epoch: 3)
Epoch 4 (LSTM(std), learning rate=0.0087, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.8794, time left (estimated): 12.88s train: 200/937 loss: 1.9610, time left (estimated): 11.79s train: 300/937 loss: 1.9138, time left (estimated): 9.95s train: 400/937 loss: 1.8985, time left (estimated): 8.52s train: 500/937 loss: 1.9170, time left (estimated): 7.04s train: 600/937 loss: 1.8895, time left (estimated): 5.45s train: 700/937 loss: 1.8744, time left (estimated): 3.83s train: 800/937 loss: 1.8929, time left (estimated): 2.23s train: 900/937 loss: 1.8825, time left (estimated): 0.61s train: 937 loss: 1.8929, time: 15.16s
dev acc: 98.00%, precision: 82.79%, recall: 81.04%, F1: 81.91% best dev acc: 98.00%, precision: 82.79%, recall: 81.04%, F1: 81.91% (epoch: 4) best test acc: 97.13%, precision: 77.70%, recall: 76.02%, F1: 76.85% (epoch: 4)
Epoch 5 (LSTM(std), learning rate=0.0083, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.6122, time left (estimated): 12.56s train: 200/937 loss: 1.7545, time left (estimated): 11.42s train: 300/937 loss: 1.8272, time left (estimated): 10.19s train: 400/937 loss: 1.8695, time left (estimated): 8.71s train: 500/937 loss: 1.8206, time left (estimated): 6.98s train: 600/937 loss: 1.8122, time left (estimated): 5.43s train: 700/937 loss: 1.7974, time left (estimated): 3.80s train: 800/937 loss: 1.7895, time left (estimated): 2.21s train: 900/937 loss: 1.7844, time left (estimated): 0.60s train: 937 loss: 1.7592, time: 14.92s
dev acc: 98.03%, precision: 82.51%, recall: 82.19%, F1: 82.35% best dev acc: 98.03%, precision: 82.51%, recall: 82.19%, F1: 82.35% (epoch: 5) best test acc: 97.14%, precision: 77.33%, recall: 77.21%, F1: 77.27% (epoch: 5)
Epoch 6 (LSTM(std), learning rate=0.0080, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.7967, time left (estimated): 13.76s train: 200/937 loss: 1.7380, time left (estimated): 12.11s train: 300/937 loss: 1.7062, time left (estimated): 10.29s train: 400/937 loss: 1.7048, time left (estimated): 8.64s train: 500/937 loss: 1.7066, time left (estimated): 7.05s train: 600/937 loss: 1.7288, time left (estimated): 5.49s train: 700/937 loss: 1.7400, time left (estimated): 3.88s train: 800/937 loss: 1.7497, time left (estimated): 2.23s train: 900/937 loss: 1.7627, time left (estimated): 0.61s train: 937 loss: 1.7641, time: 15.22s
dev acc: 98.04%, precision: 82.49%, recall: 82.88%, F1: 82.69% best dev acc: 98.04%, precision: 82.49%, recall: 82.88%, F1: 82.69% (epoch: 6) best test acc: 97.07%, precision: 76.98%, recall: 78.21%, F1: 77.59% (epoch: 6)
Epoch 7 (LSTM(std), learning rate=0.0077, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.6099, time left (estimated): 12.96s train: 200/937 loss: 1.7350, time left (estimated): 11.93s train: 300/937 loss: 1.7129, time left (estimated): 10.28s train: 400/937 loss: 1.7469, time left (estimated): 8.82s train: 500/937 loss: 1.7572, time left (estimated): 7.17s train: 600/937 loss: 1.7370, time left (estimated): 5.55s train: 700/937 loss: 1.7093, time left (estimated): 3.89s train: 800/937 loss: 1.6880, time left (estimated): 2.23s train: 900/937 loss: 1.6875, time left (estimated): 0.61s train: 937 loss: 1.6810, time: 15.08s
dev acc: 98.21%, precision: 83.37%, recall: 82.10%, F1: 82.73% best dev acc: 98.21%, precision: 83.37%, recall: 82.10%, F1: 82.73% (epoch: 7) best test acc: 97.24%, precision: 78.31%, recall: 77.17%, F1: 77.74% (epoch: 7)
Epoch 8 (LSTM(std), learning rate=0.0074, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.4601, time left (estimated): 12.74s train: 200/937 loss: 1.7144, time left (estimated): 12.05s train: 300/937 loss: 1.6738, time left (estimated): 10.41s train: 400/937 loss: 1.6353, time left (estimated): 8.70s train: 500/937 loss: 1.6488, time left (estimated): 7.16s train: 600/937 loss: 1.6255, time left (estimated): 5.44s train: 700/937 loss: 1.6026, time left (estimated): 3.82s train: 800/937 loss: 1.5943, time left (estimated): 2.20s train: 900/937 loss: 1.5904, time left (estimated): 0.60s train: 937 loss: 1.5851, time: 15.00s
dev acc: 98.16%, precision: 83.43%, recall: 81.04%, F1: 82.22% best dev acc: 98.21%, precision: 83.37%, recall: 82.10%, F1: 82.73% (epoch: 7) best test acc: 97.24%, precision: 78.31%, recall: 77.17%, F1: 77.74% (epoch: 7)
Epoch 9 (LSTM(std), learning rate=0.0071, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.5425, time left (estimated): 12.59s train: 200/937 loss: 1.6459, time left (estimated): 11.54s train: 300/937 loss: 1.6891, time left (estimated): 10.43s train: 400/937 loss: 1.6785, time left (estimated): 8.78s train: 500/937 loss: 1.6821, time left (estimated): 7.16s train: 600/937 loss: 1.6776, time left (estimated): 5.53s train: 700/937 loss: 1.6908, time left (estimated): 3.96s train: 800/937 loss: 1.6926, time left (estimated): 2.29s train: 900/937 loss: 1.6696, time left (estimated): 0.62s train: 937 loss: 1.6775, time: 15.54s
dev acc: 98.28%, precision: 83.63%, recall: 82.79%, F1: 83.21% best dev acc: 98.28%, precision: 83.63%, recall: 82.79%, F1: 83.21% (epoch: 9) best test acc: 97.40%, precision: 78.73%, recall: 78.42%, F1: 78.58% (epoch: 9)
Epoch 10 (LSTM(std), learning rate=0.0069, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.5739, time left (estimated): 14.88s train: 200/937 loss: 1.4929, time left (estimated): 12.41s train: 300/937 loss: 1.4723, time left (estimated): 10.52s train: 400/937 loss: 1.5359, time left (estimated): 8.98s train: 500/937 loss: 1.4927, time left (estimated): 7.15s train: 600/937 loss: 1.4833, time left (estimated): 5.50s train: 700/937 loss: 1.4559, time left (estimated): 3.83s train: 800/937 loss: 1.4410, time left (estimated): 2.18s train: 900/937 loss: 1.4595, time left (estimated): 0.60s train: 937 loss: 1.4702, time: 15.02s
dev acc: 98.34%, precision: 83.74%, recall: 83.01%, F1: 83.37% best dev acc: 98.34%, precision: 83.74%, recall: 83.01%, F1: 83.37% (epoch: 10) best test acc: 97.48%, precision: 78.92%, recall: 78.56%, F1: 78.74% (epoch: 10)
Epoch 11 (LSTM(std), learning rate=0.0067, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.5810, time left (estimated): 14.03s train: 200/937 loss: 1.5853, time left (estimated): 12.40s train: 300/937 loss: 1.5423, time left (estimated): 10.69s train: 400/937 loss: 1.5091, time left (estimated): 8.81s train: 500/937 loss: 1.4996, time left (estimated): 7.09s train: 600/937 loss: 1.4911, time left (estimated): 5.46s train: 700/937 loss: 1.4757, time left (estimated): 3.83s train: 800/937 loss: 1.4645, time left (estimated): 2.21s train: 900/937 loss: 1.4694, time left (estimated): 0.61s train: 937 loss: 1.4674, time: 15.13s
dev acc: 98.36%, precision: 83.55%, recall: 83.36%, F1: 83.46% best dev acc: 98.36%, precision: 83.55%, recall: 83.36%, F1: 83.46% (epoch: 11) best test acc: 97.57%, precision: 78.82%, recall: 79.07%, F1: 78.94% (epoch: 11)
Epoch 12 (LSTM(std), learning rate=0.0065, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.1637, time left (estimated): 12.80s train: 200/937 loss: 1.2805, time left (estimated): 11.64s train: 300/937 loss: 1.3509, time left (estimated): 10.32s train: 400/937 loss: 1.3464, time left (estimated): 8.69s train: 500/937 loss: 1.3561, time left (estimated): 7.01s train: 600/937 loss: 1.3453, time left (estimated): 5.38s train: 700/937 loss: 1.3587, time left (estimated): 3.78s train: 800/937 loss: 1.3513, time left (estimated): 2.19s train: 900/937 loss: 1.3726, time left (estimated): 0.61s train: 937 loss: 1.3741, time: 15.10s
dev acc: 98.16%, precision: 83.13%, recall: 83.11%, F1: 83.12% best dev acc: 98.36%, precision: 83.55%, recall: 83.36%, F1: 83.46% (epoch: 11) best test acc: 97.57%, precision: 78.82%, recall: 79.07%, F1: 78.94% (epoch: 11)
Epoch 13 (LSTM(std), learning rate=0.0062, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.5685, time left (estimated): 15.13s train: 200/937 loss: 1.5330, time left (estimated): 13.42s train: 300/937 loss: 1.5295, time left (estimated): 11.47s train: 400/937 loss: 1.4667, time left (estimated): 9.38s train: 500/937 loss: 1.5124, time left (estimated): 7.85s train: 600/937 loss: 1.5023, time left (estimated): 6.03s train: 700/937 loss: 1.4821, time left (estimated): 4.17s train: 800/937 loss: 1.4831, time left (estimated): 2.41s train: 900/937 loss: 1.4986, time left (estimated): 0.66s train: 937 loss: 1.4936, time: 16.46s
dev acc: 98.51%, precision: 84.16%, recall: 83.58%, F1: 83.87% best dev acc: 98.51%, precision: 84.16%, recall: 83.58%, F1: 83.87% (epoch: 13) best test acc: 97.72%, precision: 79.45%, recall: 79.14%, F1: 79.30% (epoch: 13)
Epoch 14 (LSTM(std), learning rate=0.0061, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.2822, time left (estimated): 12.56s train: 200/937 loss: 1.3552, time left (estimated): 11.52s train: 300/937 loss: 1.3195, time left (estimated): 9.87s train: 400/937 loss: 1.3449, time left (estimated): 8.48s train: 500/937 loss: 1.3591, time left (estimated): 6.98s train: 600/937 loss: 1.3216, time left (estimated): 5.32s train: 700/937 loss: 1.3230, time left (estimated): 3.79s train: 800/937 loss: 1.3476, time left (estimated): 2.21s train: 900/937 loss: 1.3365, time left (estimated): 0.60s train: 937 loss: 1.3412, time: 14.99s
dev acc: 98.42%, precision: 83.90%, recall: 83.42%, F1: 83.66% best dev acc: 98.51%, precision: 84.16%, recall: 83.58%, F1: 83.87% (epoch: 13) best test acc: 97.72%, precision: 79.45%, recall: 79.14%, F1: 79.30% (epoch: 13)
Epoch 15 (LSTM(std), learning rate=0.0059, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.5639, time left (estimated): 14.34s train: 200/937 loss: 1.5256, time left (estimated): 12.75s train: 300/937 loss: 1.5398, time left (estimated): 11.06s train: 400/937 loss: 1.5272, time left (estimated): 9.35s train: 500/937 loss: 1.5028, time left (estimated): 7.52s train: 600/937 loss: 1.4775, time left (estimated): 5.78s train: 700/937 loss: 1.4980, time left (estimated): 4.12s train: 800/937 loss: 1.4719, time left (estimated): 2.37s train: 900/937 loss: 1.4516, time left (estimated): 0.64s train: 937 loss: 1.4439, time: 15.85s
dev acc: 98.45%, precision: 83.76%, recall: 82.96%, F1: 83.36% best dev acc: 98.51%, precision: 84.16%, recall: 83.58%, F1: 83.87% (epoch: 13) best test acc: 97.72%, precision: 79.45%, recall: 79.14%, F1: 79.30% (epoch: 13)
Epoch 16 (LSTM(std), learning rate=0.0057, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.0337, time left (estimated): 11.95s train: 200/937 loss: 1.2146, time left (estimated): 11.50s train: 300/937 loss: 1.2163, time left (estimated): 10.00s train: 400/937 loss: 1.2734, time left (estimated): 8.66s train: 500/937 loss: 1.3102, time left (estimated): 7.14s train: 600/937 loss: 1.3274, time left (estimated): 5.56s train: 700/937 loss: 1.3259, time left (estimated): 3.90s train: 800/937 loss: 1.3224, time left (estimated): 2.24s train: 900/937 loss: 1.3096, time left (estimated): 0.61s train: 937 loss: 1.3034, time: 15.19s
dev acc: 98.43%, precision: 83.86%, recall: 83.54%, F1: 83.70% best dev acc: 98.51%, precision: 84.16%, recall: 83.58%, F1: 83.87% (epoch: 13) best test acc: 97.72%, precision: 79.45%, recall: 79.14%, F1: 79.30% (epoch: 13)
Epoch 17 (LSTM(std), learning rate=0.0056, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.5186, time left (estimated): 13.96s train: 200/937 loss: 1.4127, time left (estimated): 11.87s train: 300/937 loss: 1.3337, time left (estimated): 9.97s train: 400/937 loss: 1.3327, time left (estimated): 8.48s train: 500/937 loss: 1.3473, time left (estimated): 6.99s train: 600/937 loss: 1.3244, time left (estimated): 5.42s train: 700/937 loss: 1.3301, time left (estimated): 3.82s train: 800/937 loss: 1.3322, time left (estimated): 2.22s train: 900/937 loss: 1.3217, time left (estimated): 0.61s train: 937 loss: 1.3175, time: 15.08s
dev acc: 98.46%, precision: 84.10%, recall: 83.46%, F1: 83.78% best dev acc: 98.51%, precision: 84.16%, recall: 83.58%, F1: 83.87% (epoch: 13) best test acc: 97.72%, precision: 79.45%, recall: 79.14%, F1: 79.30% (epoch: 13)
Epoch 18 (LSTM(std), learning rate=0.0054, decay rate=0.0500 (schedule=1)): train: 100/937 loss: 1.2661, time left (estimated): 13.31s

XuezheMax commented 6 years ago

Here is my log. You are using Python 3.6, right? What is your PyTorch version? Could you try using Python 2.7 with PyTorch 0.3.1 to re-run your experiments, to see if it is a version issue?

loading embedding: glove from data/glove/glove.6B/glove.6B.100d.gz
2018-06-06 15:44:55,126 - NERCRF - INFO - Creating Alphabets
2018-06-06 15:44:55,126 - Create Alphabets - INFO - Creating Alphabets: data/alphabets/ner_crf/
2018-06-06 15:44:56,115 - Create Alphabets - INFO - Total Vocabulary Size: 20102
2018-06-06 15:44:56,116 - Create Alphabets - INFO - Total Singleton Size: 9178
2018-06-06 15:44:56,120 - Create Alphabets - INFO - Total Vocabulary Size (w.o rare words): 19046
2018-06-06 15:44:56,499 - Create Alphabets - INFO - Word Alphabet Size (Singleton): 23598 (8122)
2018-06-06 15:44:56,499 - Create Alphabets - INFO - Character Alphabet Size: 86
2018-06-06 15:44:56,499 - Create Alphabets - INFO - POS Alphabet Size: 47
2018-06-06 15:44:56,499 - Create Alphabets - INFO - Chunk Alphabet Size: 19
2018-06-06 15:44:56,499 - Create Alphabets - INFO - NER Alphabet Size: 10
2018-06-06 15:44:56,499 - NERCRF - INFO - Word Alphabet Size: 23598
2018-06-06 15:44:56,500 - NERCRF - INFO - Character Alphabet Size: 86
2018-06-06 15:44:56,500 - NERCRF - INFO - POS Alphabet Size: 47
2018-06-06 15:44:56,500 - NERCRF - INFO - Chunk Alphabet Size: 19
2018-06-06 15:44:56,500 - NERCRF - INFO - NER Alphabet Size: 10
2018-06-06 15:44:56,500 - NERCRF - INFO - Reading Data
Reading data from data/conll2003/english/eng.train.bio.conll
reading data: 10000
Total number of data: 14987
Reading data from data/conll2003/english/eng.dev.bio.conll
Total number of data: 3466
Reading data from data/conll2003/english/eng.test.bio.conll
Total number of data: 3684
oov: 339
2018-06-06 15:45:01,810 - NERCRF - INFO - constructing network...
2018-06-06 15:45:02,979 - NERCRF - INFO - Network: LSTM, num_layer=1, hidden=256, filter=30, tag_space=128, crf=bigram
2018-06-06 15:45:02,980 - NERCRF - INFO - training: l2: 0.000000, (#training data: 14987, batch: 16, unk replace: 0.00)
2018-06-06 15:45:02,980 - NERCRF - INFO - dropout(in, out, rnn): (0.33, 0.50, (0.33, 0.5))
Epoch 1 (LSTM(std), learning rate=0.0100, decay rate=0.0500 (schedule=1)): train: 937 loss: 3.6320, time: 23.30s
dev acc: 96.81%, precision: 86.45%, recall: 83.52%, F1: 84.96% best dev acc: 96.81%, precision: 86.45%, recall: 83.52%, F1: 84.96% (epoch: 1) best test acc: 95.90%, precision: 81.77%, recall: 80.05%, F1: 80.90% (epoch: 1) Epoch 2 (LSTM(std), learning rate=0.0095, decay rate=0.0500 (schedule=1)): train: 937 loss: 2.3164, time: 19.93s
dev acc: 97.53%, precision: 89.47%, recall: 87.39%, F1: 88.42% best dev acc: 97.53%, precision: 89.47%, recall: 87.39%, F1: 88.42% (epoch: 2) best test acc: 96.79%, precision: 85.61%, recall: 84.37%, F1: 84.98% (epoch: 2) Epoch 3 (LSTM(std), learning rate=0.0091, decay rate=0.0500 (schedule=1)): train: 937 loss: 2.0166, time: 20.60s
dev acc: 97.56%, precision: 89.06%, recall: 87.24%, F1: 88.14% best dev acc: 97.53%, precision: 89.47%, recall: 87.39%, F1: 88.42% (epoch: 2) best test acc: 96.79%, precision: 85.61%, recall: 84.37%, F1: 84.98% (epoch: 2) Epoch 4 (LSTM(std), learning rate=0.0087, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.9072, time: 21.19s
dev acc: 97.81%, precision: 91.33%, recall: 88.66%, F1: 89.97% best dev acc: 97.81%, precision: 91.33%, recall: 88.66%, F1: 89.97% (epoch: 4) best test acc: 97.20%, precision: 88.10%, recall: 85.98%, F1: 87.03% (epoch: 4) Epoch 5 (LSTM(std), learning rate=0.0083, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.8425, time: 20.10s
dev acc: 98.05%, precision: 92.23%, recall: 90.04%, F1: 91.12% best dev acc: 98.05%, precision: 92.23%, recall: 90.04%, F1: 91.12% (epoch: 5) best test acc: 97.27%, precision: 88.23%, recall: 86.63%, F1: 87.42% (epoch: 5) Epoch 6 (LSTM(std), learning rate=0.0080, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.7096, time: 20.70s
dev acc: 97.79%, precision: 92.15%, recall: 89.13%, F1: 90.62% best dev acc: 98.05%, precision: 92.23%, recall: 90.04%, F1: 91.12% (epoch: 5) best test acc: 97.27%, precision: 88.23%, recall: 86.63%, F1: 87.42% (epoch: 5) Epoch 7 (LSTM(std), learning rate=0.0077, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.7420, time: 22.69s
dev acc: 98.16%, precision: 91.95%, recall: 90.91%, F1: 91.43% best dev acc: 98.16%, precision: 91.95%, recall: 90.91%, F1: 91.43% (epoch: 7) best test acc: 97.38%, precision: 88.18%, recall: 87.82%, F1: 88.00% (epoch: 7)

pvcastro commented 6 years ago

Yes, I'm running Anaconda 4.5.1 with Python 3.6.3, PyTorch 0.4.0 (using your pytorch0.4 branch) and gensim 3.4.0. I'll set up the Python 2 environment and verify the results.

XuezheMax commented 6 years ago

FYI, here are the first 35 epochs for Python 2.7 with PyTorch 0.4. It seems to converge slower than PyTorch 0.3, but it still approaches 90% F1 after 35 epochs.

loading embedding: glove from data/glove/glove.6B/glove.6B.100d.gz
2018-06-06 16:25:56,009 - NERCRF - INFO - Creating Alphabets
2018-06-06 16:25:56,057 - Create Alphabets - INFO - Word Alphabet Size (Singleton): 23598 (8122)
2018-06-06 16:25:56,058 - Create Alphabets - INFO - Character Alphabet Size: 86
2018-06-06 16:25:56,058 - Create Alphabets - INFO - POS Alphabet Size: 47
2018-06-06 16:25:56,058 - Create Alphabets - INFO - Chunk Alphabet Size: 19
2018-06-06 16:25:56,058 - Create Alphabets - INFO - NER Alphabet Size: 18
2018-06-06 16:25:56,058 - NERCRF - INFO - Word Alphabet Size: 23598
2018-06-06 16:25:56,058 - NERCRF - INFO - Character Alphabet Size: 86
2018-06-06 16:25:56,058 - NERCRF - INFO - POS Alphabet Size: 47
2018-06-06 16:25:56,058 - NERCRF - INFO - Chunk Alphabet Size: 19
2018-06-06 16:25:56,058 - NERCRF - INFO - NER Alphabet Size: 18
2018-06-06 16:25:56,058 - NERCRF - INFO - Reading Data
Reading data from data/conll2003/english/eng.train.bioes.conll
reading data: 10000
Total number of data: 14987
Reading data from data/conll2003/english/eng.dev.bioes.conll
Total number of data: 3466
Reading data from data/conll2003/english/eng.test.bioes.conll
Total number of data: 3684
oov: 339
2018-06-06 16:25:59,294 - NERCRF - INFO - constructing network...
/home/max/.local/lib/python2.7/site-packages/torch/nn/modules/rnn.py:38: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.5 and num_layers=1 "num_layers={}".format(dropout, num_layers))
2018-06-06 16:25:59,314 - NERCRF - INFO - Network: LSTM, num_layer=1, hidden=256, filter=30, tag_space=128, crf=bigram
2018-06-06 16:25:59,315 - NERCRF - INFO - training: l2: 0.000000, (#training data: 14987, batch: 16, unk replace: 0.00)
2018-06-06 16:25:59,315 - NERCRF - INFO - dropout(in, out, rnn): (0.33, 0.50, (0.33, 0.5))
Epoch 1 (LSTM(std), learning rate=0.0100, decay rate=0.0500 (schedule=1)): train: 937 loss: 11.5858, time: 116.24s
dev acc: 94.64%, precision: 77.99%, recall: 71.49%, F1: 74.60% best dev acc: 94.64%, precision: 77.99%, recall: 71.49%, F1: 74.60% (epoch: 1) best test acc: 93.82%, precision: 76.13%, recall: 70.41%, F1: 73.16% (epoch: 1) Epoch 2 (LSTM(std), learning rate=0.0095, decay rate=0.0500 (schedule=1)): train: 937 loss: 3.1999, time: 125.24s
dev acc: 96.54%, precision: 85.75%, recall: 83.12%, F1: 84.41% best dev acc: 96.54%, precision: 85.75%, recall: 83.12%, F1: 84.41% (epoch: 2) best test acc: 95.70%, precision: 81.84%, recall: 79.64%, F1: 80.73% (epoch: 2) Epoch 3 (LSTM(std), learning rate=0.0091, decay rate=0.0500 (schedule=1)): train: 937 loss: 2.6765, time: 114.69s
dev acc: 96.89%, precision: 90.07%, recall: 84.40%, F1: 87.14% best dev acc: 96.89%, precision: 90.07%, recall: 84.40%, F1: 87.14% (epoch: 3) best test acc: 95.90%, precision: 85.93%, recall: 80.35%, F1: 83.05% (epoch: 3) Epoch 4 (LSTM(std), learning rate=0.0087, decay rate=0.0500 (schedule=1)): train: 937 loss: 2.3663, time: 107.77s
dev acc: 97.26%, precision: 89.77%, recall: 85.85%, F1: 87.77% best dev acc: 97.26%, precision: 89.77%, recall: 85.85%, F1: 87.77% (epoch: 4) best test acc: 96.40%, precision: 85.72%, recall: 81.82%, F1: 83.72% (epoch: 4) Epoch 5 (LSTM(std), learning rate=0.0083, decay rate=0.0500 (schedule=1)): train: 937 loss: 2.2414, time: 112.05s
dev acc: 97.48%, precision: 88.71%, recall: 88.37%, F1: 88.54% best dev acc: 97.48%, precision: 88.71%, recall: 88.37%, F1: 88.54% (epoch: 5) best test acc: 96.54%, precision: 84.67%, recall: 84.95%, F1: 84.81% (epoch: 5) Epoch 6 (LSTM(std), learning rate=0.0080, decay rate=0.0500 (schedule=1)): train: 937 loss: 2.1981, time: 112.35s
dev acc: 97.58%, precision: 90.12%, recall: 89.04%, F1: 89.58% best dev acc: 97.58%, precision: 90.12%, recall: 89.04%, F1: 89.58% (epoch: 6) best test acc: 96.85%, precision: 87.28%, recall: 85.98%, F1: 86.62% (epoch: 6) Epoch 7 (LSTM(std), learning rate=0.0077, decay rate=0.0500 (schedule=1)): train: 937 loss: 2.0362, time: 114.91s
dev acc: 97.70%, precision: 92.14%, recall: 88.61%, F1: 90.34% best dev acc: 97.70%, precision: 92.14%, recall: 88.61%, F1: 90.34% (epoch: 7) best test acc: 96.89%, precision: 88.24%, recall: 84.24%, F1: 86.20% (epoch: 7) Epoch 8 (LSTM(std), learning rate=0.0074, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.8955, time: 111.44s
dev acc: 97.35%, precision: 89.69%, recall: 87.53%, F1: 88.60% best dev acc: 97.70%, precision: 92.14%, recall: 88.61%, F1: 90.34% (epoch: 7) best test acc: 96.89%, precision: 88.24%, recall: 84.24%, F1: 86.20% (epoch: 7) Epoch 9 (LSTM(std), learning rate=0.0071, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.9163, time: 106.08s
dev acc: 97.94%, precision: 91.67%, recall: 90.17%, F1: 90.91% best dev acc: 97.94%, precision: 91.67%, recall: 90.17%, F1: 90.91% (epoch: 9) best test acc: 97.14%, precision: 88.07%, recall: 86.88%, F1: 87.47% (epoch: 9) Epoch 10 (LSTM(std), learning rate=0.0069, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.8767, time: 110.97s
dev acc: 97.96%, precision: 92.15%, recall: 90.07%, F1: 91.10% best dev acc: 97.96%, precision: 92.15%, recall: 90.07%, F1: 91.10% (epoch: 10) best test acc: 97.07%, precision: 87.82%, recall: 86.07%, F1: 86.94% (epoch: 10) Epoch 11 (LSTM(std), learning rate=0.0067, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.8514, time: 113.16s
dev acc: 97.82%, precision: 91.57%, recall: 90.27%, F1: 90.92% best dev acc: 97.96%, precision: 92.15%, recall: 90.07%, F1: 91.10% (epoch: 10) best test acc: 97.07%, precision: 87.82%, recall: 86.07%, F1: 86.94% (epoch: 10) Epoch 12 (LSTM(std), learning rate=0.0065, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.7597, time: 108.15s
dev acc: 98.15%, precision: 92.33%, recall: 91.22%, F1: 91.77% best dev acc: 98.15%, precision: 92.33%, recall: 91.22%, F1: 91.77% (epoch: 12) best test acc: 97.16%, precision: 87.74%, recall: 87.32%, F1: 87.53% (epoch: 12) Epoch 13 (LSTM(std), learning rate=0.0062, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.7508, time: 111.77s
dev acc: 98.06%, precision: 92.18%, recall: 90.71%, F1: 91.44% best dev acc: 98.15%, precision: 92.33%, recall: 91.22%, F1: 91.77% (epoch: 12) best test acc: 97.16%, precision: 87.74%, recall: 87.32%, F1: 87.53% (epoch: 12) Epoch 14 (LSTM(std), learning rate=0.0061, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.7144, time: 107.66s
dev acc: 98.05%, precision: 92.76%, recall: 90.61%, F1: 91.67% best dev acc: 98.15%, precision: 92.33%, recall: 91.22%, F1: 91.77% (epoch: 12) best test acc: 97.16%, precision: 87.74%, recall: 87.32%, F1: 87.53% (epoch: 12) Epoch 15 (LSTM(std), learning rate=0.0059, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.6631, time: 113.54s
dev acc: 98.13%, precision: 92.51%, recall: 91.01%, F1: 91.75% best dev acc: 98.15%, precision: 92.33%, recall: 91.22%, F1: 91.77% (epoch: 12) best test acc: 97.16%, precision: 87.74%, recall: 87.32%, F1: 87.53% (epoch: 12) Epoch 16 (LSTM(std), learning rate=0.0057, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.6694, time: 115.85s
dev acc: 98.08%, precision: 92.43%, recall: 90.83%, F1: 91.62% best dev acc: 98.15%, precision: 92.33%, recall: 91.22%, F1: 91.77% (epoch: 12) best test acc: 97.16%, precision: 87.74%, recall: 87.32%, F1: 87.53% (epoch: 12) Epoch 17 (LSTM(std), learning rate=0.0056, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.6892, time: 115.00s
dev acc: 98.20%, precision: 92.69%, recall: 91.27%, F1: 91.97% best dev acc: 98.20%, precision: 92.69%, recall: 91.27%, F1: 91.97% (epoch: 17) best test acc: 97.30%, precision: 89.00%, recall: 87.64%, F1: 88.31% (epoch: 17) Epoch 18 (LSTM(std), learning rate=0.0054, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.5907, time: 108.94s
dev acc: 98.17%, precision: 93.07%, recall: 91.59%, F1: 92.32% best dev acc: 98.17%, precision: 93.07%, recall: 91.59%, F1: 92.32% (epoch: 18) best test acc: 97.39%, precision: 89.51%, recall: 88.21%, F1: 88.85% (epoch: 18) Epoch 19 (LSTM(std), learning rate=0.0053, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.5726, time: 110.24s
dev acc: 98.24%, precision: 93.42%, recall: 91.47%, F1: 92.43% best dev acc: 98.24%, precision: 93.42%, recall: 91.47%, F1: 92.43% (epoch: 19) best test acc: 97.42%, precision: 89.85%, recall: 87.91%, F1: 88.87% (epoch: 19) Epoch 20 (LSTM(std), learning rate=0.0051, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.5618, time: 110.93s
dev acc: 98.08%, precision: 92.10%, recall: 90.98%, F1: 91.53% best dev acc: 98.24%, precision: 93.42%, recall: 91.47%, F1: 92.43% (epoch: 19) best test acc: 97.42%, precision: 89.85%, recall: 87.91%, F1: 88.87% (epoch: 19) Epoch 21 (LSTM(std), learning rate=0.0050, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.5315, time: 114.51s
dev acc: 98.24%, precision: 93.34%, recall: 91.55%, F1: 92.44% best dev acc: 98.24%, precision: 93.34%, recall: 91.55%, F1: 92.44% (epoch: 21) best test acc: 97.39%, precision: 89.59%, recall: 87.73%, F1: 88.65% (epoch: 21) Epoch 22 (LSTM(std), learning rate=0.0049, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.5707, time: 111.92s
dev acc: 98.34%, precision: 93.33%, recall: 92.34%, F1: 92.83% best dev acc: 98.34%, precision: 93.33%, recall: 92.34%, F1: 92.83% (epoch: 22) best test acc: 97.40%, precision: 89.47%, recall: 88.49%, F1: 88.98% (epoch: 22) Epoch 23 (LSTM(std), learning rate=0.0048, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.5023, time: 109.71s
dev acc: 98.34%, precision: 93.17%, recall: 92.58%, F1: 92.88% best dev acc: 98.34%, precision: 93.17%, recall: 92.58%, F1: 92.88% (epoch: 23) best test acc: 97.45%, precision: 89.12%, recall: 88.79%, F1: 88.96% (epoch: 23) Epoch 24 (LSTM(std), learning rate=0.0047, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.5445, time: 118.68s
dev acc: 98.29%, precision: 93.96%, recall: 91.62%, F1: 92.77% best dev acc: 98.34%, precision: 93.17%, recall: 92.58%, F1: 92.88% (epoch: 23) best test acc: 97.45%, precision: 89.12%, recall: 88.79%, F1: 88.96% (epoch: 23) Epoch 25 (LSTM(std), learning rate=0.0045, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.5255, time: 114.08s
dev acc: 98.30%, precision: 93.43%, recall: 92.17%, F1: 92.80% best dev acc: 98.34%, precision: 93.17%, recall: 92.58%, F1: 92.88% (epoch: 23) best test acc: 97.45%, precision: 89.12%, recall: 88.79%, F1: 88.96% (epoch: 23) Epoch 26 (LSTM(std), learning rate=0.0044, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.5290, time: 113.38s
dev acc: 98.37%, precision: 93.29%, recall: 92.49%, F1: 92.89% best dev acc: 98.37%, precision: 93.29%, recall: 92.49%, F1: 92.89% (epoch: 26) best test acc: 97.52%, precision: 89.55%, recall: 88.95%, F1: 89.25% (epoch: 26) Epoch 27 (LSTM(std), learning rate=0.0043, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.4693, time: 111.95s
dev acc: 98.31%, precision: 93.06%, recall: 92.28%, F1: 92.67% best dev acc: 98.37%, precision: 93.29%, recall: 92.49%, F1: 92.89% (epoch: 26) best test acc: 97.52%, precision: 89.55%, recall: 88.95%, F1: 89.25% (epoch: 26) Epoch 28 (LSTM(std), learning rate=0.0043, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.3779, time: 105.43s
dev acc: 98.39%, precision: 93.44%, recall: 92.34%, F1: 92.89% best dev acc: 98.37%, precision: 93.29%, recall: 92.49%, F1: 92.89% (epoch: 26) best test acc: 97.52%, precision: 89.55%, recall: 88.95%, F1: 89.25% (epoch: 26) Epoch 29 (LSTM(std), learning rate=0.0042, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.4463, time: 117.16s
dev acc: 98.38%, precision: 93.51%, recall: 92.33%, F1: 92.91% best dev acc: 98.38%, precision: 93.51%, recall: 92.33%, F1: 92.91% (epoch: 29) best test acc: 97.61%, precision: 89.99%, recall: 88.81%, F1: 89.40% (epoch: 29) Epoch 30 (LSTM(std), learning rate=0.0041, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.4345, time: 108.91s
dev acc: 98.33%, precision: 93.25%, recall: 92.28%, F1: 92.76% best dev acc: 98.38%, precision: 93.51%, recall: 92.33%, F1: 92.91% (epoch: 29) best test acc: 97.61%, precision: 89.99%, recall: 88.81%, F1: 89.40% (epoch: 29) Epoch 31 (LSTM(std), learning rate=0.0040, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.4096, time: 111.84s
dev acc: 98.40%, precision: 93.50%, recall: 92.53%, F1: 93.01% best dev acc: 98.40%, precision: 93.50%, recall: 92.53%, F1: 93.01% (epoch: 31) best test acc: 97.61%, precision: 90.14%, recall: 89.47%, F1: 89.80% (epoch: 31) Epoch 32 (LSTM(std), learning rate=0.0039, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.4046, time: 113.07s
dev acc: 98.39%, precision: 93.94%, recall: 92.31%, F1: 93.12% best dev acc: 98.39%, precision: 93.94%, recall: 92.31%, F1: 93.12% (epoch: 32) best test acc: 97.58%, precision: 90.38%, recall: 88.79%, F1: 89.58% (epoch: 32) Epoch 33 (LSTM(std), learning rate=0.0038, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.4126, time: 111.48s
dev acc: 98.47%, precision: 93.98%, recall: 92.68%, F1: 93.32% best dev acc: 98.47%, precision: 93.98%, recall: 92.68%, F1: 93.32% (epoch: 33) best test acc: 97.56%, precision: 89.93%, recall: 88.56%, F1: 89.24% (epoch: 33) Epoch 34 (LSTM(std), learning rate=0.0038, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.3716, time: 107.51s
dev acc: 98.40%, precision: 93.87%, recall: 92.46%, F1: 93.16% best dev acc: 98.47%, precision: 93.98%, recall: 92.68%, F1: 93.32% (epoch: 33) best test acc: 97.56%, precision: 89.93%, recall: 88.56%, F1: 89.24% (epoch: 33) Epoch 35 (LSTM(std), learning rate=0.0037, decay rate=0.0500 (schedule=1)): train: 937 loss: 1.3615, time: 116.80s
dev acc: 98.39%, precision: 93.65%, recall: 92.38%, F1: 93.01% best dev acc: 98.47%, precision: 93.98%, recall: 92.68%, F1: 93.32% (epoch: 33) best test acc: 97.56%, precision: 89.93%, recall: 88.56%, F1: 89.24% (epoch: 33)

pvcastro commented 6 years ago

Hi @XuezheMax!

Besides running the Python 2 setup (with PyTorch 0.3.1), I also ran the script mentioned in #9 to add indexes to the start of each line in my corpus, to rule out the possibility that I did something wrong when adapting the code to run without the indexes. The results I got were compatible with yours: I got to nearly 90% F1 on the test dataset in only 10 epochs.

Then I went back to the pytorch4.0 branch with Python 3, reverted the changes I had made to disregard the starting indexes, and ran the training on the corpus with starting indexes again, to see whether I had succeeded because of the corpus or because of the Python and PyTorch versions. I ended up getting those same low results again, so it looks like there's something wrong with running PyTorch 0.4 on Python 3 :thinking:

I didn't test PyTorch 0.4 with Python 2.7; I'm guessing you already did that. What you probably didn't do was test with Python 3.6, right?

ducalpha commented 6 years ago

Python 2.7 + PyTorch 0.4 seems to work well. My result with this config matches the paper. Running run_ner_crf.sh on CoNLL-2003, I got an F1 of 91.36% (better than the paper's 91.21%) at epoch 167, but after that the F1 dropped to 91.12%.

Epoch 167 (LSTM(std), learning rate=0.0011, decay rate=0.0500 (schedule=1)): train: 937 loss: 0.7290, time: 31.23s
dev acc: 98.94%, precision: 94.79%, recall: 94.65%, F1: 94.72% best dev acc: 98.94%, precision: 94.79%, recall: 94.65%, F1: 94.72% (epoch: 167) best test acc: 98.14%, precision: 91.46%, recall: 91.25%, F1: 91.36% (epoch: 167)
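
(Aside: the learning rates printed in all of these logs, e.g. 0.0095 at epoch 2 and 0.0011 at epoch 167, are consistent with an inverse-time decay schedule. A minimal sketch, inferred from the logged values rather than read out of NERCRF.py:)

    def lr_at(epoch, lr0=0.01, decay_rate=0.05, schedule=1):
        # lr is decayed every `schedule` epochs by a factor 1 / (1 + decay_rate * steps)
        steps = (epoch - 1) // schedule
        return lr0 / (1.0 + decay_rate * steps)

    assert round(lr_at(2), 4) == 0.0095    # matches the epoch-2 log lines
    assert round(lr_at(167), 4) == 0.0011  # matches the epoch-167 log line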

pvcastro commented 6 years ago

These reported results are usually averaged over a number of runs, so it doesn't necessarily mean that their highest individual run scored 91.21%.
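
As a toy sketch of what that reporting convention means (the F1 values below are made up, purely to show the computation):

    import statistics
    f1_runs = [91.36, 91.05, 91.28, 91.12, 91.22]  # hypothetical F1 from repeated runs
    print(statistics.mean(f1_runs), statistics.stdev(f1_runs))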

So if you ran with 2.7 and pytorch 0.4, I'm inclined to think that the problem must be related to python 3 somehow :thinking:

pvcastro commented 6 years ago

@ducalpha did you use the pytorch4.0 branch, or did you use the master?

ducalpha commented 6 years ago

I used the pytorch4.0 branch. The master branch yielded a maximum-recursion-depth-exceeded error.
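
(If anyone else hits that recursion error on master, one generic first thing to try, not verified against this codebase, is raising Python's recursion limit before training starts:)

    import sys
    sys.setrecursionlimit(10000)  # CPython's default limit is usually 1000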