XuezheMax / NeuroNLP2

Deep neural models for core NLP tasks (Pytorch version)
GNU General Public License v3.0

conll03_data.py line 382 Error: float division by zero, running ./example/run_ner_crf.sh #14

Closed jk78346 closed 6 years ago

jk78346 commented 6 years ago

Dear all, I'm trying to run run_ner_crf.sh on conll2003 (English) for the NER task. The error I get is:

loading embedding: glove from data/glove/glove.6B/glove.6B.100d.gz
2018-05-13 11:40:13,149 - NERCRF - INFO - Creating Alphabets
2018-05-13 11:40:13,174 - Create Alphabets - INFO - Word Alphabet Size (Singleton): 23598 (8122)
2018-05-13 11:40:13,174 - Create Alphabets - INFO - Character Alphabet Size: 86
2018-05-13 11:40:13,174 - Create Alphabets - INFO - POS Alphabet Size: 47
2018-05-13 11:40:13,174 - Create Alphabets - INFO - Chunk Alphabet Size: 19
2018-05-13 11:40:13,174 - Create Alphabets - INFO - NER Alphabet Size: 9
2018-05-13 11:40:13,174 - NERCRF - INFO - Word Alphabet Size: 23598
2018-05-13 11:40:13,174 - NERCRF - INFO - Character Alphabet Size: 86
2018-05-13 11:40:13,174 - NERCRF - INFO - POS Alphabet Size: 47
2018-05-13 11:40:13,174 - NERCRF - INFO - Chunk Alphabet Size: 19
2018-05-13 11:40:13,174 - NERCRF - INFO - NER Alphabet Size: 9
2018-05-13 11:40:13,174 - NERCRF - INFO - Reading Data
Reading data from data/conll2003/NeuroNLP2_sep=s_eng_train
Total number of data: 1
Reading data from data/conll2003/NeuroNLP2_sep=s_eng_testa
Total number of data: 1
Reading data from data/conll2003/NeuroNLP2_sep=s_eng_testb
Total number of data: 1
oov: 339
2018-05-13 11:40:18,594 - NERCRF - INFO - constructing network...
/home/jayhsu/miniconda2/envs/py27/lib/python2.7/site-packages/torch/nn/modules/rnn.py:38: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.5 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
2018-05-13 11:40:25,038 - NERCRF - INFO - Network: LSTM, num_layer=1, hidden=200, filter=30, tag_space=128, crf=bigram
2018-05-13 11:40:25,038 - NERCRF - INFO - training: l2: 0.000000, (#training data: 0, batch: 10, unk replace: 0.00)
2018-05-13 11:40:25,038 - NERCRF - INFO - dropout(in, out, rnn): (0.33, 0.50, (0.33, 0.5))
Epoch 1 (LSTM(std), learning rate=0.0150, decay rate=0.0500 (schedule=1)):
Traceback (most recent call last):
  File "examples/NERCRF.py", line 250, in <module>
    main()
  File "examples/NERCRF.py", line 179, in main
    word, char, _, _, labels, masks, lengths = conll03_data.get_batch_tensor(data_train, batch_size, unk_replace=unk_replace)
  File "/home/jayhsu/NeuroNLP2/neuronlp2/io/conll03_data.py", line 382, in get_batch_tensor
    buckets_scale = [sum(bucket_sizes[:i + 1]) / total_size for i in range(len(bucket_sizes))]
ZeroDivisionError: float division by zero

And my setting is:

python 2.7.15 | Anaconda, Inc.|
pytorch 0.4.0
gensim 3.4.0

I also switched to the pytorch4.0 branch of this repo.

I modified the conll2003 data to the following format:

1 EU NNP I-NP I-ORG
2 rejects VBZ I-VP O
3 German JJ I-NP I-MISC
4 call NN I-NP O
5 to TO I-VP O
6 boycott VB I-VP O
7 British JJ I-NP I-MISC
8 lamb NN I-NP O
9 . . O O
10 Peter NNP I-NP I-PER

Is there anything I did wrong? How can I run this successfully? Thanks in advance.

jk78346 commented 6 years ago

After some tracing, I added one print line inside getNext() in ./neuronlp2/io/reader.py (and another print(inst...) line outside it):

class CoNLL03Reader(object):
    def __init__(self, file_path, word_alphabet, char_alphabet, pos_alphabet, chunk_alphabet, ner_alphabet):
        self.__source_file = open(file_path, 'r')
        self.__word_alphabet = word_alphabet
        self.__char_alphabet = char_alphabet
        self.__pos_alphabet = pos_alphabet
        self.__chunk_alphabet = chunk_alphabet
        self.__ner_alphabet = ner_alphabet

    def close(self):
        self.__source_file.close()

    def getNext(self, normalize_digits=True):
        line = self.__source_file.readline()
        print("line = ", line, ", self.__source_file = ", self.__source_file)
        # skip multiple blank lines.
        while len(line) > 0 and len(line.strip()) == 0:
            line = self.__source_file.readline()
        if len(line) == 0:
            return None

And I got the following terminal output (partial):

Reading data from data/conll2003/NeuroNLP2_sep=s_eng_train.txt
('line = ', '1 EU NNP I-NP I-ORG\n', ', self.__source_file = ', <open file 'data/conll2003/NeuroNLP2_sep=s_eng_train.txt', mode 'r' at 0x7f18ff9af8a0>)
('inst = ', <neuronlp2.io.instance.NERInstance object at 0x7f18f623d750>)
('line = ', '', ', self.__source_file = ', <open file 'data/conll2003/NeuroNLP2_sep=s_eng_train.txt', mode 'r' at 0x7f18ff9af8a0>)
('inst = ', None)
Total number of data: 1

So, does this mean that Python's built-in readline() function has a problem reading the second line of a .txt file? Or, more likely, is something wrong with my data format?

XuezheMax commented 6 years ago

It looks weird, because the reader only got one training instance from your data. Would you please paste more training instances so I can check your data format?
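One quick way to check how many instances a blank-line-delimited reader would see is to count the runs of non-blank lines in the file. The following is only a sketch: the helper name `sentence_lengths` is hypothetical and is not part of NeuroNLP2.

```python
def sentence_lengths(lines):
    """Return the token count of each blank-line-delimited sentence.

    Hypothetical helper for sanity-checking a data file; it only assumes
    one token per line and blank lines between sentences.
    """
    lengths = []
    current = 0
    for line in lines:
        if line.strip():
            # Non-blank line: one more token in the current sentence.
            current += 1
        elif current:
            # Blank line closes the current sentence (extra blanks are ignored).
            lengths.append(current)
            current = 0
    if current:
        # The last sentence may not be followed by a blank line.
        lengths.append(current)
    return lengths
```

Calling it as `sentence_lengths(open(path))` on a file without any blank lines returns a single entry whose value is the total number of tokens, which would match a reader that reports only one instance.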

rsb3060 commented 6 years ago

Hello!

I think the line it is reading, ('line = ', '1 EU NNP I-NP I-ORG\n', ..., is not in BIOES format. So there may be a mistake in the data formatting of the dataset.

jk78346 commented 6 years ago

@XuezheMax The following is part of my train data format:

1 EU NNP I-NP I-ORG
2 rejects VBZ I-VP O
3 German JJ I-NP I-MISC
4 call NN I-NP O
5 to TO I-VP O
6 boycott VB I-VP O
7 British JJ I-NP I-MISC
8 lamb NN I-NP O
9 . . O O
10 Peter NNP I-NP I-PER
11 Blackburn NNP I-NP I-PER
12 BRUSSELS NNP I-NP I-LOC
13 1996-08-22 CD I-NP O
14 The DT I-NP O
15 European NNP I-NP I-ORG
16 Commission NNP I-NP I-ORG
17 said VBD I-VP O
18 on IN I-PP O
19 Thursday NNP I-NP O
20 it PRP B-NP O
21 disagreed VBD I-VP O
22 with IN I-PP O
23 German JJ I-NP I-MISC
24 advice NN I-NP O
25 to TO I-PP O
26 consumers NNS I-NP O
27 to TO I-VP O
28 shun VB I-VP O
29 British JJ I-NP I-MISC
30 lamb NN I-NP O
31 until IN I-SBAR O
32 scientists NNS I-NP O
33 determine VBP I-VP O
34 whether IN I-SBAR O
35 mad JJ I-NP O
36 cow NN I-NP O
37 disease NN I-NP O
38 can MD I-VP O
39 be VB I-VP O
40 transmitted VBN I-VP O
41 to TO I-PP O
42 sheep NN I-NP O
43 . . O O
44 Germany NNP I-NP I-LOC
45 's POS B-NP O
46 representative NN I-NP O
47 to TO I-PP O
48 the DT I-NP O
"./NeuroNLP2_sep=s_eng_train.txt" 204566L, 4589269C   

@rsb3060 I'm wondering whether the data format really affects whether the model code can run, as long as it has five columns?

rsb3060 commented 6 years ago

Hello!

I think the data format does affect the preprocessing, because your format treats all the words in the data as a single sentence. So please put a blank line between sentences, and restart the word index at 1 at the beginning of each sentence. That is the correct format.

Thank you,
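To make the connection to the ZeroDivisionError concrete: if the whole file is read as one sentence (the paste above reports 204566 lines), that sentence presumably fits none of the length buckets, which would explain the "#training data: 0" in the log. Every bucket count is then zero, and the expression on line 382 of conll03_data.py divides by zero. The sketch below reproduces that arithmetic. The function name `bucket_scales` and the bucket limits are illustrative assumptions, not values taken from NeuroNLP2.

```python
def bucket_scales(sentence_lengths, bucket_limits):
    """Cumulative share of sentences per length bucket.

    Hypothetical re-creation of the bucketing arithmetic; bucket_limits
    is an assumed list of maximum sentence lengths in ascending order.
    """
    bucket_sizes = [0] * len(bucket_limits)
    for length in sentence_lengths:
        for i, limit in enumerate(bucket_limits):
            if length <= limit:
                bucket_sizes[i] += 1
                break
        # A sentence longer than every limit is silently dropped.

    total_size = float(sum(bucket_sizes))
    if total_size == 0:
        # Without this guard, the list comprehension below raises
        # "ZeroDivisionError: float division by zero" as in the traceback.
        raise ValueError("no sentence fits any bucket; check the data format")

    # Same expression as in the traceback.
    return [sum(bucket_sizes[:i + 1]) / total_size
            for i in range(len(bucket_sizes))]
```

With sentence lengths such as `[9, 4]` and limits `[5, 10, 140]` the function returns `[0.5, 1.0, 1.0]`; with a single length of 204566 nothing is bucketed, and the guard reports the problem instead of failing inside the division.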

jk78346 commented 6 years ago

Hi, as this terminal output line indicates:

'1 EU NNP I-NP I-ORG\n'

I put '\n' at the end of each line and separate the columns with a single space. Is this correct?

XuezheMax commented 6 years ago

@jk78346 There should be a blank line between two sentences. Otherwise, the reader will treat them as a single one. The following is the correct format for your examples:

1 EU NNP I-NP I-ORG
2 rejects VBZ I-VP O
3 German JJ I-NP I-MISC
4 call NN I-NP O
5 to TO I-VP O
6 boycott VB I-VP O
7 British JJ I-NP I-MISC
8 lamb NN I-NP O
9 . . O O

1 Peter NNP I-NP I-PER
2 Blackburn NNP I-NP I-PER
3 BRUSSELS NNP I-NP I-LOC
4 1996-08-22 CD I-NP O
...

Moreover, your NER tagging schema is not BIO; please convert it correctly. @rsb3060 Thank you so much for your answer!
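For anyone producing this format from the original CoNLL-2003 files, here is a minimal sketch of the conversion. It assumes the input has one token per line, a blank line between sentences, and "-DOCSTART-" marker lines between documents. The function name `to_indexed_format` and the choice to drop the "-DOCSTART-" lines are assumptions made for illustration.

```python
def to_indexed_format(lines, skip_docstart=True):
    """Prefix each token line with a 1-based index that restarts per sentence.

    Hypothetical converter: blank lines in the input are kept as sentence
    separators, and "-DOCSTART-" lines are dropped when skip_docstart is True.
    """
    out = []
    index = 0  # position of the last token written in the current sentence
    for raw in lines:
        line = raw.strip()
        if not line:
            if index > 0:
                out.append("")  # keep exactly one blank line between sentences
            index = 0
            continue
        if skip_docstart and line.startswith("-DOCSTART-"):
            continue
        index += 1
        out.append("%d %s" % (index, line))
    # Drop trailing separators so the output ends with a token line.
    while out and out[-1] == "":
        out.pop()
    return out
```

Writing the result with `"\n".join(...)` keeps the blank line between sentences, which is what the reader relies on to split instances.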

jk78346 commented 6 years ago

I really appreciate all of your answers. I now realize that you meant a blank line between consecutive sentences, not a '\n' at the end of each line of the .txt file. Thanks so much. Now it works.

udion commented 6 years ago

@jk78346 did it work without converting it to BIO?

jk78346 commented 6 years ago

I think I used the original tag type from conll03; I didn't change the format of the tag column, and it works. Conceptually, I don't think the format of the tag column affects whether the code runs; it's just a human-readable labeling convention.

XuezheMax commented 6 years ago

@jk78346 @udion I guess the original tagging type from conll03 works, but converting it to BIO (or more advanced BIOES) can improve the performance.
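As a rough illustration of the conversion mentioned here: the tags in the pasted data use I- for the first token of an entity (for example I-ORG on "EU"), whereas BIO marks the first token of every entity with B-. A minimal sketch of that rewrite for one sentence follows; the function name `iob1_to_bio` is hypothetical and this is not code from NeuroNLP2.

```python
def iob1_to_bio(tags):
    """Convert one sentence's tags so every entity starts with B-.

    Hypothetical helper: an I-X tag becomes B-X when the previous token is
    outside any entity or belongs to an entity of a different type.
    """
    converted = []
    prev_type = None  # entity type of the previous token, None if it was "O"
    for tag in tags:
        if tag == "O":
            converted.append(tag)
            prev_type = None
            continue
        prefix, ent_type = tag.split("-", 1)
        if prefix == "I" and prev_type != ent_type:
            prefix = "B"  # first token of a new entity
        converted.append(prefix + "-" + ent_type)
        prev_type = ent_type
    return converted
```

For the first example sentence, the NER column `I-ORG O I-MISC O O O I-MISC O O` becomes `B-ORG O B-MISC O O O B-MISC O O`. The function must be applied per sentence, so it depends on the sentences already being separated by blank lines.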

udion commented 6 years ago

thanks