Closed jk78346 closed 6 years ago
After some tracing, I added one print line inside getNext() in ./neuronlp2/io/reader.py (and another print(inst...) line outside it):
class CoNLL03Reader(object):
    def __init__(self, file_path, word_alphabet, char_alphabet, pos_alphabet, chunk_alphabet, ner_alphabet):
        self.__source_file = open(file_path, 'r')
        self.__word_alphabet = word_alphabet
        self.__char_alphabet = char_alphabet
        self.__pos_alphabet = pos_alphabet
        self.__chunk_alphabet = chunk_alphabet
        self.__ner_alphabet = ner_alphabet

    def close(self):
        self.__source_file.close()

    def getNext(self, normalize_digits=True):
        line = self.__source_file.readline()
        print("line = ", line, ", self.__source_file = ", self.__source_file)
        # skip multiple blank lines.
        while len(line) > 0 and len(line.strip()) == 0:
            line = self.__source_file.readline()
        if len(line) == 0:
            return None
And I got the following terminal output (partial):
Reading data from data/conll2003/NeuroNLP2_sep=s_eng_train.txt
('line = ', '1 EU NNP I-NP I-ORG\n', ', self.__source_file = ', <open file 'data/conll2003/NeuroNLP2_sep=s_eng_train.txt', mode 'r' at 0x7f18ff9af8a0>)
('inst = ', <neuronlp2.io.instance.NERInstance object at 0x7f18f623d750>)
('line = ', '', ', self.__source_file = ', <open file 'data/conll2003/NeuroNLP2_sep=s_eng_train.txt', mode 'r' at 0x7f18ff9af8a0>)
('inst = ', None)
Total number of data: 1
So, does this mean that Python's built-in readline() function has a problem reading the second line of a .txt file? Or, more likely, is something wrong with my data format?
It looks weird because the reader only got one training instance from your data. Would you please paste more training instances so I can check your data format?
Hello!
I think the line it is reading, '1 EU NNP I-NP I-ORG\n', is not in BIOES format. So there may be a formatting mistake in the dataset.
@XuezheMax The following is part of my train data format:
1 EU NNP I-NP I-ORG
2 rejects VBZ I-VP O
3 German JJ I-NP I-MISC
4 call NN I-NP O
5 to TO I-VP O
6 boycott VB I-VP O
7 British JJ I-NP I-MISC
8 lamb NN I-NP O
9 . . O O
10 Peter NNP I-NP I-PER
11 Blackburn NNP I-NP I-PER
12 BRUSSELS NNP I-NP I-LOC
13 1996-08-22 CD I-NP O
14 The DT I-NP O
15 European NNP I-NP I-ORG
16 Commission NNP I-NP I-ORG
17 said VBD I-VP O
18 on IN I-PP O
19 Thursday NNP I-NP O
20 it PRP B-NP O
21 disagreed VBD I-VP O
22 with IN I-PP O
23 German JJ I-NP I-MISC
24 advice NN I-NP O
25 to TO I-PP O
26 consumers NNS I-NP O
27 to TO I-VP O
28 shun VB I-VP O
29 British JJ I-NP I-MISC
30 lamb NN I-NP O
31 until IN I-SBAR O
32 scientists NNS I-NP O
33 determine VBP I-VP O
34 whether IN I-SBAR O
35 mad JJ I-NP O
36 cow NN I-NP O
37 disease NN I-NP O
38 can MD I-VP O
39 be VB I-VP O
40 transmitted VBN I-VP O
41 to TO I-PP O
42 sheep NN I-NP O
43 . . O O
44 Germany NNP I-NP I-LOC
45 's POS B-NP O
46 representative NN I-NP O
47 to TO I-PP O
48 the DT I-NP O
"./NeuroNLP2_sep=s_eng_train.txt" 204566L, 4589269C
@rsb3060 I'm wondering whether the data format really affects whether the model code can run, as long as it has five columns?
Hello!
I think the data format does affect preprocessing: as formatted, all the words in your data are treated as a single sentence. So please put a blank line between sentences, and start the first word of each sentence with index 1. That's the correct format.
Thank you,
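To illustrate why the missing separators produce "Total number of data: 1", here is a minimal sketch of a blank-line-delimited reader. It is not NeuroNLP2's actual reader; the function name read_sentences and the sample lists are made up for this illustration, and it only assumes that a blank line marks the end of a sentence.

```python
def read_sentences(lines):
    """Group token lines into sentences, treating blank lines as separators."""
    sentences, current = [], []
    for line in lines:
        if line.strip():
            # Non-blank line: one token with its whitespace-separated columns.
            current.append(line.split())
        elif current:
            # Blank line: close the sentence collected so far.
            sentences.append(current)
            current = []
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


# Hypothetical samples: the same tokens without and with a sentence break.
no_breaks = [
    "1 EU NNP I-NP I-ORG\n",
    "2 rejects VBZ I-VP O\n",
    "10 Peter NNP I-NP I-PER\n",
]
with_breaks = [
    "1 EU NNP I-NP I-ORG\n",
    "2 rejects VBZ I-VP O\n",
    "\n",
    "1 Peter NNP I-NP I-PER\n",
]

print(len(read_sentences(no_breaks)))    # 1
print(len(read_sentences(with_breaks)))  # 2
```

Without any blank line, every token ends up in the same sentence, so a reader of this kind reports a single instance no matter how long the file is.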
Hi, as this terminal output line indicates:
'1 EU NNP I-NP I-ORG\n'
I put '\n' at the end of each line, and separate each column with one space. Is this correct?
@jk78346 There should be a blank line between two sentences. Otherwise, the reader will treat them as a single one. The following is the correct format for your examples:
1 EU NNP I-NP I-ORG
2 rejects VBZ I-VP O
3 German JJ I-NP I-MISC
4 call NN I-NP O
5 to TO I-VP O
6 boycott VB I-VP O
7 British JJ I-NP I-MISC
8 lamb NN I-NP O
9 . . O O

1 Peter NNP I-NP I-PER
2 Blackburn NNP I-NP I-PER
3 BRUSSELS NNP I-NP I-LOC
4 1996-08-22 CD I-NP O
...
Moreover, your NER tagging schema is not BIO; please convert it correctly.
@rsb3060 Thank you so much for your answer!
I really appreciate all of your answers. I realize now that you meant a 'break line between consecutive sentences', not a '\n' at the end of each line of this .txt file. Thanks so much. Now it works.
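For anyone preparing the data from scratch, here is a rough sketch of producing this layout. It assumes the input is the raw CoNLL-2003 file with four columns (word, POS, chunk, NER), blank lines between sentences, and -DOCSTART- marker lines. The function name to_indexed_format is made up for this example, and dropping the -DOCSTART- lines is my own choice, not something the repo prescribes.

```python
def to_indexed_format(raw_lines):
    """Prefix each token line with a per-sentence index starting at 1.

    Blank lines and -DOCSTART- lines are treated as sentence boundaries;
    each finished sentence is followed by exactly one empty output line.
    """
    out, index = [], 0
    for line in raw_lines:
        fields = line.split()
        if not fields or fields[0] == "-DOCSTART-":
            if index > 0:
                out.append("")  # blank line closes the current sentence
            index = 0           # the next sentence restarts at 1
            continue
        index += 1
        out.append(" ".join([str(index)] + fields))
    return out


# Hypothetical input in the raw four-column layout.
raw = [
    "-DOCSTART- -X- O O",
    "",
    "EU NNP I-NP I-ORG",
    "rejects VBZ I-VP O",
    "",
    "Peter NNP I-NP I-PER",
    "",
]
for row in to_indexed_format(raw):
    print(row)
```

The key point is that the index counter and the blank line are tied to the same event: whenever a sentence ends, one empty line is written and the numbering starts over.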
@jk78346 did it work without converting it to BIO?
I think I used the original tag type from conll03; I didn't change the format of the tag column, and it works. Conceptually, I think the format of the tag column shouldn't affect whether the code runs; it's just a human-readable convention.
@jk78346 @udion I guess the original tagging type from conll03 works, but converting it to BIO (or more advanced BIOES) can improve the performance.
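In case it helps, here is a small sketch of that conversion for one sentence. It assumes the original conll03 tags follow the IOB1 convention (an entity normally starts with I-, and B- appears only when an entity directly follows another of the same type) and rewrites them so that every entity starts with B-. The function name iob1_to_bio is made up; this is not code from the repo.

```python
def iob1_to_bio(tags):
    """Convert one sentence's IOB1 tags to BIO (IOB2) tags."""
    bio = []
    prev = "O"
    for tag in tags:
        if tag == "O" or tag.startswith("B-"):
            # O and explicit B- tags are already valid in BIO.
            bio.append(tag)
        elif prev == "O" or prev[2:] != tag[2:]:
            # An I- tag that opens a new entity becomes B-.
            bio.append("B-" + tag[2:])
        else:
            # An I- tag continuing an entity of the same type stays I-.
            bio.append(tag)
        prev = tag
    return bio


print(iob1_to_bio(["I-ORG", "O", "I-MISC", "O"]))
# ['B-ORG', 'O', 'B-MISC', 'O']
print(iob1_to_bio(["I-PER", "I-PER", "O"]))
# ['B-PER', 'I-PER', 'O']
```

The function works on the NER column of a single sentence, so it should be applied sentence by sentence after the file has been split on blank lines.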
thanks
Dear all, I'm trying to run run_ner_crf.sh on conll2003 (English) for the NER task. The error I got is:
And my setting is:
I also switched to the pytorch4.0 branch of this repo.
and the format of conll2003 was modified to:
Is there anything I did wrong? How can I run this successfully? Thanks in advance.