aonotas / deep-crf

An implementation of Conditional Random Fields (CRFs) with Deep Learning Method
http://deep-crf.com
MIT License
168 stars, 48 forks

TypeError: object of type 'int' has no len() #53

Open masakuri opened 6 years ago

masakuri commented 6 years ago

When I trained with English train/dev files, it worked. But when I trained with Japanese train/dev files (and set a pre-trained Japanese word embeddings file), I got the following error:

  File "build/bdist.linux-x86_64/egg/deepcrf/__init__.py", line 66, in train
  File "build/bdist.linux-x86_64/egg/deepcrf/main.py", line 98, in run
  File "build/bdist.linux-x86_64/egg/deepcrf/util.py", line 102, in read_conll_file
TypeError: object of type 'int' has no len()

I want to set a pre-trained Japanese char embeddings file, but it looks like there is no --char_emb_file option. I am wondering if this is the cause of the error. Does it support Japanese train/dev files (or a --char_emb_file option)? Thank you.

masakuri commented 6 years ago

I'm sorry, I typed an incorrect command. ~The error was solved.~ I still have the same error...

aonotas commented 6 years ago

OK, please let me know the command you ran.

masakuri commented 6 years ago

$ deep-crf train input_train_jp.txt --delimiter=" " --dev_file input_dev_jp.txt --save_dir save_jpmodel_dir --save_name bilstm-cnn-crf_adam_jp --optimizer adam --word_emb_file jp_word_emb300.txt --word_emb_vocab_type replace_only --gpu 0

Thank you.

aonotas commented 6 years ago

I think this error occurs because the format of your training file input_train_jp.txt is wrong (invalid input feature sizes).

I just fixed the code, so please use the latest version and let me know the result. I think input_train_jp.txt should look like this:

彼 O
は O
オバマ大統領 S-PERSON
です O

彼 O
は O

masakuri commented 6 years ago

I got the following error:

ValueError: Invalid input feature sizes: "3". Please check at line [1298]

I checked line 1298 in input_train_jp.txt and found that the "word" contains a space, like this:

ほげ[space]ほげ[space]O

"ほげ[space]ほげ" is a proper noun.

Thank you for helping me find the cause of this error. Is it OK to solve this problem by using --delimiter="\t" and formatting input_train_jp.txt like ほげ[space]ほげ[tab]O?
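
The check that pointed at line 1298 can be approximated with a short script before training. This is a minimal sketch, not deep-crf's own code: the helper name find_bad_lines and the assumption of exactly two columns (word, tag) are illustrative only. It splits every non-blank line on the delimiter and reports the 1-based line numbers whose field count differs from the expected one.

```python
def find_bad_lines(lines, delimiter=" ", expected_fields=2):
    """Return (line_number, field_count) for every malformed line.

    Hypothetical helper, not deep-crf's own check. Blank lines are
    treated as sentence separators and skipped.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            continue  # blank line separates sentences
        fields = text.split(delimiter)
        if len(fields) != expected_fields:
            bad.append((number, len(fields)))
    return bad


sample = ["彼 O\n", "は O\n", "\n", "ほげ ほげ O\n"]
print(find_bad_lines(sample))                    # space-delimited rows
print(find_bad_lines(["ほげ ほげ\tO\n"], "\t"))  # tab-delimited row
```

With the sample rows, the first call reports line 4 as having 3 fields, and the second call reports nothing because a tab delimiter keeps the space inside the word. To check a real file, an open file object can be passed in place of the list, for example open("input_train_jp.txt", encoding="utf-8").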

masakuri commented 6 years ago

I fixed the format of input_train_jp.txt and ran the following command:

$ deep-crf train input_train_jp.txt --delimiter="\t" --dev_file input_dev_jp.txt --save_dir save_jpmodel_dir --save_name bilstm-cnn-crf_adam_jp --optimizer adam --word_emb_file jp_word_emb300.txt --word_emb_vocab_type replace_only --gpu 0

I got the following error:

  File "build/bdist.linux-x86_64/egg/deepcrf/__init__.py", line 66, in train
  File "build/bdist.linux-x86_64/egg/deepcrf/main.py", line 102, in run
ValueError: Invalid training sizes: 0 sentences.

Any ideas?

aonotas commented 6 years ago

Is it OK to solve this problem by using --delimiter="\t" and formatting input_train_jp.txt like ほげ[space]ほげ[tab]O?

Yes! I think it is a good solution.

Each sentence in input_train_jp.txt must be separated from the next one by a blank line (an empty line, \n). This format is called the CoNLL format.
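
The blank-line rule can be illustrated with a small reader. This is only a sketch under stated assumptions: read_sentences is a hypothetical name, each row is assumed to hold exactly one word and one tag, and deep-crf's own read_conll_file may work differently.

```python
def read_sentences(lines, delimiter=" "):
    """Group (word, tag) rows into sentences, split at blank lines.

    Illustrative reader only; deep-crf's read_conll_file may differ.
    """
    sentences, current = [], []
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            if current:          # a blank line closes the open sentence
                sentences.append(current)
                current = []
            continue
        word, tag = text.split(delimiter)  # ValueError if not 2 fields
        current.append((word, tag))
    if current:                  # the file may not end with a blank line
        sentences.append(current)
    return sentences


rows = ["彼 O", "は O", "オバマ大統領 S-PERSON", "です O", "", "彼 O", "は O"]
for sentence in read_sentences(rows):
    print(sentence)
```

Here the single empty row splits the seven rows into two sentences. Without it, all rows would be read as one long sentence.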

For example, a file with two sentences looks like this:

$ cat input_file.txt
Barack    B-PERSON
Hussein   I-PERSON
Obama     E-PERSON
is        O
a         O
man       O
.         O

Yuji      B-PERSON
Matsumoto E-PERSON
is        O
a         O
man       O
.         O

masakuri commented 6 years ago

My input_train_jp.txt file has a blank line ("\n") between sentences (more precisely, between tweets), but I still got the error...

aonotas commented 6 years ago

Does your input_train_jp.txt now look like the following?

あああ[tab]O

あ[tab]O
い[tab]O
う[tab]O

お[space]お[tab]O
お[tab]O

masakuri commented 6 years ago

Does your input_train_jp.txt now look like the following?

あああ[tab]O

あ[tab]O
い[tab]O
う[tab]O

お[space]お[tab]O
お[tab]O

Yes.

aonotas commented 6 years ago

OK. Could you send me your input file by e-mail, if that is all right with you? nanigashi03[at] gmail.com

aonotas commented 6 years ago

Or, please try replacing the space inside a word with an underscore, and each [tab] with a [space]:

お[space]お   =>    お_お

[tab]   => [space]

and please use --delimiter=" ".

Maybe the Unicode handling of [tab] causes this error?
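
The replacement described above could be scripted roughly as follows. This is a minimal sketch, assuming every non-blank row has exactly one tab between the word and the tag; to_space_delimited is a hypothetical helper name and is not part of deep-crf.

```python
def to_space_delimited(line):
    """Convert one 'word<TAB>tag' row to 'word tag'.

    Spaces inside the word become underscores so that a single space
    can serve as the delimiter. Blank lines pass through unchanged.
    """
    text = line.rstrip("\n")
    if not text.strip():
        return ""
    word, tag = text.split("\t")  # assumes exactly one tab per row
    return word.replace(" ", "_") + " " + tag


rows = ["あああ\tO", "", "お お\tO", "お\tO"]
for row in rows:
    print(to_space_delimited(row))
```

The converted rows can then be written to a new file and used with --delimiter=" ". Note that the underscore becomes part of the word, so the same replacement has to be applied to the dev file and to any text tagged later.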

masakuri commented 6 years ago

replace [tab] to [space]:

お[space]お => お_お

[tab] => [space]

use --delimiter=" "

It worked!!! Thank you very much for your help!!!

aonotas commented 6 years ago

OK. It seems that either our code or an input format that uses [tab] causes that error.
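
One possible explanation, not confirmed anywhere in this thread: in bash, --delimiter="\t" passes the two characters backslash and t to the program rather than a tab character, unless the program converts the escape itself. Whether deep-crf performs such a conversion was not checked here. The sketch below only shows, in plain Python, that splitting a row on that two-character string leaves it as a single field.

```python
row = "お お\tO"     # a word containing a space, a real tab, then the tag

literal = "\\t"      # what bash passes for --delimiter="\t": backslash + t
real_tab = "\t"      # an actual tab character

print(len(literal), len(real_tab))
print(row.split(literal))   # no match, so the row stays in one piece
print(row.split(real_tab))  # splits into word and tag
```

If this is the cause, passing a real tab from bash with --delimiter=$'\t' might avoid it, but that was not tried in this thread.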

masakuri commented 6 years ago

I see. Thank you very much. I changed the issue title so that it describes the content.