Franck-Dernoncourt / NeuroNER

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.
http://neuroner.com
MIT License

Newly added Data set format #24

Open · Rabia-Noureen opened this issue 7 years ago

Rabia-Noureen commented 7 years ago

Is it necessary for the newly added dataset to be in either CoNLL-2003 or BRAT format? Will a simple Amazon review dataset file work fine? If not, kindly share a method for converting it into the required format. And what if I have a single dataset file that is not divided into three files (training, validation, and test)? Is that okay?

Franck-Dernoncourt commented 7 years ago

Is it necessary for the newly added dataset to be in either CoNLL-2003 or BRAT format?

Yes. Recall that the BRAT format is the same as plain text when there is no annotation.
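For concreteness, here is a minimal sketch of the BRAT standoff format (file names, entities, and labels below are made up for illustration). A document is a pair of files: text.txt holds the raw text, and text.ann holds one tab-separated line per annotation. With no annotations, text.ann is simply empty, which is why plain text is already valid BRAT input.

    text.txt:
        John Smith lives in Boston.

    text.ann (fields are tab-separated; offsets are character positions, end-exclusive):
        T1	PER 0 10	John Smith
        T2	LOC 20 26	Boston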

Will a simple Amazon review dataset file work fine?

What's the format?

Rabia-Noureen commented 7 years ago

The dataset is a simple text document, just like yours, so it means I can use it. But the dataset is not divided into 3 parts (train, valid, and test); it's a single file. Is that fine?

Franck-Dernoncourt commented 7 years ago

Yes. From https://github.com/Franck-Dernoncourt/NeuroNER/blob/c32c1fcf62dc22da69200279ff95f2b3dac854d1/README.md#using-neuroner:

To perform NER on some plain texts using a pre-trained model:

python3.5 main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en
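For reference, later comments in this thread confirm that the plain-text files go in a deploy subfolder of the dataset folder; the file names themselves are placeholders you can choose freely:

    ../data/example_unannotated_texts/
        deploy/
            my_reviews_part1.txt
            my_reviews_part2.txt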
Rabia-Noureen commented 7 years ago

Okay sir, I will try to run my dataset, thanks.

Rabia-Noureen commented 7 years ago

Do I have to make changes in the code regarding folder and file names to use a new unannotated dataset file other than those provided in the deploy folder? If it can be done from the command line, kindly provide that script.

Franck-Dernoncourt commented 7 years ago

If you place your data in /data/example_unannotated_texts and use

 python3.5 main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en

you can use any filename you want for your texts.

Rabia-Noureen commented 7 years ago

When I deleted the 2 data files already placed in /data/example_unannotated_texts/deploy and pasted my dataset (a 36 MB text document) into the folder, I got the following error when running:

python main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en

C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\NeuroNER-master\NeuroNER-master\src>python main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en
NeuroNER version: 1.0-dev
TensorFlow version: 1.1.0
NeuroNER version: 1.0-dev
TensorFlow version: 1.1.0
{'character_embedding_dimension': 25, 'character_lstm_hidden_state_dimension': 25, 'check_for_digits_replaced_with_zeros': 1, 'check_for_lowercase': 1, 'dataset_text_folder': '../data/example_unannotated_texts', 'debug': 0, 'dropout_rate': 0.5, 'experiment_name': 'test', 'freeze_token_embeddings': 0, 'gradient_clipping_value': 5.0, 'learning_rate': 0.005, 'load_only_pretrained_token_embeddings': 0, 'main_evaluation_mode': 'conll', 'maximum_number_of_epochs': 100, 'number_of_cpu_threads': 8, 'number_of_gpus': 0, 'optimizer': 'sgd', 'output_folder': '../output', 'parameters_filepath': '.\parameters.ini', 'patience': 10, 'plot_format': 'pdf', 'pretrained_model_folder': '../trained_models/conll_2003_en', 'reload_character_embeddings': 1, 'reload_character_lstm': 1, 'reload_crf': 1, 'reload_feedforward': 1, 'reload_token_embeddings': 1, 'reload_token_lstm': 1, 'remap_unknown_tokens_to_unk': 1, 'spacylanguage': 'en', 'tagging_format': 'bioes', 'token_embedding_dimension': 100, 'token_lstm_hidden_state_dimension': 100, 'token_pretrained_embedding_filepath': '../data/word_vectors/glove.6B.100d.txt', 'tokenizer': 'spacy', 'train_model': 0, 'use_character_lstm': 1, 'use_crf': 1, 'use_pretrained_model': 1, 'verbose': 0}
Formatting deploy set from BRAT to CONLL...
Traceback (most recent call last):
  File "main.py", line 445, in <module>
    main()
  File "main.py", line 268, in main
    dataset_filepaths, dataset_brat_folders = get_valid_dataset_filepaths(parameters)
  File "main.py", line 162, in get_valid_dataset_filepaths
    brat_to_conll.brat_to_conll(dataset_brat_folders[dataset_type], dataset_filepath_for_tokenizer, parameters['tokenizer'], parameters['spacylanguage'])
  File "C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\NeuroNER-master\NeuroNER-master\src\brat_to_conll.py", line 141, in brat_to_conll
    text, entities = get_entities_from_brat(text_filepath, annotation_filepath)
  File "C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\NeuroNER-master\NeuroNER-master\src\brat_to_conll.py", line 73, in get_entities_from_brat
    text = f.read()
  File "C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\lib\codecs.py", line 698, in read
    return self.reader.read(size)
  File "C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\lib\codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 5143: invalid start byte

Do I need to convert my dataset into any other specific format?

Franck-Dernoncourt commented 7 years ago

Could you please try converting your dataset to UTF-8?

E.g. you can do so on Microsoft Windows with Notepad++:

[screenshot: converting the file encoding in Notepad++]
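If Notepad++ is not at hand, a minimal Python sketch can do the same conversion. The source encoding is a guess: byte 0xfc from the traceback is 'ü' in Latin-1/Windows-1252, a common encoding for Windows text files; the file path is an assumption, so substitute your own.

    # Hypothetical path; point this at your actual deploy file.
    input_filepath = '../data/example_unannotated_texts/deploy/reviews.txt'

    # Read assuming Windows-1252 (where byte 0xfc decodes fine), then rewrite as UTF-8.
    with open(input_filepath, 'r', encoding='cp1252') as f:
        text = f.read()
    with open(input_filepath, 'w', encoding='utf-8') as f:
        f.write(text)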

Rabia-Noureen commented 7 years ago

Sir, I converted my dataset to UTF-8-BOM and tried to run the script again; this time I got a memory error:

C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\NeuroNER-master\NeuroNER-master\src>python main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en
NeuroNER version: 1.0-dev
TensorFlow version: 1.1.0
NeuroNER version: 1.0-dev
TensorFlow version: 1.1.0
{'character_embedding_dimension': 25, 'character_lstm_hidden_state_dimension': 25, 'check_for_digits_replaced_with_zeros': 1, 'check_for_lowercase': 1, 'dataset_text_folder': '../data/example_unannotated_texts', 'debug': 0, 'dropout_rate': 0.5, 'experiment_name': 'test', 'freeze_token_embeddings': 0, 'gradient_clipping_value': 5.0, 'learning_rate': 0.005, 'load_only_pretrained_token_embeddings': 0, 'main_evaluation_mode': 'conll', 'maximum_number_of_epochs': 100, 'number_of_cpu_threads': 8, 'number_of_gpus': 0, 'optimizer': 'sgd', 'output_folder': '../output', 'parameters_filepath': '.\parameters.ini', 'patience': 10, 'plot_format': 'pdf', 'pretrained_model_folder': '../trained_models/conll_2003_en', 'reload_character_embeddings': 1, 'reload_character_lstm': 1, 'reload_crf': 1, 'reload_feedforward': 1, 'reload_token_embeddings': 1, 'reload_token_lstm': 1, 'remap_unknown_tokens_to_unk': 1, 'spacylanguage': 'en', 'tagging_format': 'bioes', 'token_embedding_dimension': 100, 'token_lstm_hidden_state_dimension': 100, 'token_pretrained_embedding_filepath': '../data/word_vectors/glove.6B.100d.txt', 'tokenizer': 'spacy', 'train_model': 0, 'use_character_lstm': 1, 'use_crf': 1, 'use_pretrained_model': 1, 'verbose': 0}
Formatting deploy set from BRAT to CONLL...
Traceback (most recent call last):
  File "main.py", line 445, in <module>
    main()
  File "main.py", line 268, in main
    dataset_filepaths, dataset_brat_folders = get_valid_dataset_filepaths(parameters)
  File "main.py", line 162, in get_valid_dataset_filepaths
    brat_to_conll.brat_to_conll(dataset_brat_folders[dataset_type], dataset_filepath_for_tokenizer, parameters['tokenizer'], parameters['spacylanguage'])
  File "C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\NeuroNER-master\NeuroNER-master\src\brat_to_conll.py", line 145, in brat_to_conll
    sentences = get_sentences_and_tokens_from_spacy(text, spacy_nlp)
  File "C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\NeuroNER-master\NeuroNER-master\src\brat_to_conll.py", line 18, in get_sentences_and_tokens_from_spacy
    document = spacy_nlp(text)
  File "C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\lib\site-packages\spacy\language.py", line 320, in __call__
    doc = self.make_doc(text)
  File "C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\lib\site-packages\spacy\language.py", line 293, in <lambda>
    self.make_doc = lambda text: self.tokenizer(text)
  File "spacy\tokenizer.pyx", line 150, in spacy.tokenizer.Tokenizer.__call__ (spacy/tokenizer.cpp:5725)
  File "spacy\tokenizer.pyx", line 196, in spacy.tokenizer.Tokenizer._try_cache (spacy/tokenizer.cpp:6403)
  File "spacy\tokens\doc.pyx", line 469, in spacy.tokens.doc.Doc.push_back (spacy/tokens/doc.cpp:10847)
MemoryError

Kindly point out the mistake, as I have to show my results to my supervisor next week.
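A hedged aside on the traceback above: it dies inside the spaCy tokenizer, which is handed the entire file as one document, so a 36 MB text can plausibly exhaust memory. A minimal workaround sketch, splitting the single deploy file into many smaller ones before running NeuroNER; file names and the chunk size are arbitrary assumptions:

    import os

    # Assumptions: 'big_reviews.txt' is the 36 MB file; 10000 lines per piece is arbitrary.
    input_filepath = 'big_reviews.txt'
    output_folder = '../data/example_unannotated_texts/deploy'
    lines_per_chunk = 10000

    os.makedirs(output_folder, exist_ok=True)
    with open(input_filepath, encoding='utf-8') as f:
        lines = f.readlines()
    # Write each slice of lines to its own numbered deploy file.
    for i in range(0, len(lines), lines_per_chunk):
        chunk_filepath = os.path.join(output_folder, 'chunk_{0:03d}.txt'.format(i // lines_per_chunk))
        with open(chunk_filepath, 'w', encoding='utf-8') as out:
            out.writelines(lines[i:i + lines_per_chunk])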

Franck-Dernoncourt commented 7 years ago

How large is your dataset and how much memory does NeuroNER use before crashing?

Rabia-Noureen commented 7 years ago

Sir, my dataset is 36.5 MB. How can I check how much memory NeuroNER uses before crashing?
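On Windows, the simplest check is watching the python.exe process in Task Manager while NeuroNER runs. For something scriptable, a minimal sketch using the third-party psutil package (pip install psutil); the PID is a placeholder you would read from Task Manager:

    import time
    import psutil  # third-party: pip install psutil

    pid = 1234  # hypothetical: the PID of the running "python main.py" process
    process = psutil.Process(pid)
    # Poll the resident set size every 5 seconds; this loop raises
    # psutil.NoSuchProcess once the watched process exits or crashes.
    while True:
        rss_megabytes = process.memory_info().rss / 1024 / 1024
        print('NeuroNER memory: {0:.1f} MB'.format(rss_megabytes))
        time.sleep(5)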

Rabia-Noureen commented 7 years ago

Sir, I have used another, smaller dataset and that is working fine. Now I want to use the generated output file with target expressions as an input to a CNN. Kindly guide me on where I can find that file in the output folder. I have attached a screenshot of the output folder. [screenshot: output folder contents]

Here are the results for my dataset:

C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\NeuroNER-master_3\NeuroNER-master\src>python main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en
NeuroNER version: 1.0-dev
TensorFlow version: 1.1.0
NeuroNER version: 1.0-dev
TensorFlow version: 1.1.0
{'character_embedding_dimension': 25, 'character_lstm_hidden_state_dimension': 25, 'check_for_digits_replaced_with_zeros': 1, 'check_for_lowercase': 1, 'dataset_text_folder': '../data/example_unannotated_texts', 'debug': 0, 'dropout_rate': 0.5, 'experiment_name': 'test', 'freeze_token_embeddings': 0, 'gradient_clipping_value': 5.0, 'learning_rate': 0.005, 'load_only_pretrained_token_embeddings': 0, 'main_evaluation_mode': 'conll', 'maximum_number_of_epochs': 100, 'number_of_cpu_threads': 8, 'number_of_gpus': 0, 'optimizer': 'sgd', 'output_folder': '../output', 'parameters_filepath': '.\parameters.ini', 'patience': 10, 'plot_format': 'pdf', 'pretrained_model_folder': '../trained_models/conll_2003_en', 'reload_character_embeddings': 1, 'reload_character_lstm': 1, 'reload_crf': 1, 'reload_feedforward': 1, 'reload_token_embeddings': 1, 'reload_token_lstm': 1, 'remap_unknown_tokens_to_unk': 1, 'spacylanguage': 'en', 'tagging_format': 'bioes', 'token_embedding_dimension': 100, 'token_lstm_hidden_state_dimension': 100, 'token_pretrained_embedding_filepath': '../data/word_vectors/glove.6B.100d.txt', 'tokenizer': 'spacy', 'train_model': 0, 'use_character_lstm': 1, 'use_crf': 1, 'use_pretrained_model': 1, 'verbose': 0}
Formatting deploy set from BRAT to CONLL... Done.
Converting CONLL from BIO to BIOES format... Done.
Load dataset... done (509.31 seconds)

Starting epoch 0
Load token embeddings... done (108.15 seconds)
number_of_token_original_case_found: 904
number_of_token_lowercase_found: 16446
number_of_token_digits_replaced_with_zeros_found: 31014
number_of_token_lowercase_and_digits_replaced_with_zeros_found: 0
number_of_loaded_word_vectors: 48364
dataset.vocabulary_size: 48369
Load token embeddings from pretrained model... done (0.39 seconds)
number_of_loaded_vectors: 3345
dataset.vocabulary_size: 48369
Load character embeddings from pretrained model... done (0.35 seconds)
number_of_loaded_vectors: 86
dataset.alphabet_size: 146
Training completed in 113.31 seconds
Predict labels for the deploy set
Formatting 000_deploy set from CONLL to BRAT... Done.
Finishing the experiment
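Judging from the log above ("Formatting 000_deploy set from CONLL to BRAT... Done."), the predictions for the deploy set are written both as a CoNLL-format 000_deploy file and as BRAT .ann files somewhere under ../output. A minimal sketch for pulling the predicted entities out of the .ann files; the folder layout here is an assumption, so browse your output folder for the actual timestamped experiment directory:

    import glob
    import os

    # Hypothetical path: the experiment folder name under ../output varies per run.
    ann_folder = os.path.join('..', 'output', 'my_experiment', 'brat', 'deploy')

    for ann_filepath in sorted(glob.glob(os.path.join(ann_folder, '*.ann'))):
        with open(ann_filepath, encoding='utf-8') as f:
            for line in f:
                # BRAT entity lines are tab-separated: "T1<TAB>PER 0 10<TAB>John Smith"
                if line.startswith('T'):
                    entity_id, label_and_offsets, surface_text = line.rstrip('\n').split('\t')
                    print(entity_id, label_and_offsets, surface_text)

The middle field carries the predicted label plus character offsets, and the last field the matched text, which together should be enough to build the input for a downstream CNN.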