Rabia-Noureen opened this issue 7 years ago
Is it necessary for the newly added data set to be in either CoNLL-2003 or BRAT format?
Yes. Recall that BRAT format is the same as plaintext when there is no annotation.
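To make this concrete: an unannotated BRAT deploy document is just a plain `.txt` file, with no companion `.ann` annotation file required. A minimal sketch (the folder path and filename below are placeholders):

```python
from pathlib import Path

# Hypothetical deploy folder; NeuroNER reads one document per .txt file here.
deploy = Path("data/example_unannotated_texts/deploy")
deploy.mkdir(parents=True, exist_ok=True)

# Unannotated BRAT input is just the raw text -- no .ann file is needed.
(deploy / "review_001.txt").write_text(
    "This blender works great and arrived quickly.", encoding="utf-8"
)
```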
Will a simple Amazon review data set file work fine?
What's the format?
The data set is a simple text document, just like yours, so I can use it. But it is not divided into three parts (train, valid, and test); it's a single file. Is that fine?
To perform NER on some plain texts using a pre-trained model:
python3.5 main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en
Okay sir, I will try to run my data set. Thanks.
Do I have to make changes in the code (folder or file names) to use a new unannotated data set file other than those provided in the deploy folder? If it can be done from the command line, kindly provide that command.
If you place your data in /data/example_unannotated_texts and use
python3.5 main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en
you can use any filename you want for your texts.
When I deleted the two data files already placed in /data/example_unannotated_texts/deploy and pasted my data set (a 36 MB text document) into the folder, I got the following error by running
python main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en
C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\NeuroNER-master\NeuroNER-master\src>python main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en
NeuroNER version: 1.0-dev
TensorFlow version: 1.1.0
{'character_embedding_dimension': 25,
'character_lstm_hidden_state_dimension': 25,
'check_for_digits_replaced_with_zeros': 1,
'check_for_lowercase': 1,
'dataset_text_folder': '../data/example_unannotated_texts',
'debug': 0,
'dropout_rate': 0.5,
'experiment_name': 'test',
'freeze_token_embeddings': 0,
'gradient_clipping_value': 5.0,
'learning_rate': 0.005,
'load_only_pretrained_token_embeddings': 0,
'main_evaluation_mode': 'conll',
'maximum_number_of_epochs': 100,
'number_of_cpu_threads': 8,
'number_of_gpus': 0,
'optimizer': 'sgd',
'output_folder': '../output',
'parameters_filepath': '.\parameters.ini',
'patience': 10,
'plot_format': 'pdf',
'pretrained_model_folder': '../trained_models/conll_2003_en',
'reload_character_embeddings': 1,
'reload_character_lstm': 1,
'reload_crf': 1,
'reload_feedforward': 1,
'reload_token_embeddings': 1,
'reload_token_lstm': 1,
'remap_unknown_tokens_to_unk': 1,
'spacylanguage': 'en',
'tagging_format': 'bioes',
'token_embedding_dimension': 100,
'token_lstm_hidden_state_dimension': 100,
'token_pretrained_embedding_filepath': '../data/word_vectors/glove.6B.100d.txt',
'tokenizer': 'spacy',
'train_model': 0,
'use_character_lstm': 1,
'use_crf': 1,
'use_pretrained_model': 1,
'verbose': 0}
Formatting deploy set from BRAT to CONLL... Traceback (most recent call last):
File "main.py", line 445, in
Do I need to convert my data set into any other specific format?
Could you please try converting your dataset to UTF-8? E.g., you can do so on Microsoft Windows with Notepad++.
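If Notepad++ is unavailable, the conversion can also be scripted. A minimal sketch, assuming the original file is Windows cp1252 (adjust `source_encoding` to whatever your data actually uses):

```python
from pathlib import Path

def to_utf8(path, source_encoding="cp1252"):
    """Re-encode a text file to UTF-8 in place (the source encoding is an assumption)."""
    p = Path(path)
    text = p.read_text(encoding=source_encoding)
    p.write_text(text, encoding="utf-8")
```

Note that plain UTF-8 (without BOM) is usually the safer choice; a BOM can otherwise end up as a stray character at the start of the first token.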
Sir, I converted my data set to UTF-8-BOM and ran the script again; this time I got a memory error.
C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\NeuroNER-master\NeuroNER-master\src>python main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en
NeuroNER version: 1.0-dev
TensorFlow version: 1.1.0
{'character_embedding_dimension': 25,
'character_lstm_hidden_state_dimension': 25,
'check_for_digits_replaced_with_zeros': 1,
'check_for_lowercase': 1,
'dataset_text_folder': '../data/example_unannotated_texts',
'debug': 0,
'dropout_rate': 0.5,
'experiment_name': 'test',
'freeze_token_embeddings': 0,
'gradient_clipping_value': 5.0,
'learning_rate': 0.005,
'load_only_pretrained_token_embeddings': 0,
'main_evaluation_mode': 'conll',
'maximum_number_of_epochs': 100,
'number_of_cpu_threads': 8,
'number_of_gpus': 0,
'optimizer': 'sgd',
'output_folder': '../output',
'parameters_filepath': '.\parameters.ini',
'patience': 10,
'plot_format': 'pdf',
'pretrained_model_folder': '../trained_models/conll_2003_en',
'reload_character_embeddings': 1,
'reload_character_lstm': 1,
'reload_crf': 1,
'reload_feedforward': 1,
'reload_token_embeddings': 1,
'reload_token_lstm': 1,
'remap_unknown_tokens_to_unk': 1,
'spacylanguage': 'en',
'tagging_format': 'bioes',
'token_embedding_dimension': 100,
'token_lstm_hidden_state_dimension': 100,
'token_pretrained_embedding_filepath': '../data/word_vectors/glove.6B.100d.txt',
'tokenizer': 'spacy',
'train_model': 0,
'use_character_lstm': 1,
'use_crf': 1,
'use_pretrained_model': 1,
'verbose': 0}
Formatting deploy set from BRAT to CONLL... Traceback (most recent call last):
File "main.py", line 445, in
Kindly point out the mistake, as I have to show my results to my supervisor next week.
How large is your dataset and how much memory does NeuroNER use before crashing?
Sir, my data set is 36.5 MB. How can I check how much memory NeuroNER uses before crashing?
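One practical workaround for a MemoryError on a large single file is to split it into many smaller deploy documents, since NeuroNER treats each `.txt` file in the deploy folder as one document. A sketch (the chunk size and filename pattern are arbitrary choices):

```python
from pathlib import Path

def split_into_chunks(src, out_dir, lines_per_chunk=2000):
    """Split a large text file into smaller .txt documents for the deploy folder."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = Path(src).read_text(encoding="utf-8").splitlines(keepends=True)
    for i in range(0, len(lines), lines_per_chunk):
        chunk = "".join(lines[i:i + lines_per_chunk])
        (out / f"part_{i // lines_per_chunk:04d}.txt").write_text(chunk, encoding="utf-8")
```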
Sir, I have used another, smaller data set and that works fine. Now I want to use the generated output file with the target expressions as input to a CNN. Kindly guide me: where can I find that file in the output folder? I have attached a screenshot of the output folder.
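For reference: each run writes a timestamped subfolder under the output folder, and the predictions end up there as brat standoff `.ann` files (the "Formatting 000_deploy set from CONLL to BRAT" log line refers to this step). A hedged sketch for collecting the predicted entities from such a run folder (the folder name is a placeholder):

```python
from pathlib import Path

def collect_entities(run_folder):
    """Gather brat entity annotations (the tab-separated 'T' lines) from all .ann files."""
    entities = []
    for ann in Path(run_folder).rglob("*.ann"):
        for line in ann.read_text(encoding="utf-8").splitlines():
            if not line.startswith("T"):   # skip notes/relations, keep entity lines
                continue
            _tid, type_and_offsets, text = line.split("\t")
            entities.append((type_and_offsets.split()[0], text))
    return entities
```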
Here are the results for my dataset:

C:\Users\Zill-E-Huma\AppData\Local\Programs\Python\Python35\NeuroNER-master_3\NeuroNER-master\src>python main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en
NeuroNER version: 1.0-dev
TensorFlow version: 1.1.0
{'character_embedding_dimension': 25,
'character_lstm_hidden_state_dimension': 25,
'check_for_digits_replaced_with_zeros': 1,
'check_for_lowercase': 1,
'dataset_text_folder': '../data/example_unannotated_texts',
'debug': 0,
'dropout_rate': 0.5,
'experiment_name': 'test',
'freeze_token_embeddings': 0,
'gradient_clipping_value': 5.0,
'learning_rate': 0.005,
'load_only_pretrained_token_embeddings': 0,
'main_evaluation_mode': 'conll',
'maximum_number_of_epochs': 100,
'number_of_cpu_threads': 8,
'number_of_gpus': 0,
'optimizer': 'sgd',
'output_folder': '../output',
'parameters_filepath': '.\parameters.ini',
'patience': 10,
'plot_format': 'pdf',
'pretrained_model_folder': '../trained_models/conll_2003_en',
'reload_character_embeddings': 1,
'reload_character_lstm': 1,
'reload_crf': 1,
'reload_feedforward': 1,
'reload_token_embeddings': 1,
'reload_token_lstm': 1,
'remap_unknown_tokens_to_unk': 1,
'spacylanguage': 'en',
'tagging_format': 'bioes',
'token_embedding_dimension': 100,
'token_lstm_hidden_state_dimension': 100,
'token_pretrained_embedding_filepath': '../data/word_vectors/glove.6B.100d.txt',
'tokenizer': 'spacy',
'train_model': 0,
'use_character_lstm': 1,
'use_crf': 1,
'use_pretrained_model': 1,
'verbose': 0}
Formatting deploy set from BRAT to CONLL... Done.
Converting CONLL from BIO to BIOES format... Done.
Load dataset... done (509.31 seconds)
Starting epoch 0
Load token embeddings... done (108.15 seconds)
number_of_token_original_case_found: 904
number_of_token_lowercase_found: 16446
number_of_token_digits_replaced_with_zeros_found: 31014
number_of_token_lowercase_and_digits_replaced_with_zeros_found: 0
number_of_loaded_word_vectors: 48364
dataset.vocabulary_size: 48369
Load token embeddings from pretrained model... done (0.39 seconds)
number_of_loaded_vectors: 3345
dataset.vocabulary_size: 48369
Load character embeddings from pretrained model... done (0.35 seconds)
number_of_loaded_vectors: 86
dataset.alphabet_size: 146
Training completed in 113.31 seconds
Predict labels for the deploy set
Formatting 000_deploy set from CONLL to BRAT... Done.
Finishing the experiment