Open duesXMachine opened 6 years ago
Hi @duesXMachine, the code has a variable 'embeddingsPath' where you specify the path to the embeddings file. In Train_Chunking.py it is in line 39 and specifies 'levy_deps.words' as the path, which are the dependency-based embeddings by Levy et al.
You can specify any embeddings file that is in the suitable text format (one embedding per line; each line starts with the word, followed by the floating-point values of the embedding).
Thanks @nreimers for this quick answer.
Hey @nreimers, can I use the word2vec format for embeddings? I am thinking of creating word embeddings in binary or text format using Gensim.
Hi @duesXMachine, the binary format is not supported. It must be in a text format like the embeddings from Levy et al. (https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/):
word1 0.34 0.41 0.71
word2 0.12 0.34 0.33
...
One word per line, with the word and the individual dimensions of the embedding separated by white space.
When you have trained embeddings with Gensim, you can easily store them in this format and then use them for the BiLSTM-CRF architecture.
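A minimal sketch of the conversion, assuming the embeddings were exported in the word2vec text format. That format starts with a header line `<vocab_size> <dimension>`, which the plain format shown above does not have. The function name `strip_word2vec_header` and the file names are hypothetical and not part of this repository.

```python
import os
import tempfile


def strip_word2vec_header(src_path, dst_path):
    """Copy a word2vec text file, dropping the '<vocab_size> <dim>' header if present.

    Returns the number of embedding lines written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line_no, line in enumerate(src):
            parts = line.split()
            # The header is exactly two integers; real entries are a word plus floats.
            if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue
            if not parts:
                continue  # skip blank lines
            dst.write(" ".join(parts) + "\n")
            written += 1
    return written


# Usage example with a tiny temporary file (file names are illustrative).
tmp_dir = tempfile.mkdtemp()
src_file = os.path.join(tmp_dir, "vectors_word2vec.txt")
dst_file = os.path.join(tmp_dir, "vectors_plain.txt")
with open(src_file, "w", encoding="utf-8") as f:
    f.write("2 3\nword1 0.34 0.41 0.71\nword2 0.12 0.34 0.33\n")

num_written = strip_word2vec_header(src_file, dst_file)
with open(dst_file, encoding="utf-8") as f:
    converted_lines = f.read().splitlines()
print(num_written, converted_lines)
```

With Gensim, a text export of this kind is usually produced with `model.wv.save_word2vec_format(path, binary=False)`; the resulting file can then be passed through a helper like the one above.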
I generated embeddings using Gensim in text format and set 'charEmbeddings': None. But while running RunModel.py I am getting this error:
/home/deusxmachine/.local/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
File "RunModel.py", line 22, in <module>
lstmModel.loadModel(modelPath)
File "/home/deusxmachine/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 582, in loadModel
self.maxCharLen = int(f.attrs['maxCharLen'])
ValueError: invalid literal for int() with base 10: 'None'
You can try and change line 581 to:
if 'maxCharLen' in f.attrs and f.attrs['maxCharLen'] is not None:
I tried that. Actually, f.attrs['maxCharLen'] returns 'None', not None. It's a string. Even if I handle that by doing:
if 'maxCharLen' in f.attrs and f.attrs['maxCharLen'] != 'None':
I get error again:
Using TensorFlow backend.
/home/deusxmachine/.local/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
> /home/deusxmachine/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py(578)loadModel()
-> mappings = json.loads(f.attrs['mappings'])
(Pdb) n
> /home/deusxmachine/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py(579)loadModel()
-> if 'additionalFeatures' in f.attrs:
(Pdb) n
> /home/deusxmachine/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py(580)loadModel()
-> self.additionalFeatures = json.loads(f.attrs['additionalFeatures'])
(Pdb) n
> /home/deusxmachine/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py(582)loadModel()
-> if 'maxCharLen' in f.attrs and f.attrs['maxCharLen'] != 'None':
(Pdb) n
> /home/deusxmachine/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py(585)loadModel()
-> self.model = model
(Pdb) c
Traceback (most recent call last):
File "RunModel.py", line 22, in <module>
lstmModel.loadModel(modelPath)
File "/home/deusxmachine/.local/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 93, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/home/deusxmachine/.local/lib/python2.7/site-packages/nltk/data.py", line 808, in load
opened_resource = _open(resource_url)
File "/home/deusxmachine/.local/lib/python2.7/site-packages/nltk/data.py", line 926, in _open
return find(path_, path + ['']).open()
File "/home/deusxmachine/.local/lib/python2.7/site-packages/nltk/data.py", line 648, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/deusxmachine/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''
**********************************************************************
As the error mentions, you must download the NLTK punkt tokenizer models, which contain english.pickle.
Run:
python -m nltk.downloader -d /home/deusxmachine/nltk_data punkt
Yes, my bad, I did not read it properly. Thanks @nreimers.
And change that line to:
if 'maxCharLen' in f.attrs and f.attrs['maxCharLen'] != 'None':
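The same check can be wrapped in a small helper. This is only a sketch: `parse_max_char_len` is a hypothetical name, and a plain dict stands in for the HDF5 attributes (`f.attrs`) so the example runs without a model file.

```python
def parse_max_char_len(attrs):
    """Return maxCharLen as an int, or None if it is missing or stored as 'None'."""
    raw = attrs.get("maxCharLen")
    if isinstance(raw, bytes):
        # Attributes may come back as bytes depending on how they were written.
        raw = raw.decode("utf-8")
    if raw is None or str(raw).strip() == "None":
        return None
    return int(raw)


# A dict is used here as a stand-in for f.attrs.
disabled = parse_max_char_len({"maxCharLen": "None"})
enabled = parse_max_char_len({"maxCharLen": "54"})
missing = parse_max_char_len({})
print(disabled, enabled, missing)
```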
Hey @nreimers, I enabled charEmbeddings, but for some reason it is taking too much memory and ends up crashing my server. I am using 4 V-cores and 15 GB RAM + 10 GB swap memory.
Hi @duesXMachine How much RAM is used without charEmbeddings?
When charEmbeddings are used, all words are padded to the length of the longest word in the corpus. If you have one really long word in your data, for example a URL, all words might be padded to 200 characters. This can of course require a lot of memory.
Maybe you can check the value of maxCharLen. If it is too big, it might be wise to limit the length of the words, e.g. to 30 characters. Words longer than that must then be truncated.
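A rough sketch of the effect of truncation, assuming sentences are lists of token strings. The helper names are hypothetical; the repository's own preprocessing may differ.

```python
def truncate_words(sentences, max_char_len=30):
    """Cut every token to at most max_char_len characters."""
    return [[word[:max_char_len] for word in sentence] for sentence in sentences]


def longest_word_length(sentences):
    """Length of the longest token, i.e. the padding length for character input."""
    return max((len(word) for sentence in sentences for word in sentence), default=0)


def padded_char_cells(sentences):
    """Number of character slots when every token is padded to the longest one."""
    num_words = sum(len(sentence) for sentence in sentences)
    return num_words * longest_word_length(sentences)


# One 200-character URL forces every token to be padded to 200 characters.
sentences = [["See", "http://example.com/" + "a" * 181, "for", "details"]]
cells_before = padded_char_cells(sentences)

truncated = truncate_words(sentences, max_char_len=30)
cells_after = padded_char_cells(truncated)
print(longest_word_length(sentences), cells_before)
print(longest_word_length(truncated), cells_after)
```

The character input grows linearly with the padding length, so a single outlier token can multiply the memory needed for the whole corpus.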
Hey @nreimers, before, I was using 4 GB RAM on my PC without charEmbeddings and it worked. I think you are right, I might have long URLs in my dataset as words. And how are you encoding words as 1-D vectors for the 1D convolution input? Which encoding?
My maxCharLen is 54.
Hey @nreimers, to re-train a model all I am doing is loading it and then training. Is there something to change or to keep in mind while re-training it?
Hi @duesXMachine. Loading the model and continuing the training is fine, no need to change anything after loading.
Thanks @nreimers, is there some function to compute confidence?
Hi @duesXMachine. What do you mean by confidence?
When you use a softmax as a classifier, the value can be interpreted as probability / confidence for the different available tags. For the CRF classifier, this computation is much more difficult. The scores for the different taggings would need to be computed and compared.
However, I find the confidence values computed by a network often meaningless. For error cases we often see really high confidence values of >99%; the same holds for difficult instances, where the network often has a really high confidence. Sadly, the confidence / probability returned by softmax is not a good approximation of how likely the label is correct or how easy / difficult the word was to tag. So in most application scenarios this value is not really useful.
Hey @nreimers, so is there any other workaround to get the probability/confidence? Actually, I need to check whether the network's confidence is greater or less than a certain threshold, just to know when it is reliable and when not.
The easiest way is to use softmax as a classifier. In line 138 of BiLSTM.py you find these lines:
predictions = self.model.predict([inputData[name] for name in features], verbose=False)
predictions = predictions.argmax(axis=-1) #Predict classes
The first call, self.model.predict(...), predicts the probabilities for the different labels, and predictions.argmax(...) transforms them into the concrete label.
You can store the result of self.model.predict(...) in a variable and inspect whether it gives any meaningful numbers. I am not sure how the values of self.model.predict(...) look for a CRF classifier.
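A minimal sketch of such a threshold check, assuming a softmax classifier whose output is one row of label probabilities per token. The function name and the probability values are made up for illustration, and plain lists are used instead of the array returned by self.model.predict(...).

```python
def label_with_confidence(prob_rows, threshold=0.9):
    """For each token, return (label_index, confidence, is_reliable)."""
    results = []
    for probs in prob_rows:
        # Same result as argmax over the label axis, for a single row.
        best_index = max(range(len(probs)), key=lambda i: probs[i])
        confidence = probs[best_index]
        results.append((best_index, confidence, confidence >= threshold))
    return results


# Two tokens, three possible labels each (illustrative numbers).
token_probs = [
    [0.02, 0.95, 0.03],
    [0.40, 0.35, 0.25],
]
decisions = label_with_confidence(token_probs, threshold=0.9)
for label_index, confidence, is_reliable in decisions:
    print(label_index, confidence, is_reliable)
```

The threshold itself would have to be tuned on held-out data, since softmax scores are often poorly calibrated.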
Okay, thanks @nreimers, I will look into it. Another thing: what is that casing embedding layer? How useful is it?
Most pre-trained word embeddings only store information about lower cased words, i.e., the information of the casing of a word gets lost. The casing layer provides the information about the casing of the word, e.g., all uppercase, initial character is upper case, all lowercase etc.
This is especially useful for NER, where casing provides a lot of information. However, in noisy data, the casing of words can be wrong. This can cause problems for many models, e.g. in a sentence that IS ALL UPPERCASE, many NER models output that all words are named entities (because they were spelled in uppercase letters). In that case, it is better to remove the casing layer. The performance on standard NLP datasets will drop, but the system will work much better on noisy data.
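To illustrate the idea, here is a small sketch of a casing feature. The function `casing_feature` and its category names are hypothetical and only approximate what a casing layer receives as input; the repository's own feature set may differ.

```python
def casing_feature(word):
    """Map a token to a coarse casing category."""
    if word.isdigit():
        return "numeric"
    if word.isupper():
        return "allUpper"
    if word.islower():
        return "allLower"
    if word[:1].isupper():
        return "initialUpper"
    return "other"


clean = [casing_feature(w) for w in ["Mark", "works", "at", "IBM", "since", "2017"]]
noisy = [casing_feature(w) for w in ["THIS", "IS", "ALL", "UPPERCASE"]]
print(clean)
print(noisy)
```

In the all-uppercase sentence every token gets the same category, so the feature no longer separates names from ordinary words.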
Right now I am working on a dataset with uniform casing, i.e. all upper case. Should I just remove that casing embedding layer, or is there an option to disable it?
If everything has the same casing, the layer will do no harm. The only issue is if the training data has correct casing while your real data might have wrong casing (e.g. training data correctly cased while test/real data is all lower cased / all upper cased).
Currently there is no easy option to disable it; you would need to update BiLSTM.py and remove the layer by hand from the network (or add an option for removing it to the network).
Hey @nreimers, I was wondering whether there is a way to add start and end sentence tokens like <START><END> to LSTMs. Actually, each of my inputs contains multiple sentences. I can't separate them all because they are dependent. Is there a way to achieve this, or do I have to create a new network?
LSTMs (and RNNs in general) often have trouble encoding long-range dependencies. So I'm not sure if the network (or any network) is able to figure out the dependencies between sentences. But you can try.
You can enrich your train/dev/test data and add a special token (like
Hey @nreimers, I tried reloading the model to re-train it but got this error:
--------- Epoch 1 -----------
Traceback (most recent call last):
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1021, in _do_call
return fn(*args)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
status, run_metadata)
File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,186] = 37451 is not in [0, 37444)
[[Node: Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](token_emd_W/read, _recv_embedding_input_1_0)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "Train_NER_German.py", line 90, in <module>
model.evaluate(50)
File "/home/prashantsharma2476/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 391, in evaluate
self.trainModel()
File "/home/prashantsharma2476/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 107, in trainModel
self.model.train_on_batch(nnInput, labels)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/models.py", line 766, in train_on_batch
class_weight=class_weight)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/engine/training.py", line 1320, in train_on_batch
outputs = self.train_function(ins)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 1943, in __call__
feed_dict=feed_dict)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,186] = 37451 is not in [0, 37444)
[[Node: Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](token_emd_W/read, _recv_embedding_input_1_0)]]
Caused by op 'Gather', defined at:
File "Train_NER_German.py", line 84, in <module>
model.loadModel(sys.argv[1])
File "/home/prashantsharma2476/emnlp2017-bilstm-cnn-crf/neuralnets/BiLSTM.py", line 574, in loadModel
model = keras.models.load_model(modelPath, custom_objects=create_custom_objects())
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/models.py", line 142, in load_model
model = model_from_config(model_config, custom_objects=custom_objects)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/models.py", line 193, in model_from_config
return layer_from_config(config, custom_objects=custom_objects)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/utils/layer_utils.py", line 42, in layer_from_config
return layer_class.from_config(config['config'])
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/models.py", line 1079, in from_config
merge_input = layer_from_config(merge_input_config)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/utils/layer_utils.py", line 42, in layer_from_config
return layer_class.from_config(config['config'])
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/models.py", line 1086, in from_config
model.add(layer)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/models.py", line 299, in add
layer.create_input_layer(batch_input_shape, input_dtype)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/engine/topology.py", line 401, in create_input_layer
self(x)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/engine/topology.py", line 572, in __call__
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/engine/topology.py", line 635, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/engine/topology.py", line 166, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/layers/embeddings.py", line 128, in call
out = K.gather(W, x)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 960, in gather
return tf.gather(reference, indices)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1293, in gather
validate_indices=validate_indices, name=name)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): indices[0,186] = 37451 is not in [0, 37444)
[[Node: Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](token_emd_W/read, _recv_embedding_input_1_0)]]
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7f2c054c9e10>>
Traceback (most recent call last):
File "/home/prashantsharma2476/bilstm/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 581, in __del__
UnboundLocalError: local variable 'status' referenced before assignment
Hi @duesXMachine I sadly haven't seen this issue before. Looks like some internal problem with tensorflow?
The code works best if theano is used as backend.
I'm currently working on converting the code to Keras 2 & Tensorflow, hopefully I can finish that soon.
Hey @nreimers I re-trained the model using the following code:
if len(sys.argv) == 2:
    print('Loading Pre-Trained model::' + sys.argv[1])
    model = BiLSTM(params)
    model.loadModel(sys.argv[1])
    model.setMappings(embeddings, data['mappings'])
    model.setTrainDataset(data, labelKey)
    model.verboseBuild = True
    model.modelSavePath = "models/%s/%s/[DevScore]_[TestScore]_[Epoch].h5" % (datasetName, labelKey)  # Enable this line to save the model to the disk
    model.evaluate(50)
else:
    model = BiLSTM(params)
    model.setMappings(embeddings, data['mappings'])
    model.setTrainDataset(data, labelKey)
    model.verboseBuild = True
    model.modelSavePath = "models/%s/%s/[DevScore]_[TestScore]_[Epoch].h5" % (datasetName, labelKey)  # Enable this line to save the model to the disk
    model.evaluate(50)
Now when I load the model for testing I get the following error:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/deusxmachine/emailanalyzer/src/main/python/analytics/analyzer.py", line 100, in run
self.predict()
File "/home/deusxmachine/emailanalyzer/src/main/python/analytics/analyzer.py", line 25, in predict
self.bilstm = Model(BILSTM_MODEL)
File "/home/deusxmachine/emailanalyzer/src/main/python/analytics/bilstm/model.py", line 18, in __init__
self.lstmModel.loadModel(modelPath)
File "/home/deusxmachine/emailanalyzer/src/main/python/analytics/bilstm/neuralnets/BiLSTM.py", line 576, in loadModel
model = keras.models.load_model(modelPath, custom_objects=create_custom_objects())
File "/home/deusxmachine/email/lib/python3.6/site-packages/keras/models.py", line 142, in load_model
model = model_from_config(model_config, custom_objects=custom_objects)
File "/home/deusxmachine/email/lib/python3.6/site-packages/keras/models.py", line 193, in model_from_config
return layer_from_config(config, custom_objects=custom_objects)
File "/home/deusxmachine/email/lib/python3.6/site-packages/keras/utils/layer_utils.py", line 42, in layer_from_config
return layer_class.from_config(config['config'])
File "/home/deusxmachine/email/lib/python3.6/site-packages/keras/models.py", line 1090, in from_config
layer = get_or_create_layer(conf)
File "/home/deusxmachine/email/lib/python3.6/site-packages/keras/models.py", line 1069, in get_or_create_layer
layer = layer_from_config(layer_data)
File "/home/deusxmachine/email/lib/python3.6/site-packages/keras/utils/layer_utils.py", line 35, in layer_from_config
instantiate=False)
File "/home/deusxmachine/email/lib/python3.6/site-packages/keras/utils/generic_utils.py", line 125, in get_from_module
str(identifier))
ValueError: Invalid layer: ClassWrapper
When I load a new model (which has not been re-trained) I don't get any such error. Does it have something to do with the casing embedding, or did I re-train it wrong?
Sadly I don't know why this error happens.
It looks like Keras is not able to read the information about the configuration of the network; maybe the serialization is broken after loading and saving the model again?
I would recommend trying it with a softmax classifier. The CRF module is a custom layer; maybe the issue is related to the CRF, in that storing - loading - storing does not work for custom layers.
Hi @duesXMachine
I found the issue. The CRF layer is no longer named as the CRF layer if you load & store the model. To fix it, you must update /neuralnets/keraslayers/ChainCRF.py
Change the return value of create_custom_objects() to:
return {'ChainCRF': ClassWrapper, 'ClassWrapper': ClassWrapper, 'loss': loss, 'sparse_loss': sparse_loss}
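Why the extra 'ClassWrapper' key helps can be shown with a simplified model of a name-based lookup. This is not Keras's actual deserialization code; `resolve_layer_class` and the empty `ClassWrapper` class below are stand-ins for illustration only.

```python
class ClassWrapper:
    """Stand-in for the wrapper class around the CRF layer."""


def resolve_layer_class(class_name, custom_objects):
    """Look up a layer class by the name stored in the saved model config."""
    try:
        return custom_objects[class_name]
    except KeyError:
        raise ValueError("Invalid layer: " + class_name) from None


old_objects = {"ChainCRF": ClassWrapper}
new_objects = {"ChainCRF": ClassWrapper, "ClassWrapper": ClassWrapper}

# A freshly trained model refers to the layer as 'ChainCRF': both mappings work.
print(resolve_layer_class("ChainCRF", old_objects).__name__)

# A model that was loaded and saved again refers to it as 'ClassWrapper'.
try:
    resolve_layer_class("ClassWrapper", old_objects)
except ValueError as err:
    print(err)
print(resolve_layer_class("ClassWrapper", new_objects).__name__)
```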
I just released a new (improved) version of this code that works with Keras 2.1.5 and Tensorflow 1.7.0. In that version, this bug is fixed.
Hey @nreimers, so now I can retrain the model while using the CRF layer?
@duesXMachine Yes, it should work
@nreimers The training process is taking too much memory, around 14 GB, with charEmbeddings: None and maxCharLength: 10.
How large is your uncompressed word embeddings file?
It's 62.1 MB.
Then I sadly don't know what the issue is, the model should be far smaller and it should not require 14GB memory.
Hey @nreimers, two quick questions:
While training NER, if I give a word with a label that is not present in the word embeddings, what happens? Is it converted to the <UNKNOWN> token?
While testing with a model file, if a new word is given to the model that is not present in the model's mapping dict, what happens, and how does the model predict the label for that word?
Thanks in advance. :)
Hi @duesXMachine
Unknown words are mapped to the UNKNOWN token. For these unknown words, labels are still inferred; however, the system does not see the word.
If you have a sentence like: Mark is the founder of Facebook
And assuming it would not know Facebook, then the sentence is transformed to:
Mark is the founder of UNKNOWN
It would still try to guess if UNKNOWN is a named entity or not.
When you enable character-based word representations, then a word representation for Facebook would be derived from the characters.
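A minimal sketch of the lookup described above, assuming a word-to-index dictionary with a dedicated entry for unknown words. The vocabulary, the function name and the token names `PADDING` and `UNKNOWN` are illustrative, not taken from the repository.

```python
def tokens_to_indices(tokens, word2idx, unknown_token="UNKNOWN"):
    """Replace every out-of-vocabulary token by the index of the unknown token."""
    unknown_index = word2idx[unknown_token]
    return [word2idx.get(token, unknown_index) for token in tokens]


# Toy vocabulary; 'Facebook' is deliberately missing.
word2idx = {"PADDING": 0, "UNKNOWN": 1, "Mark": 2, "is": 3,
            "the": 4, "founder": 5, "of": 6}

tokens = "Mark is the founder of Facebook".split()
indices = tokens_to_indices(tokens, word2idx)
print(indices)
```

The tagger only sees index 1 for the last position, so any label it assigns there is based on the surrounding context alone.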
Thanks @nreimers. Can you tell me what CharEmbeddingsSize does?
Hi @duesXMachine, for the char-based word representations, every character is mapped to an embedding (of size CharEmbeddingsSize), then an LSTM or CNN is used to derive an embedding for the token.
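The following sketch shows the role of the embedding size only. As a loudly simplified assumption, it replaces the LSTM/CNN by plain max-pooling over the character vectors, and it uses random, untrained embeddings; all names are hypothetical.

```python
import random


def build_char_embeddings(alphabet, char_embeddings_size, seed=0):
    """Assign every character a random vector of length char_embeddings_size."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-0.5, 0.5) for _ in range(char_embeddings_size)]
            for ch in alphabet}


def token_representation(token, char_embeddings, char_embeddings_size):
    """Max-pool the character vectors of a token into one fixed-size vector."""
    vectors = [char_embeddings[ch] for ch in token if ch in char_embeddings]
    if not vectors:
        return [0.0] * char_embeddings_size
    return [max(vec[d] for vec in vectors) for d in range(char_embeddings_size)]


alphabet = "abcdefghijklmnopqrstuvwxyz"
char_embeddings = build_char_embeddings(alphabet, char_embeddings_size=30)

# Tokens of different length end up with vectors of the same size.
facebook_vector = token_representation("facebook", char_embeddings, 30)
short_vector = token_representation("of", char_embeddings, 30)
print(len(facebook_vector), len(short_vector))
```

A larger embedding size gives each character more dimensions to encode information, at the cost of more parameters and memory.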
Hey @nreimers, I tried re-training a model with CharEmbedding: 'CNN' and got the following error about the CNN input layer dimensions:
ValueError: Error when checking input: expected char_input to have 3 dimensions, but got array with shape (2, 136)
Hi @duesXMachine, there was a bug: the characters of the words were not transformed to vectors when the model was loaded instead of built from scratch.
I pushed a bugfix to the BiLSTM.py file.
I also updated the dependencies so that it works with Keras 2.2.0 and Tensorflow 1.8.0.
There was a change in Keras/Tensorflow, so that CNN-based character-word representations no longer support masking. Hence, old models that were trained with CharEmbedding: 'CNN' cannot be loaded with the latest version.
I cloned a new version of this repo, but @duesXMachine's problem (of the error when checking char_input) is still occurring. To be more specific, I cloned this new version, trained on CoNLL-2003 with CNN char embedding using TensorFlow backend (by default), and ran a modified RunModel_CoNLL_Format.py:
inputColumns = {0: "tokens", 1: 'NER_BIO'}
# :: Prepare the input ::
sentences = readCoNLL(inputPath, inputColumns)
addCharInformation(sentences)
addCasingInformation(sentences)
# :: Load the model ::
lstmModel = BiLSTM.loadModel(modelPath)
dataMatrix = createMatrices(sentences, lstmModel.mappings, True)
print(lstmModel.computeF1(list(lstmModel.models.keys())[0], dataMatrix))
The last line will produce that exact same error.
However, if I download the pre-trained CoNLL-2003 English model, there is no such problem.
Okay, that is strange. I tested it with the German models, which use charEmbeddings: CNN, and they work. I tested it with RunModel.py; it might be that there is a slight difference to the CoNLL.py file.
I will check it next week when I'm back in the office.
Thanks! Note that there is no problem if I do the regular training + RunModel.py with the trained model. The crash occurs only when I use the modified code snippet shown above. The only changed lines are the first and the last line. So I suspect that the problem probably isn't related to the language, but rather lies in the computeF1() function.
Hey @nreimers, sometimes while testing the model I get this exception:
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7f24f4cf4c88>>
Traceback (most recent call last):
File "/home/deusxmachine/venvs/emailanalyzer/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 712, in __del__
File "/home/deusxmachine/venvs/emailanalyzer/lib/python3.5/site-packages/tensorflow/python/framework/c_api_util.py", line 31, in __init__
TypeError: 'NoneType' object is not callable
What does that mean?
It looks like the TensorFlow session is not initialized. But I am not sure why this is the case. Maybe some issue with Keras or TensorFlow.
@zhaofengwu I will check that next week and come back to you.
By the way, if I add lstmModel.tagSentences(dataMatrix) above the final line of my snippet (which is print(lstmModel.computeF1(list(lstmModel.models.keys())[0], dataMatrix))), it works fine. I hope this helps you find the problem.
I have been trying to develop NER for a domain-specific English dataset. How can I disable those German embeddings?