YoungXiyuan / DCA

This repository contains code used in the EMNLP 2019 paper "Learning Dynamic Context Augmentation for Global Entity Linking".
https://arxiv.org/abs/1909.02117

Error after 55th epoch while saving the model #5

Closed SravyaMadupu closed 4 years ago

SravyaMadupu commented 4 years ago

The model runs fine for 55 epochs, but at the 55th epoch, when it is to be saved, it throws the error below. Any idea about this error?

epoch 54 total loss 0.2056220420028012 0.00021576289821909886
aida-A micro F1: 0.9303830497860348
aida-B micro F1: 0.9400222965440357
msnbc micro F1: 0.9426166794185157
aquaint micro F1: 0.8755244755244754
ace2004 micro F1: 0.8933601609657947
clueweb micro F1: 0.742094861660079
wikipedia micro F1: 0.7821906663708305
change learning rate to 0.0001
att_mat_diag tok_score_mat_diag entity2entity_mat_diag entity2entity_score_mat_diag knowledge2entity_mat_diag knowledge2entity_score_mat_diag type_emb cnn.weight cnn.bias score_combine.0.weight score_combine.0.bias score_combine.3.weight score_combine.3.bias
save model to model
Traceback (most recent call last):
  File "main.py", line 226, in <module>
    ranker.train(conll.train, dev_datasets, config)
  File "/content/drive/My Drive/data.tar.gz (Unzipped Files)/DCA/ed_ranker.py", line 1032, in train
    self.model.save(self.args.model_path)
  File "/content/drive/My Drive/data.tar.gz (Unzipped Files)/DCA/abstract_word_entity.py", line 78, in save
    json.dump(config, f)
  File "/usr/lib/python3.6/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/usr/lib/python3.6/json/encoder.py", line 430, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.6/json/encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.6/json/encoder.py", line 437, in _iterencode
    o = _default(o)
  File "/usr/lib/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'set' is not JSON serializable

YoungXiyuan commented 4 years ago

This is the first time I have encountered this kind of problem; it has never been reported by other users.

Did the error occur when saving the model for the first time? Maybe it occurred due to an unstable platform environment.

According to the error log "File '/content/drive/My Drive/data.tar.gz (Unzipped Files)/DCA/abstract_word_entity.py', line 78, in save json.dump(config, f)", I guess you may need to place the DCA project in a normal directory instead of inside a .tar.gz file; give it a try. (-:

SravyaMadupu commented 4 years ago

That is just the folder name; it is a normal directory. I unzipped the archive to Google Drive and the name was auto-generated. :-| Maybe the error is because of the changes I made, then.

I made some changes to the code to run it on CPU, because each epoch takes a lot of time on GPU. Strangely, on CPU an epoch finishes in 4 minutes, while on GPU each epoch takes more than 45 minutes. Is this behavior normal?

YoungXiyuan commented 4 years ago

I apologize, but I have never encountered that strange problem before.

First, I would like to check which training method you chose: Supervised Learning or Reinforcement Learning?

Second, is it convenient for you to provide some hardware information about your platform? We trained our framework on a single GeForce GTX 1080 card with 8 GB memory and two Intel(R) Xeon(R) E5-2683 v3 CPUs @ 2.00GHz, with a large memory size (about 384 GB) and SSD storage (about 3.1 TB).

My impression is that the DCA framework should run faster on GPU than on CPU; under the Supervised Learning setting, each epoch takes significantly less than 45 minutes on GPU and much more than 4 minutes on CPU.

Maybe you could keep the revised code running and then check whether the final results are normal or not.

SravyaMadupu commented 4 years ago

I am trying to run supervised learning using the arguments: --mode train --order offset --model_path model --method SL

I am running the code on Google Colab; the configuration is 12 GB of RAM and 64 GB of disk space.

Supervised learning on GPU has been running for more than an hour now and the first epoch is still not complete.

I am using the revised code and am still not able to run it successfully. Also, could you please give me a rough estimate of the time it would take to run the code on the above configuration?

YoungXiyuan commented 4 years ago

I think your basic hardware environment should be sufficient for DCA training, as the DCA framework is not resource-consuming.

Also, do you know how much memory your GPU card has?

Honestly speaking, I am a little confused about the current situation you are facing.

On the one hand, the title of this issue is "Error after 55th epoch while saving the model", which means you have trained the DCA framework for at least 54 epochs. On the other hand, you mentioned that "Supervised learning using GPU is running for more than an hour now and still, the first epoch is not completed yet".

So it seems that you have been running the code for more than two days? But you opened the issue "Unable to run on google colab" 11 hours ago.

So I would like to check: how do you know that the running code is still in the first epoch, from a log file, printed text on the screen, or the output csv file?

Thanks.

SravyaMadupu commented 4 years ago

Sorry for all the confusion.

Let me explain my situation in more detail. For running the code on both CPU and GPU, I am using Google Colab. I opened the previous issue when I was trying to use the code as-is, without any changes. I was using the GPU initially and got a CUDA out-of-memory error.

Then I changed the code so that all computations are done on CPU instead of GPU. Running on CPU, I encountered the error at epoch 55 while saving the model; it took 6 hours to run and stopped with the above error.

Then I tried to run the original code again on GPU and, magically, it started running without any memory issue. However, each epoch now takes a lot of time on GPU.

I started the current session 2 hours ago and the log is as follows:

load conll at ../data/generated/test_train_data
load csv 370United News of India
process coref
load conll
reorder mentions within the dataset
create model
tcmalloc: large alloc 1181786112 bytes == 0xc900000 @ 0x7f30d5b021e7 0x7f30cfec45e1 0x7f30cff2d90d 0x7f30cff2e522 0x7f30cffc5bce 0x50a7f5 0x50cfd6 0x507f24 0x509c50 0x50a64d 0x50cfd6 0x507f24 0x509c50 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 0x507f24 0x50b053 0x634dd2 0x634e87 0x63863f 0x6391e1 0x4b0dc0 0x7f30d56ffb97 0x5b26fa
--- create EDRanker model ---
prerank model
--- create NTEE model ---
--- create AbstractWordEntity model ---
main model
create new model
--- create MulRelRanker model ---
--- create LocalCtxAttRanker model ---
--- create AbstractWordEntity model ---
training...
extracting training data
108 222 288 112 182 211 105 105 recall 1.0

train docs 953

277
recall 0.9772490085577124
aida-A #dev docs 218
108 114
recall 0.9866220735785953
aida-B #dev docs 232
recall 0.9847560975609756
msnbc #dev docs 20
recall 0.9408528198074277
aquaint #dev docs 50
recall 0.914396887159533
ace2004 #dev docs 35
recall 0.9190424959655729
clueweb #dev docs 320
recall 0.93214074512123
wikipedia #dev docs 318
creating optimizer
att_mat_diag tok_score_mat_diag entity2entity_mat_diag entity2entity_score_mat_diag knowledge2entity_mat_diag knowledge2entity_score_mat_diag type_emb cnn.weight cnn.bias score_combine.0.weight score_combine.0.bias score_combine.3.weight score_combine.3.bias
tensor([274474], device='cuda:0')

After this, the cell has been executing for the past 2 hours and memory usage keeps changing, but I still don't see the epoch 0 results.

I am very much confused about what is going on in this case. :-(

YoungXiyuan commented 4 years ago

Thank you for your quick reply.

First, I am sure that the DCA framework is not running successfully on the GPU, judging from the log information you provided, because the loss should be printed after each mini-batch is processed. I think you will find that printed loss information when you run the DCA framework on CPU.

Then I guess the reason for the phenomenon you observed, that memory usage changes frequently, could be an unsuitable Python environment. Please check that you have installed the PyTorch GPU build rather than the CPU-only build.
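A minimal check from a Colab cell (standard PyTorch calls, nothing DCA-specific) would be something like:

    import torch

    # A CPU-only wheel typically reports a version ending in "+cpu" and no CUDA support.
    print(torch.__version__)
    print(torch.version.cuda)          # None for CPU-only builds
    print(torch.cuda.is_available())   # should be True if the GPU is usable
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))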

As to the above error when running on CPU, I have to say it looks strange. The error message says "TypeError: Object of type 'set' is not JSON serializable", but after checking the potentially offending code segment in the file "abstract_word_entity.py" (lines #69 ~ #78), I find that the "config" variable is truly a dict rather than a set.

I guess the error occurred due to an unstable system environment, and I suggest that you try it again and observe the subsequent status.
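If the config dict does end up holding a set value somewhere, json.dump will fail on it even when the set is nested inside a dict. One low-risk workaround is to convert sets to lists before dumping, or to pass a default handler. A minimal sketch, using a hypothetical config and file name rather than DCA's actual ones:

    import json

    config = {"lr": 0.0001, "datasets": {"aida-A", "aida-B"}}  # hypothetical config containing a set

    # json.dump(config, f) would raise:
    #   TypeError: Object of type 'set' is not JSON serializable
    # Converting sets to lists explicitly avoids this ...
    safe_config = {k: sorted(v) if isinstance(v, set) else v for k, v in config.items()}

    # ... or let json handle any remaining sets via a default hook.
    with open("model.config", "w") as f:
        json.dump(safe_config, f, default=list)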

SravyaMadupu commented 4 years ago

Thank you so much for all the help. I will try again on another system and get back to you. Just curious: how much time does it take to run the code on your configuration?

YoungXiyuan commented 4 years ago

Sorry for my late reply.

From what I remember, we spent about half a day training the DCA framework (Supervised Learning) for about 150 epochs, based on the above hardware configuration and the default framework parameters.

SravyaMadupu commented 4 years ago

I was not able to replicate the results even after all the changes. I saved the state_dict file after the 55th epoch, loaded it into the model again with a changed learning rate, and made some changes to the json.dump call. Now it is working. Thank you so much for all the help. I have seen the previous issues too; you are really helpful and very quick to reply. I appreciate your effort. :-D
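For anyone hitting the same problem, resuming from a saved state_dict with a lowered learning rate only needs standard PyTorch calls. A rough sketch, where the checkpoint path and learning rate are placeholders and `model` stands for the already-constructed DCA ranker:

    import torch

    # `model` is assumed to be the already-built ranker model; the file name is hypothetical.
    model.load_state_dict(torch.load("checkpoint_epoch55.state_dict", map_location="cpu"))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # restart with the reduced learning rate
    model.train()
    # ... then re-enter the training loop for the remaining epochs ...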