dalab / deep-ed

Source code for the EMNLP'17 paper "Deep Joint Entity Disambiguation with Local Neural Attention", https://arxiv.org/abs/1704.04920
Apache License 2.0

Out of Memory when training All Wikipedia entities using GPU #19

Closed hitercs closed 5 years ago

hitercs commented 5 years ago

Hi,

Thanks for your work. I want to know the maximum GPU memory consumption when training all Wikipedia entities. I have tried a single Tesla P100 (16 GB) and 4 x Tesla M60 (4 x 8 = 32 GB); both runs fail with an out-of-memory error. Can you estimate the maximum GPU memory needed to train all Wikipedia entities?

Thanks.

hitercs commented 5 years ago

What is strange is that the code still runs out of memory even when I set the batch size to 1.

hitercs commented 5 years ago

I ran with the following setting:

CUDA_VISIBLE_DEVICES=0,1,2,3 th entities/learn_e2v/learn_a.lua -root_data_dir $DATA_PATH -entities ALL -batch_size 1 |& tee log_train_entity_vecs

The log message is:

===> RUN TYPE: cudacudnn
==> switching to CUDA (GPU)
Found Environment variable CUDNN_PATH = /usr/local/cuda/lib64/libcudnn.so.5
==> Loading relatedness validate
---> from t7 file.
==> Loading relatedness test
---> from t7 file.
==> Loading relatedness thid tensor
---> from t7 file.
Done loading relatedness sets. Num queries test = 3319. Num queries valid = 3673. Total num ents restricted set = 276031
==> Loading entity wikiid - name map
---> from t7 file: data/generated/ent_name_id_map.t7
Done loading entity name - wikiid. Size thid index = 4306070
==> Loading common w2v + top freq list of words
---> from t7 file.
==> Loading word freq map with unig power 0.6
Done loading word freq index. Num words = 491413; total freq = 774609376
==> Loading w2v vectors
---> from t7 file.
Done reading w2v data. Word vocab size = 491413

==> Init entity embeddings matrix. Num ents = 4306070
Init entity embeddings with average of title word vectors to speed up learning. Done init.
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-2735/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory

hitercs commented 5 years ago

By the way, when I train in CPU mode, the estimated training time is about 6-7 days. By inspecting the code, this estimate corresponds to 4 passes over the entire Wikipedia. Is that sufficient for training ALL entities, given that you train the entities belonging to the standard datasets for more than 69 epochs? Since your paper states that different entities are trained independently, do you have any suggestions for scalable training of the whole set of Wikipedia entity embeddings? Thanks a lot.

octavian-ganea commented 5 years ago

Thanks for your interest in our work!

First, I have to say that it's been a long time since I touched this code, so I might not remember everything well. Moreover, some variable names are a bit unfortunate, as you can see below, for which I'm sorry ...

There are not 4 passes over the entire Wiki, but way more. In the code for training entity embeddings, I do batch updates, where each example is an entity hyperlink appearing in Wikipedia together with its contextual words in a window of fixed size (parameter num_words_per_ent = 20). The optimal parameters for this are shown in step 14 of the README. I only use hyperlinks for the entities in the RLTD set, which contains the 276031 entities appearing in the test and validation datasets. Please also see: https://github.com/dalab/deep-ed/issues/9 . For the full training of these entity embeddings, I do 69 epochs, each epoch containing num_batches_per_epoch = 2000 batches. Each minibatch has size 500, i.e. 500 hyperlinks and their contexts. The train_size = 17000000 parameter you saw is just for displaying progress; it does not control the length of training (in fact, the training loop runs indefinitely, but I stopped it after 70 epochs).
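So, as a rough back-of-the-envelope count using the numbers above (a sketch, not something printed by the code), the total number of hyperlink-context examples seen over the full run is:

```lua
-- Rough count of hyperlink-context examples seen during entity embedding
-- training, using the parameters quoted in this comment:
local num_epochs = 69
local num_batches_per_epoch = 2000
local batch_size = 500
print(num_epochs * num_batches_per_epoch * batch_size)  -- 69000000 examples
```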

The code was never tested on multiple GPUs; I doubt it would work without modifications.

Concerning the OOM, this code is not optimized enough to work with 4M entities on GPU. The 'ALL' option cannot be used on GPUs with the current embedding size of 300. Keep in mind that the lookup-table memory has to be allocated twice: once for the actual table and once for its gradient.

Hope it helps,

octavian-ganea commented 5 years ago

Unless I missed something, the current lookup table and its gradient table should occupy: 2 x 276031 x 300 (emb size) x 4 (bytes per float) bytes.
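For comparison, the same back-of-the-envelope estimate (float32 storage, table plus gradient, embedding size 300) for both the RLTD set and the full 'ALL' set of entities, using the entity counts quoted earlier in this thread; treat this as a rough sketch, not a measurement, since it ignores optimizer state and other buffers:

```lua
-- Rough lookup-table memory estimate: table + gradient, float32, emb size 300.
local emb_size, bytes_per_float = 300, 4
local function lookup_mem_gb(num_ents)
  return 2 * num_ents * emb_size * bytes_per_float / 2^30
end
print(lookup_mem_gb(276031))   -- ~0.6 GB for the RLTD entities
print(lookup_mem_gb(4306070))  -- ~9.6 GB for ALL entities, before any optimizer state
```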

hitercs commented 5 years ago

Thanks for your reply! In order to train the entire set of Wikipedia entity embeddings with this code, do you think it is possible to divide the full set of Wikipedia entities into several partitions, where each partition contains roughly 200K entities, and then train each partition separately on a different GPU? Thanks.

octavian-ganea commented 5 years ago

I actually believe the partitions can be much larger, e.g. > 1M entities each. Yes, that would definitely be possible; I just don't have code for this ... But any contribution would be very appreciated!
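For anyone attempting this later, here is a minimal sketch of what the partitioning step could look like in plain Lua; the helper name and the round-robin split are purely illustrative and are not part of this repo:

```lua
-- Hypothetical helper: round-robin split of an entity id list into k partitions,
-- each small enough to fit its lookup table (plus gradient) on one GPU.
local function split_entities(ent_ids, k)
  local parts = {}
  for i = 1, k do parts[i] = {} end
  for idx, id in ipairs(ent_ids) do
    table.insert(parts[((idx - 1) % k) + 1], id)
  end
  return parts
end

-- Example: ~4.3M entity ids split into 4 partitions of ~1.08M each, to be
-- trained independently and concatenated afterwards.
local ids = {}
for i = 1, 4306070 do ids[i] = i end
local parts = split_entities(ids, 4)
print(#parts[1])
```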

hitercs commented 5 years ago

Great, thanks.

Jorigorn commented 5 years ago

Hi @hitercs It would be nice if you could share the training code. I would also like to train entity vectors for the whole of Wikipedia as well as for my own private entities. Thanks.

Jorigorn commented 5 years ago

Hi @octavian-ganea I am wondering if it makes sense to replace this component with https://github.com/wikipedia2vec/wikipedia2vec ?

I would like to train my own private entity embeddings, for example for company-name entities.

Thanks a lot. :)

octavian-ganea commented 5 years ago

I didn't try to use these wikipedia2vec embeddings in conjunction with our entity disambiguation network, but I do compare against that paper (see, for example, Table 1 in our paper).

However, this wikipedia2vec approach is more expensive to train than ours, since it requires three Word2Vec-like loss functions and also entity-entity co-occurrence statistics in order to work well. So one needs to train the full universe of entity embeddings at once, which might be bad if one does not have enough memory or enough data for specific entities, or simply if one is interested only in a specific subset of entities.

In comparison, our method trains each entity embedding separately from the other entities. It relies only on entity-word co-occurrence statistics, which are cheaper to obtain (for example, one can train entity embeddings solely from an entity's description text, without relying on entity-entity co-occurrence statistics). Moreover, even though it uses less data, Table 1 of our paper shows that it outperforms the wikipedia2vec approach. However, our code is unfortunately not optimized and tailored enough to be used out-of-the-box on new sets of entities, though with a bit of help from the community this could easily become possible. Unfortunately, during my PhD I switched to different topics and I am no longer maintaining this codebase ...

Hope it helps,

hitercs commented 5 years ago

Hi @octavian-ganea, I was able to train entity embeddings for the whole of Wikipedia by splitting the entities into five partitions (each containing more than 1M entities). Each partition fits on a single 8 GB GPU card. However, since I am not familiar with Lua, my code is tailored to my application, so I can't create a pull request that would be useful for general purposes. Sorry about that. Thanks a lot.