awslabs / dgl-ke

High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.
https://dglke.dgl.ai/doc/
Apache License 2.0

Need DGL-KE to optimize its memory overhead during training. #220

Closed: cdpierse closed this issue 3 years ago

cdpierse commented 3 years ago

Hi there,

Thank you for all the fantastic work you are doing on this library; it's been a huge help for me, and the command-line tooling is very easy to get up and running. I'm very excited to see what the future holds for DGL-KE's development.

I was wondering if you could help me with some OOM issues I've been running into. My machine has 126GB of RAM available, which I thought would be enough for training, but the process frequently gets killed due to memory issues. Memory usage seems heavily tied to num_proc, which confuses me: from the main paper I understood that each CPU process accesses a shared memory pool for the entity and relation embeddings.

I read the comments on issue #162, tried to calculate the memory consumption using the formulas given there, and compared it with the consumption I am actually seeing.

My dataset has ~45 million entities, 1 relation, and ~1 billion edges. I am currently training with a hidden size of 250, but I would like to go up to 400, which should theoretically be possible with my RAM. Below are the calculations for what I expected my memory usage to look like.

entity_embed_mem_size   = 45,000,000    * 4 * 250 / 1024 / 1024 / 1024 ≈ 42 GB
relation_embed_mem_size = 1             * 4 * 250 / 1024 / 1024 / 1024 ≈ 0 GB
graph_mem_size          = 1,000,000,000 * 2 * 8   / 1024 / 1024 / 1024 ≈ 15 GB

Expected memory usage ≈ 57 GB
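
For reference, here is the same estimate as a small Python snippet (the per-element sizes, 4 bytes for float32 embeddings and 8 bytes per int64 edge endpoint, are my assumptions based on the formulas quoted from #162):

```python
# Rough memory estimate for the embedding tables and the edge list,
# following the formulas discussed in #162. Assumptions: float32 embeddings,
# int64 head/tail ids per edge.
GB = 1024 ** 3

def estimate_memory_gb(num_entities, num_relations, num_edges, hidden_size):
    entity_embed = num_entities * 4 * hidden_size / GB      # float32 entity embeddings
    relation_embed = num_relations * 4 * hidden_size / GB    # float32 relation embeddings
    graph = num_edges * 2 * 8 / GB                           # int64 head/tail per edge
    return entity_embed, relation_embed, graph

if __name__ == "__main__":
    e, r, g = estimate_memory_gb(45_000_000, 1, 1_000_000_000, hidden_size=250)
    print(f"entities: {e:.1f} GB, relations: {r:.3f} GB, edges: {g:.1f} GB, "
          f"total: {e + r + g:.1f} GB")
    # -> roughly 41.9 + 0.0 + 14.9, i.e. about 57 GB, well under 126 GB on paper
```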

It's also worth noting that my dataset is raw_udd_{hrt}, and I am hoping to convert it to udd_{hrt}, as there is a chance that the data processed from the raw_udd format isn't being garbage collected correctly and is taking up space.
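
In case it helps, this is roughly the kind of one-off pre-mapping I have in mind (just a sketch; the exact file names and column layout that udd_{hrt} expects should be checked against the DGL-KE docs, so raw_train.tsv, train.tsv, entities.tsv and relations.tsv here are assumptions):

```python
# Hypothetical one-off conversion from raw string triples (head \t rel \t tail
# per line) to integer-id triples, so the mapping doesn't have to be built in
# memory at training time. File names and formats are assumptions; check the
# udd_{hrt} documentation for what dglke_train actually expects.
entity_ids, relation_ids = {}, {}

def get_id(table, key):
    # Assign ids in first-seen order.
    if key not in table:
        table[key] = len(table)
    return table[key]

with open("raw_train.tsv") as fin, open("train.tsv", "w") as fout:
    for line in fin:
        h, r, t = line.rstrip("\n").split("\t")
        fout.write(f"{get_id(entity_ids, h)}\t{get_id(relation_ids, r)}\t{get_id(entity_ids, t)}\n")

# Write the id -> name mappings alongside the remapped triples.
with open("entities.tsv", "w") as f:
    for name, idx in entity_ids.items():
        f.write(f"{idx}\t{name}\n")
with open("relations.tsv", "w") as f:
    for name, idx in relation_ids.items():
        f.write(f"{idx}\t{name}\n")
```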

I'd really appreciate any insights or help on this. Thanks.

classicsong commented 3 years ago

During training, there are also memory costs for storing the graph structure and shuffled edge list.
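
To give a rough sense of scale (an illustration only, not DGL-KE's actual code): a shuffled copy of a 1-billion-edge int64 edge list costs about as much again as the edge list itself, plus the permutation array used to shuffle it.

```python
# Back-of-the-envelope cost of keeping a shuffled copy of the edge list,
# assuming int64 head/tail ids (not actual DGL-KE internals).
GB = 1024 ** 3
num_edges = 1_000_000_000

edge_bytes = num_edges * 2 * 8   # original head/tail arrays
perm_bytes = num_edges * 8       # int64 permutation used for shuffling

print(f"edge list: {edge_bytes / GB:.1f} GB, "
      f"shuffled copy: {edge_bytes / GB:.1f} GB, "
      f"permutation: {perm_bytes / GB:.1f} GB")
# A shuffled copy roughly doubles the ~15 GB edge-list footprint on its own.
```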

cdpierse commented 3 years ago

Thanks for responding @classicsong. I expected some additional overhead on top of the entities etc., but whenever I train I regularly hit ~120GB on a dataset that should be using about 57GB. Is that much overhead expected?

Thankfully I can just about get by as things stand when training on my machine, but it limits my ability to increase the hidden size.

(Screenshot: memory usage during training, 2021-06-19 19:00:51)

classicsong commented 3 years ago

Yes. We create too many temporary objects during training. We will optimize the memory management of DGL-KE.

cdpierse commented 3 years ago

@classicsong thanks. I think dglke_eval would also benefit from similar memory optimizations. I've been going through the training source trying to find what could be taking up memory needlessly; if you have any idea of where to look, I could possibly try to put in a PR?

classicsong commented 3 years ago

You can definitely try to fix it and contribute to DGL-KE. There are two places you can look at:

1. There are separate train, eval, and test datasets (https://github.com/awslabs/dgl-ke/blob/b4e57016d5715429377d5aab79e88c451dc543f5/python/dglke/dataloader/sampler.py#L315); the idea is to share the edge list across these datasets instead of duplicating it.
2. In the main function, we can check whether any objects can be released earlier by calling del.
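
For point 2, the pattern would look something like this (a sketch only with placeholder logic, not the actual dglke_train main function):

```python
import gc
import numpy as np

def load_edges():
    # Stand-in for reading/remapping the raw triples; placeholder data only.
    return np.random.randint(0, 1000, size=(10_000, 3), dtype=np.int64)

def main():
    raw_edges = load_edges()
    heads = raw_edges[:, 0].copy()
    rels = raw_edges[:, 1].copy()
    tails = raw_edges[:, 2].copy()

    # Once the columns have been copied out, the combined array is no longer
    # needed, so drop the reference before building samplers / starting training.
    del raw_edges
    gc.collect()

    print(heads.shape, rels.shape, tails.shape)

if __name__ == "__main__":
    main()
```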