Closed · cdpierse closed this 3 years ago
During training, there are also memory costs for storing the graph structure and shuffled edge list.
Thanks for responding @classicsong. I expected some additional overhead on top of the entities etc., but whenever I train I regularly hit ~120 GB on a dataset that should be using 56 GB. Is overhead of that size expected, or is something off?
Thankfully I can just about get by as things stand when training on my machine, but I am now limited in my ability to increase the hidden size.
Yes. We create too many temporary objects during training. We will optimize the memory management of DGL-KE.
@classicsong thanks, I think dgl_eval would also benefit from similar memory optimizations. I've been going through the training source trying to find what could be taking up memory needlessly; if you have any idea of where to look, I could possibly try to put in a PR?
You can definitely try to fix it and contribute to DGL-KE. There are two places to look: 1) there are a train dataset, an eval dataset, and a test dataset (https://github.com/awslabs/dgl-ke/blob/b4e57016d5715429377d5aab79e88c451dc543f5/python/dglke/dataloader/sampler.py#L315); the fix is to share the edge list across these datasets; 2) in the main function, check whether any objects can be released earlier by calling del.
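As an illustration of point 1, here is a minimal sketch (with a hypothetical class name, not dgl-ke's actual API) of the idea: the three splits hold references to one shared edge list plus their own index lists, instead of each materializing its own copy of the edges.

```python
class KGDatasetSplit:
    """Hypothetical split wrapper: references a shared edge list, never copies it."""

    def __init__(self, edges, split_idx):
        self.edges = edges          # shared reference, one copy in memory
        self.split_idx = split_idx  # indices belonging to this split

    def triples(self):
        # Materialize only this split's triples on demand.
        return [self.edges[i] for i in self.split_idx]


# One edge list (h, r, t) shared by train/valid/test.
edges = [(0, 0, 1), (1, 0, 2), (2, 0, 3), (3, 0, 0)]
train = KGDatasetSplit(edges, [0, 1])
valid = KGDatasetSplit(edges, [2])
test = KGDatasetSplit(edges, [3])

print(train.edges is valid.edges is test.edges)  # True: no duplicated edge list
```

With ~1 billion edges, duplicating the edge list per split (or per process) is exactly the kind of cost that multiplies with num_proc, so sharing it is a natural first target.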
Hi there,
Thank you for all the fantastic work you are all doing on this library, it's been a huge help for me and the command line tooling is very easy to get up and running. I'm very excited to see what the future holds for dglke's development.
I was wondering if you could help me with some OOM issues I've been running into. My machine has 126 GB of RAM available, which I thought would be enough for training, but the process frequently gets killed due to memory issues. Memory usage seems heavily related to num_proc, which has me confused: from what I read in the main paper, I thought each CPU process would access a shared memory pool for the entity and relation embeddings. I read the comments on issue #162, tried to calculate the memory consumption using the formulas given, and compared it with the memory consumption I'm experiencing.
My dataset has ~45 million entities, 1 relation, and ~1 billion edges. I am currently training with a hidden size of 250, but I would like to go up to 400, which should theoretically be possible with my RAM. Below are the calculations I put together for what I expected my memory usage to look like.
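For reference, a rough back-of-envelope version of that calculation, assuming float32 embeddings and int32 triple ids (both assumptions about the internals, not confirmed from the dgl-ke source):

```python
# Rough expected steady-state memory for the dataset described above,
# under assumed dtypes (float32 embeddings, int32 edge ids).
n_entities = 45_000_000
n_relations = 1
hidden = 250
n_edges = 1_000_000_000

entity_gb = n_entities * hidden * 4 / 1e9    # entity embedding table
relation_gb = n_relations * hidden * 4 / 1e9  # negligible with 1 relation
edges_gb = n_edges * 3 * 4 / 1e9              # (h, r, t) triples as int32

total_gb = entity_gb + relation_gb + edges_gb
print(round(entity_gb), round(edges_gb), round(total_gb))  # 45 12 57
```

That lands near the ~56 GB figure mentioned above, which makes the observed ~120 GB look like roughly a 2x overhead rather than a calculation error.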
It's also worth noting that my dataset is in raw_udd_{hrt} format, and I am hoping to convert it to udd_{hrt}, since there might be a chance that the data processed from the raw_udd step isn't being garbage collected correctly and is taking up space.
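A minimal sketch of that conversion (a hypothetical helper, not dgl-ke's own converter): build the entity and relation id maps once offline and emit integer triples, so the raw-string preprocessing and its dictionaries never exist in the training process at all.

```python
def convert_raw_to_udd(raw_triples):
    """Hypothetical offline step: map string (h, r, t) triples to integer ids,
    which is the form the udd_{hrt} format expects."""
    ent2id, rel2id = {}, {}
    id_triples = []
    for h, r, t in raw_triples:
        hid = ent2id.setdefault(h, len(ent2id))  # assign next free entity id
        rid = rel2id.setdefault(r, len(rel2id))  # assign next free relation id
        tid = ent2id.setdefault(t, len(ent2id))
        id_triples.append((hid, rid, tid))
    return ent2id, rel2id, id_triples


ents, rels, triples = convert_raw_to_udd(
    [("a", "linked", "b"), ("b", "linked", "c")]
)
print(triples)  # [(0, 0, 1), (1, 0, 2)]
```

Run once and write the id maps and integer triples to disk; training then loads only the compact integer files.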
I'd really appreciate any insights or help on this. Thanks.