dice-group / WHALE


Main memory overloading when training using DICE-embeddings library #1

Open sshivam95 opened 3 weeks ago

sshivam95 commented 3 weeks ago

The RAM is getting overloaded because the unique entities and relations are stored in main memory on the GPU nodes of Noctua 1 (180 GB usable main memory) and Noctua 2 (470 GB usable main memory). This leads to an Out-of-Memory (OOM) kill by SLURM.

sshivam95 commented 3 weeks ago

A solution to point 1 is to generate the indices of unique entities and relations beforehand and convert the dataset into an index-transformed dataset. Idea of incremental saving (#2): to avoid the memory-kill issue, once a numpy.memmap reaches a threshold (say 1 million triples), dump it to a backup file (initially a .pickle file) and clear the memory-mapped variable. When the memmap fills up to the threshold again, merge its entries into the data already stored in the pickle file. This updates the mapping in the pickle file without any single variable overloading RAM, which reduces memory usage and avoids a memory-kill error. A minimal sketch of this idea is shown below.
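
A minimal sketch of the incremental-saving idea, assuming a tab-free integer triple encoding; the file names, the threshold value, and the helper functions are illustrative assumptions, not the actual WHALE code:

```python
# Sketch: flush a numpy.memmap buffer of index-encoded triples into a pickle
# backup once it reaches a threshold, then reuse the buffer.
import os
import pickle
import numpy as np

THRESHOLD = 1_000_000                     # triples per flush (example value from the comment)
BACKUP_PATH = "triples_backup.pickle"     # assumed backup file name
MEMMAP_PATH = "triples_buffer.npy"        # assumed memmap file name

buffer = np.memmap(MEMMAP_PATH, dtype=np.int64, mode="w+", shape=(THRESHOLD, 3))
filled = 0

def flush_to_pickle(chunk: np.ndarray) -> None:
    """Append the current chunk to the pickle backup and free the in-memory copy."""
    existing = []
    if os.path.exists(BACKUP_PATH):
        with open(BACKUP_PATH, "rb") as f:
            existing = pickle.load(f)
    existing.append(np.array(chunk))      # copy the data out of the memmap
    with open(BACKUP_PATH, "wb") as f:
        pickle.dump(existing, f)

def add_triple(h: int, r: int, t: int) -> None:
    """Write one index-encoded triple; flush and reset when the threshold is hit."""
    global filled
    buffer[filled] = (h, r, t)
    filled += 1
    if filled == THRESHOLD:
        flush_to_pickle(buffer[:filled])
        buffer[:] = 0                     # clear the memory-mapped buffer
        filled = 0
```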

sshivam95 commented 3 weeks ago

Initially, I ran individual tests on different portions of the dataset to evaluate this approach with a pickle file. It works for smaller datasets of up to 2 million triples but fails beyond that.

sshivam95 commented 3 weeks ago

Alternative solution: Issue 2 comment

sshivam95 commented 3 weeks ago

Another proposal is to use the mmappickle library, which is designed for "unstructured" parallel access with a strong emphasis on adding new data. #4

Issues:

- The indexing is done directly against a memory-mapped file, in the form of dictionaries, using mmappickle.mmapdict (see the sketch after this list).
- Writing to a memory-mapped file on the parallel file system of the Noctua clusters is very slow because Lustre handles memory-mapped files poorly. #5
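
A minimal sketch of the mmappickle approach described above; the file path and keys are illustrative assumptions, not the actual indexing code:

```python
# Sketch: write entity/relation indices straight into a memory-mapped dict on disk.
import numpy as np
from mmappickle import mmapdict

index = mmapdict("entity_index.mmdpickle")   # file lives on the (slow) Lustre file system
# store the integer id assigned to each unique entity / relation
index["http://dbpedia.org/resource/Berlin"] = 0
index["http://dbpedia.org/ontology/capitalOf"] = 1
# numpy arrays can also be stored without loading the whole dict into RAM
index["train_chunk_0"] = np.zeros((10, 3), dtype=np.int64)
```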

sshivam95 commented 3 weeks ago

Another solution is to use the DGX partition nodes, which have ~10 TB of fast local NVMe SSD storage and 8 GPUs each; the data can be kept either in memory or on the SSDs. A sketch of using the local SSD is shown below.
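
A hedged sketch of pointing the memmap at the DGX node's local NVMe scratch instead of Lustre; the mount point below is an assumption and depends on the cluster configuration:

```python
# Sketch: back the training buffer with the node-local NVMe SSD rather than Lustre.
import numpy as np

LOCAL_SCRATCH = "/scratch/local/whale"        # assumed NVMe-SSD mount on the DGX node
train_set = np.memmap(f"{LOCAL_SCRATCH}/train_set.npy",
                      dtype=np.int64, mode="w+", shape=(10_000_000, 3))
# ... fill train_set chunk by chunk; writes hit the fast local SSD, not Lustre
train_set.flush()
```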

After running the training test on one chunk (10 million triples) using the dice-embeddings library, we get the following file sizes:

The estimated file sizes for the full dataset (57 billion triples):

sshivam95 commented 2 weeks ago

A workaround is to create the indexed train_set.npy beforehand rather than letting dice-embeddings create it with its B+ tree implementation in C++.
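
A rough sketch of building the index-transformed train_set.npy beforehand, assuming a tab-separated triples file; entity_to_idx and relation_to_idx are built here only for illustration and are not the dice-embeddings internals:

```python
# Sketch: map entity/relation strings to integer ids and save the triples as int64 arrays.
import numpy as np

entity_to_idx, relation_to_idx, triples = {}, {}, []

with open("train.txt") as f:                       # assumed TSV triples: subject \t predicate \t object
    for line in f:
        s, p, o = line.rstrip("\n").split("\t")
        si = entity_to_idx.setdefault(s, len(entity_to_idx))
        pi = relation_to_idx.setdefault(p, len(relation_to_idx))
        oi = entity_to_idx.setdefault(o, len(entity_to_idx))
        triples.append((si, pi, oi))

np.save("train_set.npy", np.asarray(triples, dtype=np.int64))
```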

sshivam95 commented 2 weeks ago

Update: Issue #9 creates a workaround for training embedding models from individual graphs by splitting the dataset by domain. A triple's domain is defined as the authority base URL of its namespace. The dataset is split into separate files by domain name, and the models are then trained on these smaller graphs.
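
A hedged sketch of the domain-based split: group triples by the authority (netloc) of the subject URI and write one file per domain. The file layout and the choice of the subject's authority are assumptions for illustration:

```python
# Sketch: bucket triples by the authority base URL of the subject and write one file per domain.
from collections import defaultdict
from urllib.parse import urlparse

buckets = defaultdict(list)

with open("train.txt") as f:                        # assumed TSV triples file
    for line in f:
        s, p, o = line.rstrip("\n").split("\t")
        domain = urlparse(s).netloc or "no_domain"  # authority base URL of the namespace
        buckets[domain].append(line)

for domain, lines in buckets.items():
    with open(f"train_{domain.replace('.', '_')}.txt", "w") as out:
        out.writelines(lines)
```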