dice-group / WHALE


Incremental saving approach #2

Closed sshivam95 closed 2 weeks ago

sshivam95 commented 2 weeks ago

Following the comment on issue #1, this approach uses incremental saving to pickle files. It builds a dictionary in main memory up to a threshold number of triples, e.g., 10 million (one chunk), then dumps the whole dictionary to a pickle file.
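A minimal sketch of the chunked dump described above, not the actual WHALE code. The function names `dump_in_chunks` and `_write_chunk`, the file naming scheme, and the index layout (subject mapped to a list of (predicate, object) pairs) are all assumptions made for illustration.

```python
import pickle
from pathlib import Path


def dump_in_chunks(triples, out_dir, chunk_size=10_000_000):
    """Index triples in memory and write one pickle file per chunk.

    Hypothetical helper: the index layout (subject -> [(predicate, object)])
    is an assumption, not taken from the repository.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    index = {}
    paths = []
    count = 0
    for s, p, o in triples:
        index.setdefault(s, []).append((p, o))
        count += 1
        if count == chunk_size:
            # Threshold reached: flush the chunk and start a fresh dictionary
            paths.append(_write_chunk(index, out_dir, len(paths)))
            index = {}
            count = 0
    if index:
        # Flush the final, partially filled chunk
        paths.append(_write_chunk(index, out_dir, len(paths)))
    return paths


def _write_chunk(index, out_dir, chunk_id):
    """Serialize one chunk dictionary to its own pickle file."""
    path = out_dir / f"chunk_{chunk_id:05d}.pkl"
    with path.open("wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
    return path
```

In this sketch only one chunk is held in memory at a time, but the same subject can end up in several chunk files, so a lookup has to merge results across files.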

sshivam95 commented 2 weeks ago

Issue: pickle offers no way to update a file in place. To change the data in the file, the file must first be loaded into a variable, updated with the new data, and written back. This leads to the same RAM overshooting problem.
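The round trip behind this problem can be illustrated with a small sketch. The helper name `update_pickle` is hypothetical, and it assumes the pickle file holds a single dictionary.

```python
import pickle


def update_pickle(path, new_entries):
    """Merge new_entries into the dictionary stored at path.

    Hypothetical helper, assuming the file contains one pickled dict.
    """
    # The whole stored dictionary has to be deserialized into RAM first ...
    with open(path, "rb") as f:
        index = pickle.load(f)
    index.update(new_entries)
    # ... and the whole dictionary is serialized again when writing back.
    with open(path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
    return len(index)
```

Peak memory use here grows with the size of the stored dictionary, not with the size of `new_entries`, which is why updating an existing file does not avoid the memory limit.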

sshivam95 commented 2 weeks ago

Alternative solution: use a key-value database such as shelve to store the indices. Commit

New tests worked successfully on small portions of the dataset (745 million triples). However, reading the whole dataset is very slow: a test run on the whole dataset has been running for 3 days and has still not read 5% of the data.
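A minimal sketch of how indices might be stored with the standard-library `shelve` module. This is not the code from the linked commit; the function name `index_triples` and the index layout (subject mapped to a list of (predicate, object) pairs) are assumptions.

```python
import shelve


def index_triples(triples, db_path):
    """Store a subject -> [(predicate, object)] index in a shelve database.

    Hypothetical helper; shelve keys must be strings.
    """
    with shelve.open(db_path) as db:
        for s, p, o in triples:
            # Without writeback, a stored list cannot be mutated in place:
            # read it, append, and assign it back to persist the change.
            entries = db.get(s, [])
            entries.append((p, o))
            db[s] = entries
```

Only the entry being updated is held in memory, so RAM use stays low. On the other hand, every triple triggers a read, an unpickle, a pickle, and a write on disk, which may be one contributor to the slow full-dataset run reported above.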

sshivam95 commented 2 weeks ago

Using mmappickle.mmapdict showed progress on a file with a smaller number of triples #4

sshivam95 commented 2 weeks ago

Issues with mmappickle.mmapdict