Closed: sshivam95 closed 5 months ago
Issue: pickle offers no way to update a file in place. To change the data, the whole file must first be loaded into a variable, updated, and then written back out, which runs into the same RAM-overshoot problem.
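A minimal sketch of the problem described above, using a hypothetical index file (the dict contents and file name are illustrative, not taken from the project):

```python
import pickle
import tempfile
import os

# Hypothetical index file: a pickled dict mapping entity -> triple ids.
path = os.path.join(tempfile.mkdtemp(), "index.pkl")
with open(path, "wb") as f:
    pickle.dump({"e1": [0, 1]}, f)

# To add a single entry, the WHOLE dict must be loaded into RAM first:
with open(path, "rb") as f:
    index = pickle.load(f)   # entire file materialised in memory
index["e2"] = [2]            # the actual update is trivial
with open(path, "wb") as f:
    pickle.dump(index, f)    # and the whole dict is re-written

with open(path, "rb") as f:
    assert pickle.load(f) == {"e1": [0, 1], "e2": [2]}
```

The peak memory cost of the update is the size of the full index, regardless of how small the change is.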
Alternative solution: use a (key, value) database like shelve to store the indices. Commit
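With shelve (from the standard library), individual keys can be written and updated without loading the rest of the index into memory. A minimal sketch with a hypothetical file name and entries:

```python
import shelve
import tempfile
import os

# Hypothetical on-disk index; shelve keys must be strings.
path = os.path.join(tempfile.mkdtemp(), "triple_index")

# Entries are persisted one key at a time.
with shelve.open(path) as db:
    db["e1"] = [0, 1]

# Reopen later and update in place, without reading the whole index:
with shelve.open(path) as db:
    db["e2"] = [2]

with shelve.open(path) as db:
    assert db["e1"] == [0, 1] and db["e2"] == [2]
```

Note that values are still pickled individually under the hood, so each value must fit in memory, but the index as a whole never has to.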
New tests are running which worked successfully on small portions of the dataset (745 million triples). However, reading the whole dataset is very slow: the current test run on the full dataset has been running for 3 days and still has not read 5% of the data.
Usage of mmappickle.mmapdict showed progress on a file with a smaller number of triples #4
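For reference, basic mmapdict usage looks roughly like the sketch below: assignments are written to a memory-mapped file rather than held in RAM. mmappickle is a third-party package (`pip install mmappickle`), so the sketch guards the import; the file name and keys are illustrative.

```python
import os
import tempfile

try:
    # Third-party dependency: pip install mmappickle
    from mmappickle import mmapdict
    have_mmappickle = True
except ImportError:
    have_mmappickle = False

if have_mmappickle:
    path = os.path.join(tempfile.mkdtemp(), "triple_index.mmdpickle")
    m = mmapdict(path)
    m["e1"] = [0, 1]          # written straight to the mapped file
    # Reopening the same file sees the stored entries:
    assert mmapdict(path)["e1"] == [0, 1]
```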
From the issue #1 comment, this approach will use incremental saving to pickle files: build a dictionary in main memory up to a threshold number of triples, e.g., 10 million (1 chunk), then dump it all to a pickle file.
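The chunked approach above can be sketched as follows. Multiple pickle frames are appended to one file, and reading back loads one chunk at a time, so peak memory stays bounded by the chunk size (here a tiny threshold stands in for the 10-million-triple one; the triple data is made up):

```python
import pickle
import tempfile
import os

CHUNK_SIZE = 3   # stand-in for the 10-million-triple threshold
path = os.path.join(tempfile.mkdtemp(), "index_chunks.pkl")

triples = [("s%d" % i, "p", "o%d" % i) for i in range(8)]

# Fill a dict up to CHUNK_SIZE entries, append-dump it, start fresh.
chunk = {}
with open(path, "wb") as f:
    for i, t in enumerate(triples):
        chunk[i] = t
        if len(chunk) == CHUNK_SIZE:
            pickle.dump(chunk, f)   # one pickle frame per chunk
            chunk = {}
    if chunk:                        # flush the final partial chunk
        pickle.dump(chunk, f)

# Reading back: load chunk after chunk until EOF.
merged = {}
with open(path, "rb") as f:
    while True:
        try:
            merged.update(pickle.load(f))
        except EOFError:
            break

assert len(merged) == 8
```

Only one chunk is ever fully resident in memory during writing; the read side can likewise process each chunk and discard it instead of merging, if a full in-memory index is not needed.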