dice-group / WHALE


Test on memory map approach using `mmappickle.mmapdict` on `/dev/shm` instead of PFS on Clusters #5

Closed sshivam95 closed 5 months ago

sshivam95 commented 5 months ago

The memory-map approach is very slow at processing the index dictionary in the memory-mapped file. It took 3 days to process 41,602 of the 5,037,674 entities in a chunk of 10 million triples.

sshivam95 commented 5 months ago

4 runs on the parallel file system of the Noctua clusters, which uses Lustre. After a discussion with them, it turns out that Lustre handles memory-mapped files very poorly. Therefore, storing the memory-mapped files in `/dev/shm` should do the trick.
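To illustrate the idea of placing the memory-mapped file on `/dev/shm`, here is a minimal sketch. It uses the standard-library `mmap` module rather than `mmappickle.mmapdict`, and the helper names `pick_mmap_dir` and `write_and_read_back` are hypothetical, not part of WHALE. It assumes that falling back to the system temp directory is acceptable when `/dev/shm` is missing or not writable.

```python
import mmap
import os
import tempfile
from pathlib import Path


def pick_mmap_dir(preferred: str = "/dev/shm") -> Path:
    """Return `preferred` if it is a writable directory, else the system temp dir."""
    candidate = Path(preferred)
    if candidate.is_dir() and os.access(candidate, os.W_OK):
        return candidate
    return Path(tempfile.gettempdir())


def write_and_read_back(payload: bytes) -> bytes:
    """Write `payload` through a memory map and read it back (payload must be non-empty)."""
    fd, name = tempfile.mkstemp(dir=pick_mmap_dir(), suffix=".bin")
    try:
        # An empty file cannot be mapped, so size it before mapping.
        os.ftruncate(fd, len(payload))
        with mmap.mmap(fd, len(payload)) as mm:
            mm[:] = payload      # write through the mapping
            mm.flush()           # push changes to the backing file
            return bytes(mm[:])  # read back through the same mapping
    finally:
        os.close(fd)
        os.unlink(name)          # remove the scratch file again


if __name__ == "__main__":
    print(pick_mmap_dir())
    print(write_and_read_back(b"entity-index"))
```

Files under `/dev/shm` live in RAM, so they count against the node's memory and do not survive a reboot; anything that must persist has to be copied back to the parallel file system afterwards.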

sshivam95 commented 5 months ago

Update: writing to the memory-mapped pickle dictionary in `/dev/shm` is much faster than on Lustre (roughly 10 times the throughput) but still very slow overall. It took 1 day to process 146,893 of the 5,037,674 entities.
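To put the two measurements side by side, the following sketch computes the throughput on Lustre and on `/dev/shm` and projects the time to finish one chunk. It assumes the processing rate stays constant over the whole chunk, which the measurements above do not confirm; the function names are only for illustration.

```python
TOTAL_ENTITIES = 5_037_674  # entities in the 10-million-triple chunk


def entities_per_day(processed: int, days: float) -> float:
    """Average throughput over the measured period."""
    return processed / days


def days_for_chunk(rate: float, total: int = TOTAL_ENTITIES) -> float:
    """Projected days to process the whole chunk at a constant rate (assumption)."""
    return total / rate


lustre_rate = entities_per_day(41_602, 3)  # measured on Lustre
shm_rate = entities_per_day(146_893, 1)    # measured on /dev/shm
speedup = shm_rate / lustre_rate

if __name__ == "__main__":
    print(f"Lustre:   {lustre_rate:,.0f} entities/day, "
          f"{days_for_chunk(lustre_rate):.0f} days per chunk")
    print(f"/dev/shm: {shm_rate:,.0f} entities/day, "
          f"{days_for_chunk(shm_rate):.0f} days per chunk")
    print(f"Speedup:  {speedup:.1f}x")
```

Under that assumption `/dev/shm` is about 10.6 times faster, yet a single chunk would still take roughly 34 days (versus roughly 363 days on Lustre).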

sshivam95 commented 5 months ago

Alternative solution: implement a B+ tree in C++.

sshivam95 commented 5 months ago

Update: this approach might not be needed if domain-specific datasets are used (see issue #9).