PathwayCommons / factoid

A project to capture biological pathway data from academic papers
https://biofactoid.org
MIT License

Size of the RethinkDB database is expanding at a rapid rate #1276

Open jvwong opened 2 weeks ago

jvwong commented 2 weeks ago

The RethinkDB data dump is quite large (>200 MB). This is a problem because it lengthens a data dump or restore, which now takes on the order of minutes. This is despite there being fewer than 400 public Documents.

I experimented with a recent dump by removing various items and measuring the resulting dump size (.tar.gz); the results are summarized in Table 1. It appears that pruning the _ops field (which stashes every action performed) and the relatedPapers field can together reduce the size by almost 88% (52.5% + 35.1%). Given this, some possible solutions to keep the size reasonable:

Related: https://github.com/PathwayCommons/factoid/issues/937#issuecomment-898502770

Table 1. Biofactoid RethinkDB dump file sizes

| Description | Size (MB) | % Change | Comments |
| --- | --- | --- | --- |
| *Full DB (June 19, 2024) | 202 | 0.0% | Counts: 4649 Documents and 6813 Elements |
| **Remove _ops | 96 | -52.5% | |
| **Remove relatedPapers | 131 | -35.1% | |
| Remove trashed and initiated Documents | 173 | -14.4% | Removed 4226 Documents and 2192 Elements |

*Dump archive: factoid_dump_2024-06-19_14-28-33-767.tar.gz
**Actions applied to Document and Element tables
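For anyone who wants to repeat the comparison, here is a minimal Python sketch of the arithmetic behind Table 1, plus a rough way to estimate how much a field contributes to compressed size. It is not the procedure used to produce the table: the helper names (`pct_change`, `strip_fields`, `gzip_size`) are hypothetical, and it assumes the documents are available as plain JSON-serializable dicts rather than operating on a real dump archive.

```python
import gzip
import json


def pct_change(before_mb: float, after_mb: float) -> float:
    """Percent change in size relative to the full dump."""
    return (after_mb - before_mb) / before_mb * 100.0


def strip_fields(docs, fields):
    """Return copies of the documents without the given top-level fields."""
    return [{k: v for k, v in doc.items() if k not in fields} for doc in docs]


def gzip_size(docs) -> int:
    """Size in bytes of the documents serialized as JSON and gzip-compressed."""
    return len(gzip.compress(json.dumps(docs).encode("utf-8")))


# Sizes (MB) as reported in Table 1.
FULL_MB = 202
TABLE_1 = {
    "Remove _ops": 96,
    "Remove relatedPapers": 131,
    "Remove trashed and initiated Documents": 173,
}

if __name__ == "__main__":
    # Recompute the "% Change" column from the reported sizes.
    for label, size_mb in TABLE_1.items():
        print(f"{label}: {pct_change(FULL_MB, size_mb):.1f}%")
```

Calling `gzip_size(strip_fields(docs, {"_ops", "relatedPapers"}))` on a sample of exported documents and comparing it with `gzip_size(docs)` gives a quick estimate of the combined saving. Note that the "almost 88%" figure adds the two separate measurements, so the actual combined reduction may differ somewhat.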