The RethinkDB data dump is quite large (>200 MB). This is problematic because it lengthens the duration of a data dump or restore (on the order of minutes now), despite the fact that there are fewer than 400 public Documents.
I did a bit of experimenting with a recent dump by removing various items and measuring the resulting dump size (.tar.gz), summarized in Table 1. It appears that pruning the `_ops` field (which stashes every action performed) and the `relatedPapers` field can drop the size by almost 88%. Given this, some possible ways to keep the size reasonable:
- `_ops`: periodically prune manually on a local dump, then restore (see the first sketch below)
- `relatedPapers`: store PMIDs (rather than full paper details) and let the browser retrieve them on demand (see the second sketch below)
  - We do this with app-ui and it is pretty reasonable
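For the `_ops` pruning, a minimal sketch of what the manual pass could look like, assuming the JavaScript RethinkDB driver and hypothetical database/table names (`factoid`, `document`, `element`) that haven't been checked against the actual schema:

```ts
// Sketch: strip the `_ops` field from every row in the tables that carry it.
// Assumes a local RethinkDB instance holding a restored dump, e.g.:
//   rethinkdb restore factoid_dump_2024-06-19_14-28-33-767.tar.gz
// Database and table names below are assumptions, not confirmed.
import r from 'rethinkdb';

async function pruneOps(): Promise<void> {
  const conn = await r.connect({ host: 'localhost', port: 28015, db: 'factoid' });
  try {
    for (const table of ['document', 'element']) {
      // replace() with without() rewrites each row minus its `_ops` field;
      // rows that lack the field are returned unchanged.
      const result = await r.table(table)
        .replace((row: any) => row.without('_ops'))
        .run(conn);
      console.log(`${table}: replaced ${result.replaced}, unchanged ${result.unchanged}`);
    }
  } finally {
    await conn.close();
  }
}

pruneOps().catch(err => {
  console.error(err);
  process.exit(1);
});
```

After the prune, a fresh `rethinkdb dump` of the local instance would produce the smaller archive to restore upstream.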
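For `relatedPapers`, a sketch of the on-demand lookup, assuming documents store only PMIDs and the browser resolves them against the public NCBI E-utilities esummary endpoint (app-ui may instead go through its own proxy or caching layer):

```ts
// Sketch: resolve stored PMIDs to paper summaries in the browser on demand.
// The direct E-utilities call is an assumption; a server-side proxy may be
// preferable in practice (CORS, rate limits, caching).
const EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi';

interface PaperSummary {
  pmid: string;
  title: string;
  authors: string[];
}

async function fetchPaperSummaries(pmids: string[]): Promise<PaperSummary[]> {
  const url = `${EUTILS}?db=pubmed&retmode=json&id=${pmids.join(',')}`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`esummary request failed: ${res.status}`);
  const body = await res.json();
  // esummary (retmode=json) keys each record by its PMID under `result`.
  return pmids.map(pmid => {
    const rec = body.result[pmid];
    return {
      pmid,
      title: rec.title,
      authors: (rec.authors || []).map((a: { name: string }) => a.name)
    };
  });
}

// Usage (with placeholder PMIDs):
// fetchPaperSummaries(['12345678', '23456789']).then(console.log);
```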
Related: https://github.com/PathwayCommons/factoid/issues/937#issuecomment-898502770
\* Dump archive: `factoid_dump_2024-06-19_14-28-33-767.tar.gz`
\*\* Actions applied to the Document and Element tables