I looked at the latest database dump for today (`factoid_dump_2021-12-07_08-28-44-555.tar.gz`) and did a sanity check on the amount of PubMed information being stored. Generally speaking, there are two categories:

- Document-level papers
- Network-level related papers: a maximum of N=10 papers is allowed for each

I ran a RethinkDB query to sum the number of papers for the document, its entities, its interactions, and the total for each document (plots below). The take-home message is that there isn't anything suspicious happening for any document.

Here's the RethinkDB query I used, just for the sake of reference:
```js
// Sum over all related Papers
r.db('factoid')
  .table('document')
  //.filter({ id: 'd57eaba8-05e7-4412-80bd-9a0295d075b1' })
  .filter({ status: 'public' })
  .map( function( document ){
    return document.merge({
      referencedPapers: document( 'referencedPapers' )( 'pmid' ).count().default(0),
      relatedPapers: document( 'relatedPapers' )( 'pmid' ).count().default(0)
    });
  })
  .merge( function( document ){
    return {
      docPapers: document( 'relatedPapers' ).add( document( 'referencedPapers' ) )
    };
  })
  .map( function( document ){
    return document.merge({ entries: document( 'entries' )( 'id' ) });
  })
  // interactions
  .merge( function( document ){
    return {
      interactions:
        r.db('factoid').table('element')
          .getAll( r.args( document( 'entries' ) ) )
          .coerceTo( 'array' )
          .pluck( 'id', 'association', 'type', 'name', 'entries', 'relatedPapers' )
          .filter( function( element ){
            return element('type').eq('ggp').not()
              .and( element('type').eq('dna').not() )
              .and( element('type').eq('rna').not() )
              .and( element('type').eq('protein').not() )
              .and( element('type').eq('chemical').not() )
              .and( element('type').eq('complex').not() );
          })
          .merge( function( interaction ){
            return {
              relatedPapers: interaction('relatedPapers').count().default(0)
            };
          })
          .pluck( 'id', 'relatedPapers' )
    };
  })
  .map( function( document ){
    return document.merge({ interactionPapers: document( 'interactions' )( 'relatedPapers' ).sum() });
  })
  // entities
  .merge( function( document ){
    return {
      entities:
        r.db('factoid').table('element')
          .getAll( r.args( document( 'entries' ) ) )
          .coerceTo( 'array' )
          .pluck( 'id', 'association', 'type', 'name', 'entries', 'relatedPapers' )
          .filter( function( element ){
            return element('type').eq('ggp')
              .or( element('type').eq('dna') )
              .or( element('type').eq('rna') )
              .or( element('type').eq('protein') )
              .or( element('type').eq('chemical') )
              .or( element('type').eq('complex') );
          })
          .merge( function( entity ){
            return {
              relatedPapers: entity('relatedPapers').count().default(0)
            };
          })
          .pluck( 'id', 'relatedPapers' )
    };
  })
  .map( function( document ){
    return document.merge({ entityPapers: document( 'entities' )( 'relatedPapers' ).sum() });
  })
  // totals
  .merge( function( document ){
    return {
      papers: document('docPapers').add( document('entityPapers') ).add( document('interactionPapers') )
    };
  })
  .orderBy( r.desc('papers') )
  .pluck( 'id', 'secret', 'docPapers', 'interactionPapers', 'entityPapers', 'status', 'papers', 'createdDate' )
```
One major cause of the large RethinkDB database dump size is the model attribute `_ops`: it tracks updates to the model (i.e. document, element) and can include the full data that was involved in each modification. Take, for instance, the latest dump of prod (`rethinkdb_dump_2021-12-19T19/27/23`) versus the same data with an empty `_ops` (no `_ops`):
File | rethinkdb_dump_2021-12-19T19/27/23 (MB) | no _ops (MB) |
---|---|---|
element.json | 566.3 | 29.7 |
document.json | 421 | 31.7 |
tar.gz (compressed) | 189.1 | 14.7 |
I think the ultimate source of the bulk is article data, and this can be exacerbated in cases where the article information for a model is updated (e.g. via the CRON job or manually), since the article data is appended to `_ops`.
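To see where that bulk sits, one option is to rank rows by the size of their `_ops` array. A minimal sketch (assuming `_ops` is stored as an array attribute on each row, which the empty-`_ops` comparison above suggests):

```js
// Rank documents by the number of _ops entries they carry (sketch only)
r.db('factoid')
  .table('document')
  .map( function( document ){
    return {
      id: document('id'),
      opsCount: document('_ops').count().default(0)
    };
  })
  .orderBy( r.desc('opsCount') )
  .limit(10)
```

The same query against the `element` table would point at the heaviest elements.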
Looking at the file sizes for a single Biofactoid entry/article (id: `7f1a6d34-2efa-4f45-bba3-f7e653ce9a96`) compared to just the `_ops` data:
File | total (MB) | _ops data (MB) |
---|---|---|
element.json | 1.8 | 1.3 |
document.json | 1.8 | 1.4 |
The ops could be pruned, but they're valuable data (automatic user action feedback) and storage is cheap.

Pruning could work like this: docs that are static can have their ops archived into separate files (e.g. `some-doc-id.json`) and removed from the main DB.

> The ops could be pruned, but they're valuable data (automatic user action feedback) and storage is cheap.

The issue I'm noticing is the growing amount of time my computer takes to restore (minutes), even with this tiny data.

> Pruning could work like this: docs that are static can have their ops archived into separate files (e.g. `some-doc-id.json`) and removed from the main DB.

OK, will consider.
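For concreteness, here is a rough sketch of the archive-then-prune idea using the official `rethinkdb` Node driver; treating `status: 'public'` as the marker for a static doc is an assumption, and only the `document` table is shown (the same would apply to `element`):

```js
// Sketch only: archive _ops for static (here: public) documents into per-doc
// JSON files, then clear _ops in the main DB.
const r = require('rethinkdb');
const fs = require('fs');

async function pruneOps() {
  const conn = await r.connect({ host: 'localhost', port: 28015 });

  // Fetch id + _ops for every public document
  const cursor = await r.db('factoid').table('document')
    .filter({ status: 'public' })
    .pluck('id', '_ops')
    .run(conn);
  const docs = await cursor.toArray();

  for (const doc of docs) {
    // Archive the ops to a separate file, e.g. some-doc-id.json
    fs.writeFileSync(`${doc.id}.json`, JSON.stringify(doc._ops || [], null, 2));

    // Remove the ops from the main DB
    await r.db('factoid').table('document')
      .get(doc.id)
      .update({ _ops: [] })
      .run(conn);
  }

  await conn.close();
}

pruneOps().catch(console.error);
```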
Not sure if it's related, but I noticed that `element`s removed from a `document` would still be referenced in `document._ops`; otherwise they would be completely orphaned, that is, not referenced by anything else.
I figure the backup process will produce something like this:

- the full dump (e.g. `factoid_dump_2021-01-07T00/00/00.tar.gz`)
- a dump with the `_ops` attribute reset to an empty array (e.g. `factoid_dump_noops_2021-01-07T00/00/00.tar.gz`)

As such, this belongs in https://github.com/PathwayCommons/rethinkdb-backup as a step post-dump; a rough sketch of that step follows.
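A sketch of what that post-dump step might look like, assuming the extracted dump contains `factoid/document.json` and `factoid/element.json` as plain JSON arrays of rows (the directory layout is an assumption, and very large table files would call for a streaming JSON parser instead of `JSON.parse`):

```js
// Sketch only: write a copy of each table file with _ops reset to an empty array
const fs = require('fs');
const path = require('path');

const DUMP_DIR = 'factoid';       // extracted dump directory (assumed layout)
const OUT_DIR = 'factoid_noops';  // where the slimmed copies go

fs.mkdirSync(OUT_DIR, { recursive: true });

for (const file of ['document.json', 'element.json']) {
  const rows = JSON.parse(fs.readFileSync(path.join(DUMP_DIR, file), 'utf8'));
  const slim = rows.map(row => ({ ...row, _ops: [] }));
  fs.writeFileSync(path.join(OUT_DIR, file), JSON.stringify(slim));
}
```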
Even after slimming down the RethinkDB database (as described above), it was experiencing the same old problems: increased resource usage and a larger dump size, and this was affecting performance. My running hypothesis for the underlying problem is the indiscriminate use of the CRON job to refresh the document data, in particular the PubMed-related info, which accumulates in the document object (i.e. `_ops`).
An empirical experiment: submit a single doc to a local instance with an empty database and run the CRON a few times, then look at the database dump archive size (17 documents; 1 status=public):
Condition | Size (MB) | Size (bytes) | Delta |
---|---|---|---|
Initial | 1.40 | 1395419 | |
Cron 1 | 1.77 | 1773977 | 27.1% |
Cron 2 | 2.16 | 2156386 | 21.6% |
Cron 3 | 2.53 | 2526959 | 17.2% |
Another experiment, this time populating a local instance with the Biofactoid data dump from March 1, 2022 (663 documents; 121 status=public):
Condition | Size (MB) | Size (bytes) | Delta |
---|---|---|---|
Base | 104.76 | 104759158 | |
*CRON 1 (ongoing) | 246.22 | 246224854 | 135% |
*Note that this CRON job had been running for about a day and had yet to finish (update: it finished in about 24 hours).
Take-home message: modify the CRON so that it is used very sparingly: refresh only those documents that really need it (e.g. those devoid of data, such as when someone added a paper title that's not yet in PubMed). A sketch of this is below.
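As a sketch of that, the CRON could select only the documents whose article metadata is still missing; the `citation`/`pmid` field names below are placeholders rather than the actual factoid schema:

```js
// Sketch only: pick out documents still lacking PubMed metadata instead of
// refreshing everything ('citation' and 'pmid' are placeholder field names)
r.db('factoid')
  .table('document')
  .filter({ status: 'public' })
  .filter( function( document ){
    return document('citation')('pmid').default(null).eq(null);
  })
  .pluck('id')
```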
On a related note: Do we need to store all of the PubMed information for each article in the database in this case?
The underlying issue is the rather large size of RethinkDB database dumps. I think there was an old issue where we looked at the reason for the size of the docs: https://github.com/PathwayCommons/factoid/issues/939
Originally posted by @jvwong in https://github.com/PathwayCommons/factoid/issues/937#issuecomment-898502770
https://github.com/PathwayCommons/factoid/issues/937#issuecomment-898788834
https://github.com/PathwayCommons/factoid/issues/937#issuecomment-898793784