PathwayCommons / factoid

A project to capture biological pathway data from academic papers
https://biofactoid.org
MIT License

Examining the database data / Root of large data dump size #1024

Closed · jvwong closed this issue 2 years ago

jvwong commented 2 years ago

On a related note: Do we need to store all of the PubMed information for each article in the database in this case?

The underlying issue is the rather large size of Rethinkdb database dumps. I think there was an old issue where we looked at the reason for the size of the docs: https://github.com/PathwayCommons/factoid/issues/939

Originally posted by @jvwong in https://github.com/PathwayCommons/factoid/issues/937#issuecomment-898502770

https://github.com/PathwayCommons/factoid/issues/937#issuecomment-898788834

If the results are cached, the user sees everything in the UI immediately (PubMed metadata, related papers, etc.). If queries are slow or unreliable, caching alleviates those issues and improves the UX. Some parts of the app might also assume that all submitted docs have full PubMed metadata.

For backups, I suppose we could specifically omit cached data. However, it could be slow or problematic to re-query for that info in the event that we need to restore a backup. This may be an area to use varied backup levels over time: the most recent backup may contain everything, but older backups may omit the caches.

For now, is this a high priority? Disk space is cheap; complexity and time are expensive.

https://github.com/PathwayCommons/factoid/issues/937#issuecomment-898793784

This is a low priority, but a first step would be to inventory the data stored and make sure it is real/important.
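As a minimal sketch of how cached data could be omitted from an export, assuming the cached PubMed data lives in the 'referencedPapers' and 'relatedPapers' fields discussed later in this thread (illustrative only, not the current backup tooling):

// Illustrative only: export documents without the cached PubMed fields
r.db('factoid')
  .table('document')
  .without('referencedPapers', 'relatedPapers')

A restore from such a slimmed export would then need to re-query PubMed to repopulate those fields, which is the slow/unreliable case noted above.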

jvwong commented 2 years ago

I looked at today's database dump (factoid_dump_2021-12-07_08-28-44-555.tar.gz) and did a sanity check on the amount of PubMed information being stored. Generally speaking, there are two categories:

  1. Document-level papers

    • a) 'referencedPapers', which are used to notify authors that their article was cited in a Biofactoid entry (max N=100)
    • b) 'relatedPapers', which consist of ranked PubMed papers (max N=10)
  2. Network-level related papers: a maximum of N=10 papers is allowed for each

    • a) 'entity'
    • b) 'interaction'

I ran a RethinkDB query to sum the number of papers for each document, its entities, and its interactions, along with the per-document total (plots below). The take-home message is that nothing suspicious is happening for any document.

[Screenshot: per-document paper counts, 2021-12-07]
jvwong commented 2 years ago

Here's the RethinkDB query I used, just for reference:

// Sum over all related Papers
r.db('factoid')
  .table('document')
  //.filter({ id: 'd57eaba8-05e7-4412-80bd-9a0295d075b1'})
  .filter({ status: 'public'})
  .map( function( document ){
    return document.merge({
      referencedPapers: document( 'referencedPapers' )( 'pmid' ).count().default(0),
      relatedPapers: document( 'relatedPapers' )( 'pmid' ).count().default(0)
    });
  })
  .merge( function( document ) {
    return {
      docPapers: document('relatedPapers').add( document( 'referencedPapers' ) ) 
    }
  })
  .map( function( document ){
    return document.merge({ entries: document( 'entries' )( 'id' ) });
  })

  // interactions  
  .merge( function( document ) {
    return {
      interactions:
        r.db('factoid').table('element')
          .getAll( r.args( document( 'entries' ) ) )
          .coerceTo( 'array' )
          .pluck( 'id', 'association', 'type', 'name', 'entries','relatedPapers' )
          .filter(function (element) {
            return  element('type').eq('ggp').not()
              .and( element('type').eq('dna').not() )
              .and( element('type').eq('rna').not() )
              .and( element('type').eq('protein').not() )
              .and( element('type').eq('chemical').not() )
              .and( element('type').eq('complex').not()  )
          })
          .merge( function( interaction ){
            return {
              relatedPapers: interaction('relatedPapers').count().default(0)
            }
          })
          .pluck('id', 'relatedPapers' )
    };
  })
  .map( function( document ){
    return document.merge({ interactionPapers: document( 'interactions' )( 'relatedPapers' ).sum() });
  })

  // entities  
  .merge( function( document ) {
    return {
      entities:
        r.db('factoid').table('element')
          .getAll( r.args( document( 'entries' ) ) )
          .coerceTo( 'array' )
          .pluck( 'id', 'association', 'type', 'name', 'entries','relatedPapers' )
          .filter(function (element) {
            return  element('type').eq('ggp')
              .or( element('type').eq('dna') )
              .or( element('type').eq('rna') )
              .or( element('type').eq('protein') )
              .or( element('type').eq('chemical') )
              .or( element('type').eq('complex') )
          })
          .merge( function( entity ){
            return {
              relatedPapers: entity('relatedPapers').count().default(0)
            }
          })
          .pluck('id', 'relatedPapers' )
    }
  })  
  .map( function( document ){
    return document.merge({ entityPapers: document( 'entities' )( 'relatedPapers' ).sum() });
  })

  // totals  
  .merge( function( document ) {
    return {
      papers: document('docPapers').add( document('entityPapers') ).add( document('interactionPapers') )
    }
  })       
  .orderBy(r.desc('papers'))
  .pluck('id','secret', 'docPapers', 'interactionPapers', 'entityPapers', 'status', 'papers', 'createdDate' )
jvwong commented 2 years ago

One major cause of the large RethinkDB database dump size is the model attribute _ops: it tracks updates to the model (i.e. document, element) and can include the full data involved in each modification.

Take, for instance, the latest dump of prod (rethinkdb_dump_2021-12-19T19/27/23) versus the same data with an empty _ops (no _ops):

| File | rethinkdb_dump_2021-12-19T19/27/23 (MB) | no _ops (MB) |
| --- | --- | --- |
| element.json | 566.3 | 29.7 |
| document.json | 421 | 31.7 |
| tar.gz (compressed) | 189.1 | 14.7 |

I think the ultimate source of the bulk is article data, and this is exacerbated when the article information for a model is updated (e.g. by the CRON job or manually), since the article data is appended to _ops.
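For reference, a hedged sketch of a query that could rank documents by how many _ops entries they carry, assuming _ops is stored as an array field on each row (the field name comes from the description above; the query itself is illustrative):

// Illustrative only: rank documents by the number of stored _ops entries
r.db('factoid')
  .table('document')
  .map( function( document ){
    return {
      id: document('id'),
      opsCount: document('_ops').count().default(0)
    };
  })
  .orderBy( r.desc('opsCount') )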

jvwong commented 2 years ago

Looking at the sizes of the dump files for a single Biofactoid entry/article (id: 7f1a6d34-2efa-4f45-bba3-f7e653ce9a96), compared to the size of the _ops data alone:

| File | total (MB) | _ops data (MB) |
| --- | --- | --- |
| element.json | 1.8 | 1.3 |
| document.json | 1.8 | 1.4 |
maxkfranz commented 2 years ago

The ops could be pruned, but they're valuable data (automatic user action feedback) and storage is cheap.

Pruning could work like this: Docs that are static can have their ops archived into separate files (e.g. some-doc-id.json) and removed from the main DB.
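A minimal sketch of how that pruning/archiving might look, assuming the official rethinkdb JavaScript driver, that _ops is an array field, and that status: 'public' stands in for "static" docs; the file paths and empty-array replacement are illustrative choices, not the project's actual tooling:

// Illustrative only: archive _ops for static docs into per-doc JSON files, then clear them
const fs = require('fs');
const r = require('rethinkdb');

const pruneOps = async () => {
  const conn = await r.connect({ host: 'localhost', port: 28015 });

  const cursor = await r.db('factoid').table('document')
    .filter({ status: 'public' }) // stand-in criterion for "static" docs
    .pluck('id', '_ops')
    .run( conn );
  const docs = await cursor.toArray();

  for ( const doc of docs ){
    // Archive the ops outside the database, keyed by document id
    fs.writeFileSync( `${doc.id}.json`, JSON.stringify( doc._ops || [] ) );

    // Clear the ops on the live record (update replaces array values rather than merging them)
    await r.db('factoid').table('document')
      .get( doc.id )
      .update({ _ops: [] })
      .run( conn );
  }

  await conn.close();
};

pruneOps();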

jvwong commented 2 years ago

> The ops could be pruned, but they're valuable data (automatic user action feedback) and storage is cheap.

The issue I'm noticing is the growing amount of time my computer takes to restore a dump (minutes), even with this tiny amount of data.

> Pruning could work like this: Docs that are static can have their ops archived into separate files (e.g. some-doc-id.json) and removed from the main DB.

OK, will consider.

Not sure if it's related, but I noticed that elements removed from a document are still referenced in document._ops; otherwise they would be completely orphaned, i.e. not referenced by anything else.

jvwong commented 2 years ago

I figure the backup process will include a pruning/archival step that runs after the database dump.

As such, this belongs in https://github.com/PathwayCommons/rethinkdb-backup as a step post-dump.
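A minimal sketch of what such a post-dump step might look like, assuming the dump archive has been extracted into per-table JSON files (element.json, document.json, as in the size tables above); the file handling is illustrative, not the rethinkdb-backup implementation:

// Illustrative only: strip _ops from extracted dump files before re-archiving
const fs = require('fs');

for ( const file of [ 'document.json', 'element.json' ] ){
  // Each exported table file is assumed to hold a JSON array of rows
  const rows = JSON.parse( fs.readFileSync( file, 'utf8' ) );

  // Drop each row's _ops history (it could be archived separately instead)
  const slimmed = rows.map( row => Object.assign( {}, row, { _ops: [] } ) );

  fs.writeFileSync( file, JSON.stringify( slimmed ) );
}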

jvwong commented 2 years ago

Even after slimming down the RethinkDB database (as described above), it was experiencing the same old problems: increased resource usage and a larger dump size, which affected performance. My running hypothesis for the underlying cause is the indiscriminate use of the CRON job to refresh document data, in particular the PubMed-related info, which accumulates in the document object (i.e. in _ops).

An empirical experiment: submitting a single doc to a local instance with an empty database and running the CRON a few times, then looking at the database dump archive size (17 documents; 1 status=public):

| Condition | Size (MB) | Size (bytes) | Delta |
| --- | --- | --- | --- |
| Initial | 1.40 | 1395419 | |
| Cron 1 | 1.77 | 1773977 | 27.1% |
| Cron 2 | 2.16 | 2156386 | 21.6% |
| Cron 3 | 2.53 | 2526959 | 17.2% |

Another experiment, this time populating a local instance with the Biofactoid data dump from March 1, 2022 (663 documents; 121 status=public):

| Condition | Size (MB) | Size (bytes) | Delta |
| --- | --- | --- | --- |
| Base | 104.76 | 104759158 | |
| *CRON 1 (ongoing) | 246.22 | 246224854 | 135% |

*Note that this CRON job had been running for about a day and had yet to finish (update: it finished in about 24 hours).


Take-home message: modify the CRON so that it is used very sparingly: refresh only those documents that really need it (e.g. documents devoid of data because someone added a paper title that's not yet in PubMed).
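A hedged sketch of how the CRON's selection could be narrowed, assuming a hypothetical 'article' field standing in for the cached PubMed record (the real field name may differ); only documents still missing that data would be picked up for refresh:

// Illustrative only: select just the public documents that still lack PubMed data
r.db('factoid')
  .table('document')
  .filter({ status: 'public' })
  .filter( function( document ){
    // 'article' is a hypothetical stand-in for the cached PubMed record
    return document.hasFields('article').not()
      .or( document('relatedPapers').count().default(0).eq(0) );
  })
  .pluck('id')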