WormBase / pseudoace

Modelling the WormBase ACeDB database in datomic.

Feature/homol import #82

Closed mgrbyte closed 4 years ago

mgrbyte commented 4 years ago

Implements #81 (No rush to merge this - prefer we test the db first; can make any amendments to this PR)

Converts homology data from ACeDB into a separate database, for the Motif and Protein classes only. This is accomplished by creating "stub" entities for all motif and protein objects in the new database, then converting the motif and homology "locatable" entities associated with each, using a variation of the original code in locatable_import.clj.

This requires adding five new commands to the migration process (though they could be run in parallel, or at any time that the source ACeDB database is available).

By default, this new datomic database is stored in the same DynamoDB table.
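
Because both databases live in the same DynamoDB table, their connection URIs would differ only in the final database-name segment. A minimal sketch, assuming hypothetical region, table and database names (none of these are taken from the PR):

```clojure
(require '[datomic.api :as d])

;; Hypothetical names, for illustration only: substitute the real
;; AWS region, DynamoDB table and database names.
(def main-uri  "datomic:ddb://us-east-1/example-table/main-db")
(def homol-uri "datomic:ddb://us-east-1/example-table/homol-db")

;; Take a database value from each connection; these play the role of
;; `mdb` and `hdb` in the example queries.
(def mdb (d/db (d/connect main-uri)))
(def hdb (d/db (d/connect homol-uri)))
```

Each database value can then be passed to `d/q` as a separate input.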

Below are some examples of using the new database.

(d/q '[:find (count ?e) .
       :where [?e :homology/protein _]]
     db)
;; => 17697895
(d/q '[:find (count ?e) .
       :where [?e :homology/motif _]]
     db)
;; => 1894084
;; Example query of the homology db:
(d/q '[:find ?pid ?method ?min ?max ?score
       :in $hdb ?pid
       :where
       [$hdb ?e :protein/id ?pid]
       [$hdb _ :homology/protein ?e]
       [$hdb ?lp :locatable/parent ?e]
       [$hdb ?lp :locatable/method ?mid]
       [$hdb ?mid :method/id ?method]
       [$hdb ?lp :locatable/min ?min]
       [$hdb ?lp :locatable/max ?max]
       [$hdb ?lp :locatable/score ?score]]
     hdb
     "RP01893")
;; ... results in (this output was captured with ?min bound to
;; :locatable/max, so both position columns show the max value):
#{["RP01893" "wublastp_human" 146 146 62.09691] ["RP01893" "wublastp_slimSwissProt" 149 149 65.39794] ["RP01893" "wublastp_fly" 152 152 71.69897] ["RP01893" "hmmpanther" 151 151 254.4] ["RP01893" "wublastp_brugia" 73 73 17.0] ["RP01893" "wublastp_slimSwissProt" 149 149 64.0] ["RP01893" "wublastp_slimSwissProt" 151 151 63.39794] ["RP01893" "wublastp_pristionchus" 149 149 82.52288] ["RP01893" "wublastp_ovolvulus" 142 142 67.09691] ["RP01893" "wublastp_remanei" 153 153 105.2218] ["RP01893" "pfam" 151 151 195.5] ["RP01893" "wublastp_brugia" 83 83 20.0] ["RP01893" "wublastp_brugia" 151 151 54.39794] ["RP01893" "wublastp_human" 131 131 58.69897] ["RP01893" "wublastp_human" 149 149 68.30103] ["RP01893" "wublastp_slimSwissProt" 150 150 61.69897] ["RP01893" "wublastp_brugia" 153 153 72.52288] ["RP01893" "wublastp_yeast" 150 150 61.04576] ["RP01893" "superfamily" 149 149 0.0] ["RP01893" "interpro" 151 151 0.0] ["RP01893" "wublastp_slimSwissProt" 152 152 64.04576] ["RP01893" "wublastp_yeast" 150 150 61.1549] ["RP01893" "wublastp_briggsae" 153 153 96.69897] ["RP01893" "wublastp_human" 131 131 62.30103] ["RP01893" "wublastp_slimSwissProt" 151 151 65.09691] ["RP01893" "wublastp_tmuris" 153 153 71.69897] ["RP01893" "wublastp_human" 150 150 72.0] ["RP01893" "wublastp_slimSwissProt" 149 149 60.69897] ["RP01893" "wublastp_slimSwissProt" 149 149 63.52288] ["RP01893" "wublastp_japonica" 150 150 96.52288] ["RP01893" "interpro" 150 150 0.0] ["RP01893" "wublastp_sratti" 149 149 81.30103] ["RP01893" "wublastp_worm" 153 153 105.1549] ["RP01893" "tigrfam" 150 150 224.4] ["RP01893" "wublastp_slimSwissProt" 150 150 71.39794]}

;; A query across both the main and homol dbs gives the same result,
;; but also verifies that the protein/id exists in both (join on the string id)
(d/q '[:find ?pid ?method ?min ?max ?score
       :in $mdb $hdb ?pid
       :where
       [$mdb ?ae :protein/id ?pid]
       [$hdb ?e :protein/id ?pid]
       [$hdb _ :homology/protein ?e]
       [$hdb ?lp :locatable/parent ?e]
       [$hdb ?lp :locatable/method ?mid]
       [$hdb ?mid :method/id ?method]
       [$hdb ?lp :locatable/min ?min]
       [$hdb ?lp :locatable/max ?max]
       [$hdb ?lp :locatable/score ?score]]
     mdb hdb
     "RP01893")
mgrbyte commented 4 years ago

> Multi-DB setup sounds good to me! It's quite a neat PR that you've put together.
>
> I'm just curious: if we continue to build Datomic databases from scratch each release, is it still preferable to separate the homology data from the rest?

I think so. This approach (separate db) was suggested by @khowe, but it is something that we've discussed before...

Currently, a subset of the data stored in the final/build ACeDB database is computed by running various EnsEMBL pipelines (Compara, BLASTP, etc.). Where possible, moving this computed data into a separate store makes sense, since adding it back into the main database "ties our hands" in the long term to making a build database.

It can be argued that datomic isn't the right database platform for storing such computed data, as none of the schema items have any useful history (and they should be marked with :db/noHistory in the schema for the new db, if not already).
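
To illustrate the :db/noHistory suggestion, an attribute definition might look like the following. This is only a sketch: the value type and cardinality shown for :locatable/score are assumptions, not copied from the pseudoace schema.

```clojure
;; Hypothetical attribute definition; the :db/noHistory flag is the point here.
{:db/ident       :locatable/score
 :db/valueType   :db.type/double      ; assumed type
 :db/cardinality :db.cardinality/one  ; assumed cardinality
 :db/noHistory   true}                ; ask Datomic not to retain past values
```

Turning history off like this would suit data that is regenerated wholesale for each release, since earlier values are never consulted.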

I believe the choice of using datomic here (instead of some other DB) is one of convenience, familiarity and expediency.

sibyl229 commented 4 years ago

@mgrbyte Thanks a lot for explaining the reasoning.

mgrbyte commented 4 years ago

@a8wright @sibyl229 Just to note that I'm currently doing another run of the main migration (WS273) with this code, just to check that these changes won't break subsequent migration runs. Once done, I'll merge this. Thanks!