EBISPOT / OLS

Ontology Lookup Service from SPOT at EBI
http://www.ebi.ac.uk/ols
Apache License 2.0
96 stars 40 forks source link

Reducing the database size of the Neo4j export #105

Closed dhimmel closed 3 years ago

dhimmel commented 8 years ago

The OLS team recently started hosting an archived copy of the Neo4j database on their FTP site (ftp://ftp.ebi.ac.uk/pub/databases/spot/ols/neo4j/). See https://github.com/EBISPOT/OLS/issues/104#issuecomment-236578233. The 5.7 GB database file took me 23 minutes to download on a 750 MBps download speed connection. While this file size is not the end of the world, it will slow down deployment and increase hosting.

I'm going to look more into the large database size and see if there are any ways to trim it down.

dhimmel commented 8 years ago

Currently available steps

  1. neostore.transaction.db have already been removed form the archive. See https://github.com/EBISPOT/OLS/issues/104#issuecomment-236627552.
  2. Currently, messages.log (including messages.log.1) takes up ~150 MB uncompressed. I believe these files can be safely removed. For future reference, in Neo4j 3.0, this log has been renamed to debug.log.
  3. Use xz rather than gzip compression. See https://github.com/EBISPOT/OLS/issues/104#issuecomment-236647924.
dhimmel commented 8 years ago

Large database files

I looked into how much space each database file consumed in its uncompressed state. Here are the largest files in the root database directory:

file size
neostore.propertystore.db.arrays 5.7G
neostore.propertystore.db.strings 3.8G
neostore.propertystore.db 2.6G
neostore.relationshipstore.db 565M
neostore.nodestore.db 167M

Other directories also contain several large files such as index/lucene/node/Resource.

My diagnosis is that the actual network (without properties) is maneagable: less than 1 GB uncompressed. The large database size seems to arise from:

  1. node and relationship properties, especially array and string properties
  2. property indexes

Both properties and their indexes can be super important, so they're not something we can just discard of. However, I was surprised by the large size of neostore.propertystore.db.arrays. Does anyone have an idea which nodes are have list properties that could be consuming lot's of storage? Or what specific properties are the big ones? It may be that there are a few low hanging fruits.

dhimmel commented 8 years ago

Identifying large node properties

I created the following query (thanks @keturn for help on the Neo4j slack) to find large properties.

MATCH (n)
WITH n
LIMIT 1000 // Remove to search all nodes
UNWIND keys(n) AS key
WITH
  n, key, n[key] AS value,
  reduce(concat = '', elem in n[key]| concat + toString(elem)) AS value_str
RETURN
  n.obo_id AS obo_id,
  n.label AS name,
  key,
  length(value_str) AS n_chars,
  value
ORDER BY n_chars DESC
LIMIT 50

Here's a table of the top 50 properties by the number of characters contained:

obo_id name key n_chars
Protein-protein interaction EBI-8401905 superClassDescription 459502
Protein-protein interaction EBI-8412660 superClassDescription 396269
Protein-protein interaction EBI-8312518 superClassDescription 376635
Protein-protein interaction EBI-8392519 superClassDescription 360240
Protein-protein interaction EBI-8419242 superClassDescription 326206
Protein-protein interaction EBI-8429350 superClassDescription 322924
Protein-protein interaction EBI-8333649 superClassDescription 298357
Protein-protein interaction EBI-8459929 superClassDescription 294775
Protein-protein interaction EBI-8408168 superClassDescription 292769
Protein-protein interaction EBI-8328780 superClassDescription 287235
Protein-protein interaction EBI-8357768 superClassDescription 266886
Protein-protein interaction EBI-8440340 superClassDescription 248893
Protein-protein interaction EBI-8338396 superClassDescription 245652
Protein-protein interaction EBI-8370015 superClassDescription 225302
Protein-protein interaction EBI-8387620 superClassDescription 224343
Protein-protein interaction EBI-8345298 superClassDescription 223653
Protein-protein interaction EBI-8383997 superClassDescription 222010
Protein-protein interaction EBI-8372752 superClassDescription 217132
Protein-protein interaction EBI-8375919 superClassDescription 211850
Protein-protein interaction EBI-8362519 superClassDescription 206605
Protein-protein interaction EBI-8350757 superClassDescription 191553
Protein-protein interaction EBI-776423 superClassDescription 190619
Protein-protein interaction EBI-8354201 superClassDescription 178127
Protein-protein interaction EBI-7950968 superClassDescription 174588
Protein-protein interaction EBI-8366098 superClassDescription 172892
Protein-protein interaction EBI-8381951 superClassDescription 167330
Protein-protein interaction EBI-7954144 superClassDescription 165116
GO:0016301 kinase activity obo_xref 164850
GO:0016301 kinase activity obo_xref 164850
GO:0016301 kinase activity obo_xref 164850
GO:0004674 protein serine/threonine kinase activity obo_xref 153200
GO:0004674 protein serine/threonine kinase activity obo_xref 153200
GO:0004674 protein serine/threonine kinase activity obo_xref 153200
Protein-protein interaction EBI-9919289 superClassDescription 153100
Protein-protein interaction EBI-3048562 superClassDescription 148971
Protein-protein interaction EBI-3046676 superClassDescription 145054
Protein-protein interaction EBI-9933573 superClassDescription 142205
Protein-protein interaction EBI-6191076 superClassDescription 141782
Protein-protein interaction EBI-9959939 superClassDescription 140887
Protein-protein interaction EBI-6191068 superClassDescription 138496
Protein-protein interaction EBI-9944837 superClassDescription 137658
Protein-protein interaction EBI-3050465 superClassDescription 132928
Protein-protein interaction EBI-6189915 superClassDescription 130972
Protein-protein interaction EBI-6285956 superClassDescription 130314
GAZ:00186143 Upper Mississippi River superClassDescription 130233
GAZ:00002594 Ohio River superClassDescription 126901
Orphanet:98878 Hemophilia A superClassDescription 123313
Protein-protein interaction EBI-9946542 superClassDescription 123233
GO:0004252 serine-type endopeptidase activity obo_xref 116882
GO:0004252 serine-type endopeptidase activity obo_xref 116882

These extreme properties contain as many characters as a whole book! It looks like the largest properties are either superClassDescriptions or obo_xrefs. I also saw a annotation-relatedMatch value which was super long. Note that the table has some duplicate rows, since a single obo_id can exist multiple times in the database.

Also what are these Protein-protein interaction EBI nodes? For reference, the IRI for Protein-protein interaction EBI-3050465 is http://identifiers.org/intact/EBI-3050465. It's not intuitive to me why a PPI would get encoded as a node rather than a relationship?

Does anyone have any thoughts on these large properties and whether they should be pruned upstream?

simonjupp commented 8 years ago

There's a new dump on the FTP that has xz compression. It's down to 2.7Gb.

I looked at the top entry, I don't see how the property value is so big. You can see the entry here http://www.ebi.ac.uk/ols/ontologies/cco/terms?iri=http%3A%2F%2Fidentifiers.org%2Fintact%2FEBI-8401905. I haven't looked directly in the database, but I don't understand your numbers.

dhimmel commented 8 years ago

There's a new dump on the FTP that has xz compression. It's down to 2.7Gb.

Nice! I see you also switched to international date format: https://github.com/EBISPOT/OLS/issues/104#issuecomment-236647924 :+1:.

I looked at the top entry, I don't see how the property value is so big. You can see the entry here. I haven't looked directly in the database, but I don't understand your numbers.

@simonjupp, I looked into the property further to help see what's going on. First, superClassDescription is an array property for Protein-protein interaction EBI-8401905 in CCO. I used the following query to return a table where each row is an element in the array.

MATCH (n:CCO)
WHERE n.label = 'Protein-protein interaction EBI-8401905'
UNWIND n.superClassDescription AS superClassDescription
RETURN length(superClassDescription) AS nchar, superClassDescription

Here is the first element:

<a href="http://semanticscience.org/resource/SIO_000139" class='ObjectProperty mansyntax' title="http://semanticscience.org/resource/SIO_000139">has&nbsp;agent</a> <span class='some'>some</span> <a href="http://identifiers.org/uniprot/Q8BVI4" class='mansyntax Class' title="http://identifiers.org/uniprot/Q8BVI4">DHPR_MOUSE</a>

Which if rendered, becomes: "has agent some DHPR_MOUSE". Each element contains ~328 characters and the array contains 1,403 elements resulting in the total of 459,502 characters for the property. The whole array contains highly similar elements. Another example would be "has agent some YBOX3_MOUSE".

simonjupp commented 8 years ago

ok that makes sense now. Not sure what I can do about it in the short term as this is something we use when rendering the term on a page in OLS. Although in this example the html is trivial, we have other cases where we need to render a complex OWL class description in Manchester syntax and it was much easier to generate the html for this at index time.

dhimmel commented 8 years ago

@simonjupp no worries. Let me gain some more experience for how you've encoded ontologies in the Neo4j database model. Then I'll be able to give some more concrete thoughts. I may also try to reduce your database to a barebones model, which will be more friendly for browser visualization.

simonjupp commented 8 years ago

OK, in case you missed it there is some additional documentation about the schema here http://www.ebi.ac.uk/ols/docs/neo4j-schema

jamesamcl commented 3 years ago

Closing due to inactivity; we can always reopen if this is still an issue