Closed dhimmel closed 3 years ago
neostore.transaction.db
have already been removed form the archive. See https://github.com/EBISPOT/OLS/issues/104#issuecomment-236627552.messages.log
(including messages.log.1
) takes up ~150 MB uncompressed. I believe these files can be safely removed. For future reference, in Neo4j 3.0, this log has been renamed to debug.log
.I looked into how much space each database file consumed in its uncompressed state. Here are the largest files in the root database directory:
file | size |
---|---|
neostore.propertystore.db.arrays | 5.7G |
neostore.propertystore.db.strings | 3.8G |
neostore.propertystore.db | 2.6G |
neostore.relationshipstore.db | 565M |
neostore.nodestore.db | 167M |
Other directories also contain several large files such as index/lucene/node/Resource
.
My diagnosis is that the actual network (without properties) is maneagable: less than 1 GB uncompressed. The large database size seems to arise from:
Both properties and their indexes can be super important, so they're not something we can just discard of. However, I was surprised by the large size of neostore.propertystore.db.arrays
. Does anyone have an idea which nodes are have list properties that could be consuming lot's of storage? Or what specific properties are the big ones? It may be that there are a few low hanging fruits.
I created the following query (thanks @keturn for help on the Neo4j slack) to find large properties.
MATCH (n)
WITH n
LIMIT 1000 // Remove to search all nodes
UNWIND keys(n) AS key
WITH
n, key, n[key] AS value,
reduce(concat = '', elem in n[key]| concat + toString(elem)) AS value_str
RETURN
n.obo_id AS obo_id,
n.label AS name,
key,
length(value_str) AS n_chars,
value
ORDER BY n_chars DESC
LIMIT 50
Here's a table of the top 50 properties by the number of characters contained:
obo_id | name | key | n_chars |
---|---|---|---|
Protein-protein interaction EBI-8401905 | superClassDescription | 459502 | |
Protein-protein interaction EBI-8412660 | superClassDescription | 396269 | |
Protein-protein interaction EBI-8312518 | superClassDescription | 376635 | |
Protein-protein interaction EBI-8392519 | superClassDescription | 360240 | |
Protein-protein interaction EBI-8419242 | superClassDescription | 326206 | |
Protein-protein interaction EBI-8429350 | superClassDescription | 322924 | |
Protein-protein interaction EBI-8333649 | superClassDescription | 298357 | |
Protein-protein interaction EBI-8459929 | superClassDescription | 294775 | |
Protein-protein interaction EBI-8408168 | superClassDescription | 292769 | |
Protein-protein interaction EBI-8328780 | superClassDescription | 287235 | |
Protein-protein interaction EBI-8357768 | superClassDescription | 266886 | |
Protein-protein interaction EBI-8440340 | superClassDescription | 248893 | |
Protein-protein interaction EBI-8338396 | superClassDescription | 245652 | |
Protein-protein interaction EBI-8370015 | superClassDescription | 225302 | |
Protein-protein interaction EBI-8387620 | superClassDescription | 224343 | |
Protein-protein interaction EBI-8345298 | superClassDescription | 223653 | |
Protein-protein interaction EBI-8383997 | superClassDescription | 222010 | |
Protein-protein interaction EBI-8372752 | superClassDescription | 217132 | |
Protein-protein interaction EBI-8375919 | superClassDescription | 211850 | |
Protein-protein interaction EBI-8362519 | superClassDescription | 206605 | |
Protein-protein interaction EBI-8350757 | superClassDescription | 191553 | |
Protein-protein interaction EBI-776423 | superClassDescription | 190619 | |
Protein-protein interaction EBI-8354201 | superClassDescription | 178127 | |
Protein-protein interaction EBI-7950968 | superClassDescription | 174588 | |
Protein-protein interaction EBI-8366098 | superClassDescription | 172892 | |
Protein-protein interaction EBI-8381951 | superClassDescription | 167330 | |
Protein-protein interaction EBI-7954144 | superClassDescription | 165116 | |
GO:0016301 | kinase activity | obo_xref | 164850 |
GO:0016301 | kinase activity | obo_xref | 164850 |
GO:0016301 | kinase activity | obo_xref | 164850 |
GO:0004674 | protein serine/threonine kinase activity | obo_xref | 153200 |
GO:0004674 | protein serine/threonine kinase activity | obo_xref | 153200 |
GO:0004674 | protein serine/threonine kinase activity | obo_xref | 153200 |
Protein-protein interaction EBI-9919289 | superClassDescription | 153100 | |
Protein-protein interaction EBI-3048562 | superClassDescription | 148971 | |
Protein-protein interaction EBI-3046676 | superClassDescription | 145054 | |
Protein-protein interaction EBI-9933573 | superClassDescription | 142205 | |
Protein-protein interaction EBI-6191076 | superClassDescription | 141782 | |
Protein-protein interaction EBI-9959939 | superClassDescription | 140887 | |
Protein-protein interaction EBI-6191068 | superClassDescription | 138496 | |
Protein-protein interaction EBI-9944837 | superClassDescription | 137658 | |
Protein-protein interaction EBI-3050465 | superClassDescription | 132928 | |
Protein-protein interaction EBI-6189915 | superClassDescription | 130972 | |
Protein-protein interaction EBI-6285956 | superClassDescription | 130314 | |
GAZ:00186143 | Upper Mississippi River | superClassDescription | 130233 |
GAZ:00002594 | Ohio River | superClassDescription | 126901 |
Orphanet:98878 | Hemophilia A | superClassDescription | 123313 |
Protein-protein interaction EBI-9946542 | superClassDescription | 123233 | |
GO:0004252 | serine-type endopeptidase activity | obo_xref | 116882 |
GO:0004252 | serine-type endopeptidase activity | obo_xref | 116882 |
These extreme properties contain as many characters as a whole book! It looks like the largest properties are either superClassDescription
s or obo_xref
s. I also saw a annotation-relatedMatch
value which was super long. Note that the table has some duplicate rows, since a single obo_id can exist multiple times in the database.
Also what are these Protein-protein interaction EBI
nodes? For reference, the IRI for Protein-protein interaction EBI-3050465
is http://identifiers.org/intact/EBI-3050465. It's not intuitive to me why a PPI would get encoded as a node rather than a relationship?
Does anyone have any thoughts on these large properties and whether they should be pruned upstream?
There's a new dump on the FTP that has xz compression. It's down to 2.7Gb.
I looked at the top entry, I don't see how the property value is so big. You can see the entry here http://www.ebi.ac.uk/ols/ontologies/cco/terms?iri=http%3A%2F%2Fidentifiers.org%2Fintact%2FEBI-8401905. I haven't looked directly in the database, but I don't understand your numbers.
There's a new dump on the FTP that has xz compression. It's down to 2.7Gb.
Nice! I see you also switched to international date format: https://github.com/EBISPOT/OLS/issues/104#issuecomment-236647924 :+1:.
I looked at the top entry, I don't see how the property value is so big. You can see the entry here. I haven't looked directly in the database, but I don't understand your numbers.
@simonjupp, I looked into the property further to help see what's going on. First, superClassDescription
is an array property for Protein-protein interaction EBI-8401905
in CCO. I used the following query to return a table where each row is an element in the array.
MATCH (n:CCO)
WHERE n.label = 'Protein-protein interaction EBI-8401905'
UNWIND n.superClassDescription AS superClassDescription
RETURN length(superClassDescription) AS nchar, superClassDescription
Here is the first element:
<a href="http://semanticscience.org/resource/SIO_000139" class='ObjectProperty mansyntax' title="http://semanticscience.org/resource/SIO_000139">has agent</a> <span class='some'>some</span> <a href="http://identifiers.org/uniprot/Q8BVI4" class='mansyntax Class' title="http://identifiers.org/uniprot/Q8BVI4">DHPR_MOUSE</a>
Which if rendered, becomes: "has agent some DHPR_MOUSE". Each element contains ~328 characters and the array contains 1,403 elements resulting in the total of 459,502 characters for the property. The whole array contains highly similar elements. Another example would be "has agent some YBOX3_MOUSE".
ok that makes sense now. Not sure what I can do about it in the short term as this is something we use when rendering the term on a page in OLS. Although in this example the html is trivial, we have other cases where we need to render a complex OWL class description in Manchester syntax and it was much easier to generate the html for this at index time.
@simonjupp no worries. Let me gain some more experience for how you've encoded ontologies in the Neo4j database model. Then I'll be able to give some more concrete thoughts. I may also try to reduce your database to a barebones model, which will be more friendly for browser visualization.
OK, in case you missed it there is some additional documentation about the schema here http://www.ebi.ac.uk/ols/docs/neo4j-schema
Closing due to inactivity; we can always reopen if this is still an issue
The OLS team recently started hosting an archived copy of the Neo4j database on their FTP site (
ftp://ftp.ebi.ac.uk/pub/databases/spot/ols/neo4j/
). See https://github.com/EBISPOT/OLS/issues/104#issuecomment-236578233. The 5.7 GB database file took me 23 minutes to download on a 750 MBps download speed connection. While this file size is not the end of the world, it will slow down deployment and increase hosting.I'm going to look more into the large database size and see if there are any ways to trim it down.