eltonfss commented 1 year ago

Version

4.4.0

Question

This question has been also published at:

Stack Overflow: https://stackoverflow.com/questions/75264889/why-does-the-ospg-dat-file-grows-so-much-more-than-all-other-files
Mailing List: https://lists.apache.org/thread/jxcfhkly7781k8hnw2qdy09fbj3xych8

Scenario Description (Context)

I'm running Jena Fuseki Version 4.4.0 as a container on an OpenShift Cluster.

OS Version Info (cat /etc/os-release):
NAME="Red Hat Enterprise Linux"
VERSION="8.5 (Ootpa)"
ID="rhel"
ID_LIKE="fedora" ="8.5"
...

Hardware Info (from Jena Fuseki initialization log):

[2023-01-27 20:08:59] Server INFO Memory: 32.0 GiB
[2023-01-27 20:08:59] Server INFO Java: 11.0.14.1
[2023-01-27 20:08:59] Server INFO OS: Linux 3.10.0-1160.76.1.el7.x86_64 amd64
[2023-01-27 20:08:59] Server INFO PID: 1

Disk Info (df -h):

Filesystem Size Used Avail Use% Mounted on
overlay 99G 76G 18G 82% /
tmpfs 64M 0 64M 0% /dev
tmpfs 63G 0 63G 0% /sys/fs/cgroup
shm 64M 0 64M 0% /dev/shm
/dev/mapper/docker_data 99G 76G 18G 82% /config
/data 1.0T 677G 348G 67% /usr/app/run
tmpfs 40G 24K 40G 1%

My dataset is built using TDB2, and currently has the following RDF Stats: · Triples: 65KK (Approximately 65 million) · Subjects: ~20KK (Aproximately 20 million) · Objects: ~8KK (Aproximately 8 million) · Graphs: ~213K (Aproximately 213 thousand) · Predicates: 153

The files corresponding to this dataset alone on disk sum up to approximately 671GB (measured with du -h). From these, the largest files are:

· /usr/app/run/databases/my-dataset/Data-0001/OSPG.dat: 243GB
· /usr/app/run/databases/my-dataset/Data-0001/nodes.dat: 76GB
· /usr/app/run/databases/my-dataset/Data-0001/POSG.dat: 35GB
· /usr/app/run/databases/my-dataset/Data-0001/nodes.idn: 33GB
· /usr/app/run/databases/my-dataset/Data-0001/POSG.idn: 29GB
· /usr/app/run/databases/my-dataset/Data-0001/OSPG.idn: 27GB

Main Questions

Q1: I've been using Jena for quite some time now and I'm well aware that its indexes grow significantly during usage, specially when triples are being added across multiple requests (transactional workloads). What is the main reason for this? Are the indexes being replicated somehow?
Q2: I've looked into several documentation pages, source code, forums, ... nowhere I was able to find some explanation to why OSPG.dat is so much larger than all other files. Is there a reasonable explanation for this based on the content of the dataset or the way it was generated?
Q3: Could this be an indexing bug within TDB2? Should it be solved by upgrading to Jena 4.7.0?

Appendix

Assembler configuration for my dataset:

@prefix : http://base/# .
@prefix fuseki: http://jena.apache.org/fuseki# .
@prefix ja: http://jena.hpl.hp.com/2005/11/Assembler# .
@prefix rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# .
@prefix rdfs: http://www.w3.org/2000/01/rdf-schema# .
@prefix root: http://dev-test-jena-fuseki/$/datasets .
@prefix tdb2: http://jena.apache.org/2016/tdb# .

tdb2:GraphTDB rdfs:subClassOf ja:Model .

ja:ModelRDFS rdfs:subClassOf ja:Model .

ja:RDFDatasetSink rdfs:subClassOf ja:RDFDataset .

http://jena.hpl.hp.com/2008/tdb#DatasetTDB
rdfs:subClassOf ja:RDFDataset .

tdb2:GraphTDB2 rdfs:subClassOf ja:Model .

http://jena.apache.org/text#TextDataset
rdfs:subClassOf ja:RDFDataset .

ja:RDFDatasetZero rdfs:subClassOf ja:RDFDataset .

:service_tdb_my-dataset
rdf:type fuseki:Service ;
rdfs:label "TDB my-dataset" ;
fuseki:dataset :ds_my-dataset ;
fuseki:name "my-dataset" ;
fuseki:serviceQuery "sparql" , "query" ;
fuseki:serviceReadGraphStore "get" ;
fuseki:serviceReadWriteGraphStore
"data" ;
fuseki:serviceUpdate "update" ;
fuseki:serviceUpload "upload" .

ja:ViewGraph rdfs:subClassOf ja:Model .

ja:GraphRDFS rdfs:subClassOf ja:Model .

tdb2:DatasetTDB rdfs:subClassOf ja:RDFDataset .

http://jena.hpl.hp.com/2008/tdb#GraphTDB
rdfs:subClassOf ja:Model .

ja:DatasetTxnMem rdfs:subClassOf ja:RDFDataset .

tdb2:DatasetTDB2 rdfs:subClassOf ja:RDFDataset .

ja:RDFDatasetOne rdfs:subClassOf ja:RDFDataset .

ja:MemoryDataset rdfs:subClassOf ja:RDFDataset .

ja:DatasetRDFS rdfs:subClassOf ja:RDFDataset .

:ds_my-dataset rdf:type tdb2:DatasetTDB2 ;
tdb2:location "run/databases/my-dataset" ;
tdb2:unionDefaultGraph true ;
ja:context \[ ja:cxtName "arq:optFilterPlacement" ;
ja:cxtValue "false"
\] .

My Dataset Compression experiment

After getting some feedbacks from Jena support through the Mailing List, I've tried to run two compression strategies on this dataset to see which one would work best. The one I'm referring to as "official" is the one that uses the "/$/compact" endpoint and the one I'm referring to as "unofficial" is the one where I create an NQuads backup and upload it to a new dataset using the TDBLoader. The reason I attempted this second strategy is because a StackOverflow post suggested that it could be significantly more efficient than the "official" strategy (https://stackoverflow.com/questions/60501386/compacting-a-dataset-in-apache-jena-fuseki/60631699#60631699).

Here is a summary of the results I've obtained with both compression strategies (in markdown notation):

Original Dataset

RDF Stats:

Triples: 65222513 (Approximately 65 million)
Subjects: 20434264 (Aproximately 20 million)
Objects: 8565221 (Aproximately 8 million)
Graphs: 213531 (Aproximately 213 thousand)
Predicates: 153

Disk Stats:

my-dataset/Data-0001: 671GB
my-dataset/Data-0001/OSPG.dat: 243GB
my-dataset/Data-0001/nodes.dat: 76GB
my-dataset/Data-0001/POSG.dat: 35GB
my-dataset/Data-0001/nodes.idn: 33GB
my-dataset/Data-0001/POSG.idn: 29GB
my-dataset/Data-0001/OSPG.idn: 27GB
...

Dataset Replica ("unofficial" compression strategy)

Description: Backed up dataset as NQuads and Restore it as a new dataset with TDBLoader.

References:

RDF Stats:

Triples: 65222513 (Approximately 65 million)
Subjects: 20434264 (Aproximately 20 million)
Objects: 8565221 (Aproximately 8 million)
Graphs: 213531 (Aproximately 213 thousand)
Predicates: 153

Disk Stats:

my-dataset-replica/Data-0001: 23GB
my-dataset-replica/Data-0001/OSPG.dat: 3.5GB
my-dataset-replica/Data-0001/nodes.dat: 680MB
my-dataset-replica/Data-0001/POSG.dat: 3.6GB
my-dataset-replica/Data-0001/nodes.idn: 8.0M
my-dataset-replica/Data-0001/POSG.idn: 32M
my-dataset-replica/Data-0001/OSPG.idn: 32M
...

Compressed Dataset ("official" compression strategy)

Description: Compressed using /$/compact/ endpoint generating a new Data-NNNN folder within the same dataset.

References:

https://jena.apache.org/documentation/tdb2/tdb2_admin.html#compaction

RDF Stats:

Triples: 65222513 (Approximately 65 million)
Subjects: 20434264 (Aproximately 20 million)
Objects: 8565221 (Aproximately 8 million)
Graphs: 213531 (Aproximately 213 thousand)
Predicates: 153

Disk Stats:

my-dataset/Data-0002: 23GB
my-dataset/Data-0002/OSPG.dat: 3.7GB
my-dataset/Data-0002/nodes.dat: 680MB
my-dataset/Data-0002/POSG.dat: 3.8GB
my-dataset/Data-0002/nodes.idn: 8.0M
my-dataset/Data-0002/POSG.idn: 40M
my-dataset/Data-0002/OSPG.idn: 32M
...

Comparison

RDF Stats:

Triples: Same Count
Subjects: Same Count
Objects: Same Count
Graphs: Same Count
Predicates: Same Count

Disk Stats:

Total Space: ~29x reduction with both strategies
OSPG.dat: ~69x reduction with replication and ~65x reduction with compression
nodes.dat: ~111x reduction with both strategies
POSG.dat: ~9,7x reduction with replication and ~7,6x reduction with compression
nodes.idn: ~4125x reduction with both strategies
POSG.idn: ~906x reduction with replication and ~725x reduction with compression
OSPG.idn: ~843,75 reduction with both strategies

Queries used to obtain the RDF Stats

Triples

SELECT (COUNT(*) as ?count)
WHERE {
  GRAPH ?graph {
    ?subject ?predicate ?object
  }
}

Graphs

SELECT (COUNT(DISTINCT ?graph) as ?count)
WHERE {
  GRAPH ?graph {
    ?subject ?predicate ?object
  }
}

Subjects

SELECT (COUNT(DISTINCT ?subject) as ?count)
WHERE {
  GRAPH ?graph {
    ?subject ?predicate ?object
  }
}

Predicates

SELECT (COUNT(DISTINCT ?predicate) as ?count)
WHERE {
  GRAPH ?graph {
    ?subject ?predicate ?object
  }
}

Objects

SELECT (COUNT(DISTINCT ?object) as ?count)
WHERE {
  GRAPH ?graph {
    ?subject ?predicate ?object
  }
}

Comands used to measure the Disk Stats

File Sizes

ls -lh --sort=size

Directory Sizes

du -h

afs commented 1 year ago

Could this be an indexing bug within TDB2?

Highly unlikely. (Much more likely that host environment is reporting sizes inconsistently which happens.)

Answered on https://lists.apache.org/thread/jxcfhkly7781k8hnw2qdy09fbj3xych8

The solution is to run compaction occasionally then your files are 3.5GB to 4GB.

All the indexes contain the same information, in a different order. The size variation is down to how the B+trees split.

An external process interfering with the files is a more likely cause. The TDB file locking can not ensure that the host has not had a process that has messed with the files.

Should it be solved by upgrading to Jena 4.7.0? Asking the same question (and not incorporating the answers) will not help you.

4.7.0 wouldn't change the growth situation - it does make compaction in a live server more reliable.

kinow commented 1 year ago

@afs should we move this to a discussion?

afs commented 1 year ago

Whatever. I don't have anything to add.

I hope the OP expectation is not that there is some support team to respond to users.

eltonfss commented 1 year ago

Dear @afs and @kinow,

The intention of creating this issue (after posting on StackOverflow and Mailing List) was to make the documentation of this case as accessible as possible, in case someone else has the same issue or a different perspective on the why it occurred and how it could be solved.

That is why I also added the StackOverflow and Mailing List links right at the beginning, so anyone looking at this could have the full picture of what was discussed.

The hypothesis of the index being corrupted by an external process could be true, if someone tried to attach the same volume to another container for backup purposes for example. I'll try to investigate if that occurred in this particular case.

Nonetheless, if there was some other possible cause for the increased OSPG.dat growth, such as a particular triple update pattern, we would be able to investigate ways in which we could change our system to avoid that.

My apologies if this issue sounded flooding or implied that we were expecting some kind of support. All we seek is shared understanding and finding the best solution.

Many thanks for your help!

afs commented 1 year ago

@eltonfss The reports so far haven't described your usage.

afs commented 1 year ago

This discussion seems to have concluded. The advice on the email was to run a compaction.

apache / jena

Why does the OSPG.dat file grows so much? #1735

Version

Question

Scenario Description (Context)

Main Questions

Appendix

Assembler configuration for my dataset:

My Dataset Compression experiment

Original Dataset

Dataset Replica ("unofficial" compression strategy)

Compressed Dataset ("official" compression strategy)

Comparison

Queries used to obtain the RDF Stats

Triples

Graphs

Subjects

Predicates

Objects

Comands used to measure the Disk Stats

File Sizes

Directory Sizes