Dictionary Performance - Githubissues

GoogleCodeExporter commented 8 years ago

SimpleCassandraMapDictionary has a terrible performance. This, in turn, leads a 
bad performance for RDF insert operations.

Original issue reported on code.google.com by andreas.josef.wagner on 22 Jan 2014 at 8:32

GoogleCodeExporter commented 8 years ago

Some pointers:
* http://www.w3.org/Submission/2011/SUBM-HDT-20110330/
* https://code.google.com/p/hdt-it/
=> see also [2]

Some papers:
* Compression of RDF Dictionaries [1]
* Binary RDF for scalable publishing, exchanging and consumption in the web of 
data [2]
* TripleBit: a Fast and Compact System for Large Scale RDF Data, Sect. 3.1, [3]

[1] http://dataweb.infor.uva.es/wp-content/uploads/2011/10/sac2012.pdf
[2] http://dataweb.infor.uva.es/wp-content/uploads/2012/03/WWW12-copy.pdf
[3] http://www.vldb.org/pvldb/vol6/p517-yuan.pdf

Original comment by andreas.josef.wagner on 19 Mar 2014 at 5:34

GoogleCodeExporter commented 8 years ago

Hm ... [3] seems to use the same dictionary approach as "Scalable join 
processing on very large RDF graphs" [4] 

- Andreas

[4] http://www.dblab.ntua.gr/~gtsat/collection/RDF%20stores/ScalableJoinsRDF.pdf

Original comment by andreas.josef.wagner on 19 Mar 2014 at 5:47

GoogleCodeExporter commented 8 years ago

Hi guys,
I did some commits on top of issue #19. Other than some minor optimizations
about

   - checkstyle
   - cobertura (additional test cases)
   - instance reusing
      -  ISharedConstants.INCLUDE_ALL_COMPOSITE_LOWER_BOUND /
      INCLUDE_ALL_COMPOSITE_HIGHER_BOUND: previously for each query two new
      Composite() instances were created. Since these are used only as
"readonly"
      criteria for determining bounds, now we have just 2 instances reused
      everywhere. Of course they cannot be used in case you need to chnage the
      state of the composites (like for example adding components).

   - SliceFilter: basically the same consideration; i removed the local
      instances created each time a query is executed (TripleStore,
QuadStore and
      CAndPCSlicesQueryIterator) in favour of shared instances so also here we
      have a lot of gain in terms of heap (young generation)

The most important news about dictionary are

1) Renaming
According with Andreas,

   - SimpleCassandraMapDictionary now is ReadOptimizedDictionary
   - InMemoryDictionary is WriteOptimizedDictionary

2) Dictionary interface, additional methods

For some column family, row key is composite, that is, is composed by two
or three identifiers (byte[]). The rules for encoding and decoding these
composite identifiers  were previously in different points in the code. So
for example some iterator had hard-coded something like

byte [] predicate = new byte[17];

Now, after introducing the second dictionary, those classes no longer
worked anymore because, for example, WriteOptimizedDictionary uses
variable-length identifiers (ReadOptimizedDictionary uses fixed length
identifiers). So, at the end, the Dictionary interface includes now all
methods for composing and decomposing those composite identifiers.

I thing now the interface is more consistent because it includes ALL what
CumulusRDF expects from a dictionary in term of services.

3) Class hierarchy

Base classes are in package *edu.kit.aifb.cumulus.store.dict*

   - INodeDictionary: dictionary interfaces; here I moved all methods
   needed from a client perspective. Previously some of those methods were in
   NodeDictionaryBase so clients (e.g. Store classes) had to use
   NodeDictionaryBase instead of INodeDictionary.
   - NodeDictionaryBase: abstract superlayer type containing shared and
   common behaviour between all concrete dictionaries*;*
   - NodeDictionaryWithCaching: extends NodeDictionaryBase and adds ids /
   values caching*;*

Concrete classes are in package *edu.kit.aifb.cumulus.store.dict.impl*

   - ReadOptimizedDictionary: previous SimpleCassandraMapDictionary. This
   class is still in progress because things discussed in this thread;
   - WriteOprimizedDictionary: a (dummy) dictionary using string encoding
   for generating identifiers. TODO: long literals handling as discussed in
   this thread.

4) Configurability
Dictionary can be injected when the store instance is created. It defaults
to ReadptmizedDictionary

Best,
Andrea

2014-03-19 18:47 GMT+01:00 <cumulusrdf@googlecode.com>:

Original comment by a.gazzarini@gmail.com on 23 Mar 2014 at 6:34

GoogleCodeExporter commented 8 years ago

Pointer for the read-optimized dictionary:

(1) As Andrea (in some previous email) already suggested, I also think we 
should use a layer of "approximate membership queries" [2] as a form of filter 
over the storage (cassandra). This way, we should be able to avoid many 
unnecessary cassandra probes. For the approximate membership queries we could 
either use the traditional BloomFilters (e.g., the BloomFilters in the Google 
Guave project [4]) or something more exotic like the Quotient filter [3]. For 
the latter, I found an implementation in [5]. However, BloomFilters are 
probably more "production-ready" than e.g. [5]. The only thing is, we have to 
de/serialize the BloomFilters in case the store closes ... Otherwise we can't 
reuse the BloomFilters properly. This is unfortunately a bit messy ...

(2) We should be a bit more careful with using putQuickBuffered-method in 
edu.kit.aifb.cumulus.util.hector.Index. If the store shuts down (for whatever 
reason), the memory-based caches will be lost. Therefore the data encoded with 
these IDs is not usable any more. So, having a "large" caches comes with a 
great risk.

Overall, I think we should rather spend some more memory on the "approximate 
membership queries" (e.g., bloom filters), than on the caches. Maybe don't use 
caches at all for the dictionaries (this would be the safest choice) ...

Kind regards
Andreas

[1] https://code.google.com/p/cumulusrdf/issues/detail?id=19
[2] http://en.wikipedia.org/wiki/File:Bloom_filter_speed.svg
[3] http://en.wikipedia.org/wiki/Quotient_filter
[4] http://code.google.com/p/guava-libraries/wiki/HashingExplained#BloomFilter
[5] https://github.com/Vyzen/trout

Original comment by andreas.josef.wagner on 31 Mar 2014 at 10:15

GoogleCodeExporter commented 8 years ago

Sorry I forgot: this issue has been fixed in 1.1.0

Original comment by a.gazzarini@gmail.com on 30 Apr 2014 at 3:52

Changed state: Fixed

farhana-haque / cumulusrdf

Dictionary Performance #19