gchq / Gaffer

A large-scale entity and relation database supporting aggregation of properties
Apache License 2.0

Update examples and documentation for sketches library #1067

Closed gaffer01 closed 6 years ago

gaffer01 commented 7 years ago

The examples and documentation for the sketches library should be updated to:

As part of this the properties guide could be split up into subsections so that it's more obvious that there are different libraries of properties available (e.g. sketches, timestamps). It would also be worth noting in the properties guide that any Java object can be used as a property (although some stores may require that a serialiser is provided), and possibly provide an example of how to write a serialiser.

p013570 commented 6 years ago

The updated Properties page for the wiki is:

Copyright 2016-2017 Crown Copyright

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This page has been generated from code. To make any changes please update the walkthrough docs in the doc module, run it and replace the content of this page with the output.

  1. Introduction
  2. Running the Examples
  3. Simple properties
  4. Sketches
  5. Timestamps
  6. Walkthroughs
    1. HyperLogLogPlus
    2. HllSketch
    3. LongsSketch
    4. DoublesSketch
    5. ReservoirItemsSketch
    6. ThetaSketch
    7. RBMBackedTimestampSet
    8. BoundedTimestampSet
  7. Predicates, aggregators and serialisers
    1. String
    2. Long
    3. Integer
    4. Double
    5. Float
    6. Byte[]
    7. Boolean
    8. Date
    9. TypeValue
    10. TypeSubTypeValue
    11. FreqMap
    12. HashMap
    13. TreeSet
    14. HyperLogLogPlus
    15. HllSketch
    16. LongsSketch
    17. DoublesSketch
    18. ReservoirItemsSketch
    19. Sketch
    20. RBMBackedTimestampSet
    21. BoundedTimestampSet

Introduction

Gaffer allows properties to be stored on Entities and Edges. As well as simple properties, such as a String or Integer, Gaffer allows rich properties such as sketches and sets of timestamps to be stored on Elements. Gaffer's ability to continuously aggregate properties on elements allows interesting, dynamic data structures to be stored within the graph. Examples include storing a HyperLogLog sketch on an Entity to give a very quick estimate of the degree of a node or storing a uniform random sample of the timestamps that an edge was seen active.

Gaffer allows any Java object to be used as a property. If the property is not natively supported by Gaffer, then you will need to provide a serialiser, and possibly an aggregator.
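To give a feel for what a serialiser for a custom property involves, here is a minimal sketch using only the JDK. The `GeoPoint` class and the `PropertySerialiser` interface are hypothetical and exist only for this illustration. Gaffer's own serialiser interface is not reproduced here and may require further methods, so treat this as an outline of the object-to-bytes round trip rather than a drop-in implementation.

```java
import java.nio.ByteBuffer;

// Hypothetical example: neither PropertySerialiser nor GeoPoint is part of Gaffer.
class GeoPointSerialiserSketch {

    /** Hypothetical custom property: a latitude/longitude pair. */
    static final class GeoPoint {
        final double latitude;
        final double longitude;

        GeoPoint(final double latitude, final double longitude) {
            this.latitude = latitude;
            this.longitude = longitude;
        }
    }

    /** Hypothetical stand-in for a serialiser contract: object to bytes and back. */
    interface PropertySerialiser<T> {
        byte[] serialise(T value);

        T deserialise(byte[] bytes);
    }

    /** Writes the two doubles as 16 fixed-width bytes. */
    static final class GeoPointSerialiser implements PropertySerialiser<GeoPoint> {
        @Override
        public byte[] serialise(final GeoPoint value) {
            return ByteBuffer.allocate(2 * Double.BYTES)
                    .putDouble(value.latitude)
                    .putDouble(value.longitude)
                    .array();
        }

        @Override
        public GeoPoint deserialise(final byte[] bytes) {
            // Read the fields back in the order they were written.
            final ByteBuffer buffer = ByteBuffer.wrap(bytes);
            final double latitude = buffer.getDouble();
            final double longitude = buffer.getDouble();
            return new GeoPoint(latitude, longitude);
        }
    }

    public static void main(final String[] args) {
        final GeoPointSerialiser serialiser = new GeoPointSerialiser();
        final byte[] bytes = serialiser.serialise(new GeoPoint(51.9, -2.1));
        final GeoPoint copy = serialiser.deserialise(bytes);
        System.out.println(bytes.length + " bytes -> " + copy.latitude + ", " + copy.longitude);
    }
}
```

A fixed-width encoding keeps the example short. A real serialiser would also need to decide how to handle null or empty values.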

The properties that Gaffer natively supports can be divided into three categories: simple properties, sketches and timestamps.

This documentation gives some examples of how to use all of the above types of property.

Running the Examples

The examples can be run in a similar way to the user and developer examples.

You can download the doc-jar-with-dependencies.jar from Maven Central. Select the latest version and download the jar-with-dependencies.jar file. Alternatively, you can compile the code yourself by running "mvn clean install -Pquick". The doc-jar-with-dependencies.jar file will then be located at doc/target/doc-jar-with-dependencies.jar.

# Replace DoublesSketchWalkthrough with your example name.
java -cp doc-jar-with-dependencies.jar uk.gov.gchq.gaffer.doc.properties.walkthrough.DoublesSketchWalkthrough

Simple properties

Gaffer supports the storage of some common Java objects as properties on entities and edges. These include Integer, Long, Double, Float, Boolean, Date, String, byte[] and TreeSet. Serialisers for these will automatically be added to your schema when you create a graph using a schema that uses these properties. Aggregators for these properties are provided by the Koryphe library and include all the standard functions such as minimum, maximum, sum, etc.

Gaffer also provides a FreqMap property. This is a map from string to long.
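As an illustration of why a map from string to long aggregates naturally, the sketch below merges two such maps by summing the counts key by key. It uses a plain `java.util.HashMap` as a stand-in. The assumption that a frequency map is aggregated by per-key summation is ours, and the class and method names are hypothetical rather than Gaffer's.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration using plain JDK maps, not Gaffer's FreqMap class.
class FreqMapMergeSketch {

    /** Returns a new map holding the per-key sum of the counts in a and b. */
    static Map<String, Long> merge(final Map<String, Long> a, final Map<String, Long> b) {
        final Map<String, Long> merged = new HashMap<>(a);
        for (final Map.Entry<String, Long> entry : b.entrySet()) {
            // Add to the existing count, or insert the key if it is new.
            merged.merge(entry.getKey(), entry.getValue(), Long::sum);
        }
        return merged;
    }

    public static void main(final String[] args) {
        final Map<String, Long> first = new HashMap<>();
        first.put("dog", 2L);
        first.put("cat", 1L);
        final Map<String, Long> second = new HashMap<>();
        second.put("dog", 3L);
        second.put("cow", 4L);
        System.out.println(merge(first, second));
    }
}
```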

The Getting started documentation includes examples of how to use these properties.

Sketches

A sketch is a compact data structure that gives an approximate answer to a question. For example, a HyperLogLog sketch can estimate the cardinality of a set with billions of elements with a small relative error, using orders of magnitude less storage than storing the full set.
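To make the idea concrete, here is a deliberately small HyperLogLog-style estimator written against the JDK only. It is a teaching sketch under our own assumptions (the hash function, the register layout and the class name `TinyHyperLogLog` are all hypothetical), and it is not the implementation used by the Clearspring or Datasketches libraries. It shows the two properties that matter for Gaffer: the state is a fixed-size array of registers, and two sketches can be merged by taking the register-wise maximum.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical teaching example, not the Clearspring or Datasketches implementation.
class TinyHyperLogLog {
    private final int p;            // number of bits used to pick a register
    private final byte[] registers; // 2^p registers, one byte each

    TinyHyperLogLog(final int p) {
        this.p = p;
        this.registers = new byte[1 << p];
    }

    /** FNV-1a over the UTF-8 bytes, followed by a 64-bit finaliser to spread the bits. */
    static long hash(final String item) {
        long h = 0xcbf29ce484222325L;
        for (final byte b : item.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    void add(final String item) {
        final long h = hash(item);
        final int index = (int) (h >>> (64 - p)); // top p bits choose the register
        final long rest = h << p;                 // remaining bits give the rank
        final int rank = (rest == 0) ? (64 - p + 1) : Long.numberOfLeadingZeros(rest) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    /** Merging is a register-wise maximum, so it is cheap and order independent. */
    void merge(final TinyHyperLogLog other) {
        if (other.p != p) {
            throw new IllegalArgumentException("Sketches must use the same precision");
        }
        for (int i = 0; i < registers.length; i++) {
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
    }

    double estimate() {
        final int m = registers.length;
        double sum = 0.0;
        int zeros = 0;
        for (final byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) {
                zeros++;
            }
        }
        final double alpha = 0.7213 / (1.0 + 1.079 / m); // standard constant for m >= 128
        final double raw = alpha * m * m / sum;
        // Small-range correction: fall back to linear counting while registers are empty.
        if (raw <= 2.5 * m && zeros > 0) {
            return m * Math.log((double) m / zeros);
        }
        return raw;
    }

    public static void main(final String[] args) {
        final TinyHyperLogLog sketch = new TinyHyperLogLog(10);
        for (int i = 0; i < 1000; i++) {
            sketch.add("Y" + i);
        }
        System.out.println("Estimated cardinality: " + sketch.estimate());
    }
}
```

With 1024 registers the sketch occupies about 1KB regardless of how many items are added, which is the storage saving the paragraph above refers to.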

Gaffer allows sketches to be stored on Entities and Edges. These sketches can be continually updated as new data arrives. Here are some example applications of sketches in Gaffer:

Gaffer provides serialisers and aggregators for sketches from two different libraries: the Clearspring library and the Datasketches library.

For the Clearspring library, a serialiser and an aggregator are provided for the HyperLogLogPlus sketch. This is an implementation of the HyperLogLog++ algorithm described in this paper.

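For the Datasketches library, serialisers and aggregators are provided for several sketches. These sketches include the HllSketch, LongsSketch, DoublesSketch, ReservoirItemsSketch and theta Sketch, each of which has a walkthrough below.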

Most of the Datasketches sketches come in two forms: a standard sketch form and a "union" form. The latter is technically not a sketch. It is an operator that allows efficient union operations of two sketches. It also allows updating the sketch with individual items. In order to obtain estimates from it, it is necessary to first obtain a sketch from it, using a method called getResult(). There are some interesting trade-offs in the serialisation and aggregation speeds between the sketches and the unions. If in doubt, use the standard sketches. Examples are provided for the standard sketches, but not for the unions.

Timestamps

Gaffer contains a time-library containing some simple properties which allow sets of timestamps to be stored on entities and edges. There are two properties: RBMBackedTimestampSet and BoundedTimestampSet.

Walkthroughs

This section contains examples that show how to use some of the properties described above.

HyperLogLogPlus

The code for this example is HyperLogLogPlusWalkthrough.

This example demonstrates how the HyperLogLogPlus sketch from the Clearspring library can be used to maintain an estimate of the degree of a vertex. Every time an edge A -> B is added to the graph, we also add an Entity for A with a HyperLogLogPlus property containing B, and an Entity for B with a HyperLogLogPlus property containing A. The aggregator for the HyperLogLogPlus sketches merges them together, so that after querying for the Entity for vertex X the HyperLogLogPlus property gives us an estimate of the degree of X.

Elements schema

This is our new elements schema. The entity has a property called 'approxCardinality'. This will store the HyperLogLogPlus object.

{
  "entities": {
    "cardinality": {
      "vertex": "vertex.string",
      "properties": {
        "approxCardinality": "hyperloglogplus"
      }
    }
  }
}

Types schema

We have added a new type - 'hyperloglogplus'. This is a com.clearspring.analytics.stream.cardinality.HyperLogLogPlus object. We also added in the serialiser and aggregator for the HyperLogLogPlus object. Gaffer will automatically aggregate these sketches, using the provided aggregator, so they will keep up to date as new entities are added to the graph.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "hyperloglogplus": {
      "class": "com.clearspring.analytics.stream.cardinality.HyperLogLogPlus",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.sketches.clearspring.cardinality.binaryoperator.HyperLogLogPlusAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.sketches.clearspring.cardinality.serialisation.HyperLogLogPlusSerialiser"
      }
    }
  }
}

Only one entity is in the graph. This was added 1000 times, and each time it had the 'approxCardinality' property containing a vertex that A had been seen in an Edge with. Here is the Entity:

Entity[vertex=A,group=cardinality,properties=Properties[approxCardinality=<com.clearspring.analytics.stream.cardinality.HyperLogLogPlus>com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@756cf158]]

This is not very illuminating as this just shows the default toString() method on the sketch.

We can fetch the cardinality for the vertex using the following code:

final GetElements query = new GetElements.Builder()
        .input(new EntitySeed("A"))
        .build();
final Element element;
try (final CloseableIterable<? extends Element> elements = graph.execute(query, user)) {
    element = elements.iterator().next();
}
final HyperLogLogPlus hyperLogLogPlus = (HyperLogLogPlus) element.getProperty("approxCardinality");
final double approxDegree = hyperLogLogPlus.cardinality();
final String degreeEstimate = "Entity A has approximate degree " + approxDegree;

The results are as follows. As the Entity was added 1000 times, each time with a different vertex, we would expect the degree to be approximately 1000.

Entity A has approximate degree 1113.0

HllSketch

The code for this example is HllSketchWalkthrough.

This example demonstrates how the HllSketch sketch from the Data Sketches library can be used to maintain an estimate of the degree of a vertex. Every time an edge A -> B is added to the graph, we also add an Entity for A with a HllSketch property containing B, and an Entity for B with a HllSketch property containing A. The aggregator for the HllSketches merges them together, so that after querying for the Entity for vertex X the HllSketch property gives us an estimate of the degree of X.

Elements schema

This is our new elements schema. The entity has a property called 'approxCardinality'. This will store the HllSketch object.

{
  "entities": {
    "cardinality": {
      "vertex": "vertex.string",
      "properties": {
        "approxCardinality": "hllsketch"
      }
    }
  }
}

Types schema

We have added a new type - 'hllsketch'. This is a com.yahoo.sketches.hll.HllSketch object. We also added in the serialiser and aggregator for the HllSketch object. Gaffer will automatically aggregate these sketches, using the provided aggregator, so they will keep up to date as new entities are added to the graph.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "hllsketch": {
      "class": "com.yahoo.sketches.hll.HllSketch",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.cardinality.binaryoperator.HllSketchAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.cardinality.serialisation.HllSketchSerialiser"
      }
    }
  }
}

Only one entity is in the graph. This was added 1000 times, and each time it had the 'approxCardinality' property containing a vertex that A had been seen in an Edge with. Here is the Entity:

Entity[vertex=A,group=cardinality,properties=Properties[approxCardinality=<com.yahoo.sketches.hll.HllSketch>### HLL SKETCH SUMMARY: 
  Log Config K   : 10
  Hll Target     : HLL_8
  Current Mode   : HLL
  LB             : 986.7698164613868
  Estimate       : 1018.8398354963819
  UB             : 1053.0644294536246
  OutOfOrder Flag: true
  CurMin         : 0
  NumAtCurMin    : 374
  HipAccum       : 1007.7235730289277
]]

This is not very illuminating as this just shows the default toString() method on the sketch.

We can fetch the cardinality for the vertex using the following code:

final GetElements query = new GetElements.Builder()
        .input(new EntitySeed("A"))
        .build();
final Element element;
try (final CloseableIterable<? extends Element> elements = graph.execute(query, user)) {
    element = elements.iterator().next();
}
final HllSketch hllSketch = (HllSketch) element.getProperty("approxCardinality");
final double approxDegree = hllSketch.getEstimate();
final String degreeEstimate = "Entity A has approximate degree " + approxDegree;

The results are as follows. As the Entity was added 1000 times, each time with a different vertex, we would expect the degree to be approximately 1000.

Entity A has approximate degree 1018.8398354963819

LongsSketch

The code for this example is LongsSketchWalkthrough.

This example demonstrates how the LongsSketch sketch from the Data Sketches library can be used to maintain estimates of the frequencies of longs stored on vertices and edges. For example, suppose that every time an edge is observed there is a long value associated with it which specifies the size of the interaction. Storing all the different longs on the edge could be expensive in storage. Instead we can use a LongsSketch, which will give us approximate counts of the number of times a particular long was observed.

Elements schema

This is our new elements schema. The edge has a property called 'longsSketch'. This will store the LongsSketch object.

{
  "edges": {
    "red": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false",
      "properties": {
        "longsSketch": "longs.sketch"
      }
    }
  }
}

Types schema

We have added a new type - 'longs.sketch'. This is a com.yahoo.sketches.frequencies.LongsSketch object. We also added in the serialiser and aggregator for the LongsSketch object. Gaffer will automatically aggregate these sketches, using the provided aggregator, so they will keep up to date as new edges are added to the graph.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "longs.sketch": {
      "class": "com.yahoo.sketches.frequencies.LongsSketch",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.frequencies.binaryoperator.LongsSketchAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.frequencies.serialisation.LongsSketchSerialiser"
      }
    },
    "false": {
      "class": "java.lang.Boolean",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.IsFalse"
        }
      ]
    }
  }
}

Only one edge is in the graph. This was added 1000 times, and each time it had the 'longsSketch' property containing a randomly generated long between 0 and 9 (inclusive). The sketch does not retain all the individual occurrences of these long values, but allows one to estimate the number of occurrences of the different values. Here is the Edge:

Edge[source=A,destination=B,directed=false,group=red,properties=Properties[longsSketch=<com.yahoo.sketches.frequencies.LongsSketch>FrequentLongsSketch:
  Stream Length    : 1000
  Max Error Offset : 0
ReversePurgeLongHashMap:
         Index:     States              Values Keys
             0:          1                 112 0
             3:          1                  96 6
             5:          1                  92 4
             6:          2                 103 5
             7:          3                  98 9
             8:          2                  91 2
             9:          3                  98 8
            12:          1                 106 1
            13:          1                  99 7
            14:          1                 105 3
]]

This is not very illuminating as this just shows the default toString() method on the sketch. To get value from it we need to call methods on the LongsSketch object. Let's get estimates of the frequencies of the values 1 and 9.

We can fetch the frequency estimates for edge A-B using the following code:

final GetElements query = new GetElements.Builder()
        .input(new EdgeSeed("A", "B", DirectedType.UNDIRECTED))
        .build();
final Element edge;
try (final CloseableIterable<? extends Element> edges = graph.execute(query, user)) {
    edge = edges.iterator().next();
}
final LongsSketch longsSketch = (LongsSketch) edge.getProperty("longsSketch");
final String estimates = "Edge A-B: 1L seen approximately " + longsSketch.getEstimate(1L)
        + " times, 9L seen approximately " + longsSketch.getEstimate(9L) + " times.";

The results are as follows. As the edge was added 1000 times, each time with a long randomly sampled from 0 to 9, we would expect each value to occur approximately 100 times.

Edge A-B: 1L seen approximately 106 times, 9L seen approximately 98 times.
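For readers who want to see how frequencies can be tracked in bounded memory at all, the following is a minimal Misra-Gries summary, a classic technique for this problem. We name it here as a stand-in for illustration only: we are not claiming that it is the algorithm inside LongsSketch, and the class name and capacity parameter are hypothetical. The summary keeps at most `capacity` counters, and its estimates never exceed the true counts.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical illustration of bounded-memory frequency counting (Misra-Gries).
// This is not the LongsSketch implementation.
class MisraGriesSketch {
    private final int capacity;
    private final Map<Long, Long> counters = new HashMap<>();

    MisraGriesSketch(final int capacity) {
        this.capacity = capacity;
    }

    void update(final long item) {
        if (counters.containsKey(item) || counters.size() < capacity) {
            // Either the item is already tracked or there is a free counter.
            counters.merge(item, 1L, Long::sum);
            return;
        }
        // No free counter: decrement every counter and drop those that reach zero.
        final Iterator<Map.Entry<Long, Long>> it = counters.entrySet().iterator();
        while (it.hasNext()) {
            final Map.Entry<Long, Long> entry = it.next();
            if (entry.getValue() == 1L) {
                it.remove();
            } else {
                entry.setValue(entry.getValue() - 1L);
            }
        }
    }

    /** Returns a lower bound on the number of times the item was seen. */
    long estimate(final long item) {
        return counters.getOrDefault(item, 0L);
    }

    public static void main(final String[] args) {
        final MisraGriesSketch sketch = new MisraGriesSketch(16);
        for (int i = 0; i < 1000; i++) {
            sketch.update(i % 10);
        }
        System.out.println("1L seen approximately " + sketch.estimate(1L) + " times");
    }
}
```

When the number of distinct values is no larger than the capacity, as in the usage above, no counter is ever decremented and the counts are exact.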

DoublesSketch

The code for this example is DoublesSketchWalkthrough.

This example demonstrates how the DoublesSketch sketch from the Data Sketches library can be used to maintain estimates of the quantiles of a distribution of doubles. Suppose that every time an edge is observed, there is a double value associated with it, for example a value between 0 and 1 giving the score of the edge. Instead of storing a property that contains all the doubles observed, we can store a DoublesSketch which will allow us to estimate the median double, the 99th percentile, etc.

Elements schema

This is our new elements schema. The edge has a property called 'doublesSketch'. This will store the DoublesSketch object.

{
  "edges": {
    "red": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false",
      "properties": {
        "doublesSketch": "doubles.sketch"
      }
    }
  }
}

Types schema

We have added a new type - 'doubles.sketch'. This is a com.yahoo.sketches.quantiles.DoublesSketch object. We also added in the serialiser and aggregator for the DoublesSketch object. Gaffer will automatically aggregate these sketches, using the provided aggregator, so they will keep up to date as new edges are added to the graph.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "doubles.sketch": {
      "class": "com.yahoo.sketches.quantiles.DoublesSketch",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.quantiles.binaryoperator.DoublesSketchAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.quantiles.serialisation.DoublesSketchSerialiser"
      }
    },
    "false": {
      "class": "java.lang.Boolean",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.IsFalse"
        }
      ]
    }
  }
}

Here is the Edge:

Edge[source=A,destination=B,directed=false,group=red,properties=Properties[doublesSketch=<com.yahoo.sketches.quantiles.DirectUpdateDoublesSketchR>
### Quantiles DirectUpdateDoublesSketchR SUMMARY: 
   Empty                        : false
   Direct, Capacity bytes       : true, 4128
   Estimation Mode              : true
   K                            : 128
   N                            : 1,000
   Levels (Needed, Total, Valid): 2, 2, 2
   Level Bit Pattern            : 11
   BaseBufferCount              : 232
   Combined Buffer Capacity     : 512
   Retained Items               : 488
   Compact Storage Bytes        : 3,936
   Updatable Storage Bytes      : 4,128
   Normalized Rank Error        : 1.725%
   Min Value                    : -3.148
   Max Value                    : 3.112
### END SKETCH SUMMARY
]]

This is not very illuminating as this just shows the default toString() method on the sketch. To get value from it we need to call methods on the DoublesSketch object. We can get an estimate for the 25th, 50th and 75th percentiles on edge A-B using the following code:

final GetElements query = new GetElements.Builder()
        .input(new EdgeSeed("A", "B", DirectedType.UNDIRECTED))
        .build();
final Element edge;
try (final CloseableIterable<? extends Element> edges = graph.execute(query, user)) {
    edge = edges.iterator().next();
}
final DoublesSketch doublesSketch = (DoublesSketch) edge.getProperty("doublesSketch");
final double[] quantiles = doublesSketch.getQuantiles(new double[]{0.25D, 0.5D, 0.75D});
final String quantilesEstimate = "Edge A-B with percentiles of double property - 25th percentile: " + quantiles[0]
        + ", 50th percentile: " + quantiles[1]
        + ", 75th percentile: " + quantiles[2];

The results are as follows. This means that 25% of all the doubles on edge A-B had a value less than -0.66, 50% had a value less than -0.01 and 75% had a value less than 0.64 (the results of the estimation are not deterministic so there may be small differences between the values below and those just quoted).

Edge A-B with percentiles of double property - 25th percentile: -0.6630847714290219, 50th percentile: -0.009776218111167738, 75th percentile: 0.6311663168517678

We can also get the cumulative distribution function (CDF) of the distribution of the doubles:

final GetElements query2 = new GetElements.Builder()
        .input(new EdgeSeed("A", "B", DirectedType.UNDIRECTED))
        .build();
final Element edge2;
try (final CloseableIterable<? extends Element> edges2 = graph.execute(query2, user)) {
    edge2 = edges2.iterator().next();
}
final DoublesSketch doublesSketch2 = (DoublesSketch) edge2.getProperty("doublesSketch");
final double[] cdf = doublesSketch2.getCDF(new double[]{0.0D, 1.0D, 2.0D});
final String cdfEstimate = "Edge A-B with CDF values at 0: " + cdf[0]
        + ", at 1: " + cdf[1]
        + ", at 2: " + cdf[2];

The results are:

Edge A-B with CDF values at 0: 0.507, at 1: 0.843, at 2: 0.983
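To clarify what these numbers approximate, the sketch below computes the exact counterparts on a small sorted array: a quantile is the value found at a given fraction of the way through the sorted data, and the CDF at a split point is the fraction of values below it. The class and method names are hypothetical and the index convention is one simple choice among several. The DoublesSketch answers the same questions approximately without retaining every value.

```java
import java.util.Arrays;

// Hypothetical exact counterpart of the quantile and CDF estimates, for illustration only.
class ExactQuantiles {

    /** Value found at the given fraction (0.0 to 1.0) of the way through the sorted data. */
    static double quantile(final double[] sorted, final double fraction) {
        final int index = Math.min(sorted.length - 1, (int) (fraction * sorted.length));
        return sorted[index];
    }

    /** Fraction of values that are strictly less than the split point. */
    static double cdf(final double[] sorted, final double splitPoint) {
        int below = 0;
        for (final double value : sorted) {
            if (value < splitPoint) {
                below++;
            }
        }
        return (double) below / sorted.length;
    }

    public static void main(final String[] args) {
        final double[] values = new double[100];
        for (int i = 0; i < values.length; i++) {
            values[i] = 99 - i; // deliberately unsorted input
        }
        Arrays.sort(values);
        System.out.println("Median: " + quantile(values, 0.5));
        System.out.println("CDF at 25: " + cdf(values, 25.0));
    }
}
```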

ReservoirItemsSketch

The code for this example is ReservoirItemsSketchWalkthrough.

This example demonstrates how the ReservoirItemsSketch sketch from the Data Sketches library can be used to maintain a sample of the values of a property on vertices and edges. Here the ReservoirItemsSketch is used to maintain a sample of a set of strings. We give two examples of this. In the first, every time an edge is observed there is a string property associated with it, and there are a lot of different values of that string. We may not want to store all the different values of the string, but we may want to see a random sample of them. In the second, we store on an Entity a sketch which gives a sample of the vertices that are connected to the vertex. Even if we are storing all the edges, producing a random sample of the vertices attached to a vertex may not be quick. For example, if a vertex has degree 10,000 then producing a random sample of 10 neighbours would require scanning all the edges. Storing the sketch on the Entity means that the sample is precomputed and can be returned without scanning the edges.

Elements schema

This is our new elements schema. The edge has a property called 'stringsSample'. This will store the ReservoirItemsSketch object. The entity has a property called 'neighboursSample'. This will also store a ReservoirItemsSketch object.

{
  "entities": {
    "blueEntity": {
      "vertex": "vertex.string",
      "properties": {
        "neighboursSample": "reservoir.strings.sketch"
      }
    }
  },
  "edges": {
    "red": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false",
      "properties": {
        "stringsSample": "reservoir.strings.sketch"
      }
    },
    "blue": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false"
    }
  }
}

Types schema

We have added a new type - 'reservoir.strings.sketch'. This is a com.yahoo.sketches.sampling.ReservoirItemsSketch object. We also added in the serialiser and aggregator for the ReservoirItemsSketch object. Gaffer will automatically aggregate these sketches, using the provided aggregator, so they will keep up to date as new edges are added to the graph.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "reservoir.strings.sketch": {
      "class": "com.yahoo.sketches.sampling.ReservoirItemsSketch",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.sampling.binaryoperator.ReservoirItemsSketchAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.sampling.serialisation.ReservoirStringsSketchSerialiser"
      }
    },
    "false": {
      "class": "java.lang.Boolean",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.IsFalse"
        }
      ]
    }
  }
}

An edge A-B of group "red" was added to the graph 1000 times. Each time it had the stringsSample property containing a randomly generated string. Here is the edge:

Edge[source=A,destination=B,directed=false,group=red,properties=Properties[stringsSample=<com.yahoo.sketches.sampling.ReservoirItemsSketch>
### ReservoirItemsSketch SUMMARY: 
   k            : 20
   n            : 1000
   Current size : 20
   Resize factor: X8
### END SKETCH SUMMARY
]]

This is not very illuminating as this just shows the default toString() method on the sketch. To get value from it we need to call a method on the ReservoirItemsSketch object:

final GetElements query = new GetElements.Builder()
        .input(new EdgeSeed("A", "B", DirectedType.UNDIRECTED))
        .build();
final Element edge;
try (final CloseableIterable<? extends Element> edges = graph.execute(query, user)) {
    edge = edges.iterator().next();
}
final ReservoirItemsSketch<String> stringsSketch = ((ReservoirItemsSketch<String>) edge.getProperty("stringsSample"));
final String[] samples = stringsSketch.getSamples();
final StringBuilder sb = new StringBuilder("10 samples: ");
for (int i = 0; i < 10 && i < samples.length; i++) {
    if (i > 0) {
        sb.append(", ");
    }
    sb.append(samples[i]);
}

The results contain a random sample of the strings added to the edge:

10 samples: FEFFDCDEJJ, GGGBFDBFBH, JEBEDEABAA, HFBEIJICIJ, BJBGDEAHCF, AEFGHBBJAJ, DHAFCGAAFC, BJBIAAJCBI, CBBHBFBHDA, CADFDIIGCC

500 edges of group "blue" were also added to the graph (edges X-Y0, X-Y1, ..., X-Y499). For each of these edges, an Entity was created for both the source and destination. Each Entity contained a 'neighboursSample' property that contains the vertex at the other end of the edge. We now get the Entity for the vertex X and display the sample of its neighbours:

final GetElements query2 = new GetElements.Builder()
        .input(new EntitySeed("X"))
        .build();
final Element entity;
try (final CloseableIterable<? extends Element> entities = graph.execute(query2, user)) {
    entity = entities.iterator().next();
}
final ReservoirItemsSketch<String> neighboursSketch = ((ReservoirItemsSketch<String>) entity.getProperty("neighboursSample"));
final String[] neighboursSample = neighboursSketch.getSamples();
sb.setLength(0);
sb.append("10 samples: ");
for (int i = 0; i < 10 && i < neighboursSample.length; i++) {
    if (i > 0) {
        sb.append(", ");
    }
    sb.append(neighboursSample[i]);
}

The results are:

10 samples: Y499, Y351, Y430, Y472, Y58, Y467, Y166, Y144, Y389, Y9
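The idea behind keeping a fixed-size uniform sample of a stream can be shown with the classic reservoir sampling scheme (often called Algorithm R). The version below is a minimal JDK-only sketch with hypothetical names. It is offered as background on the technique, and we are not claiming that it mirrors the internals of ReservoirItemsSketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of reservoir sampling, not the ReservoirItemsSketch implementation.
class ReservoirSamplingSketch {

    /** Returns a uniform random sample of at most k items from the stream. */
    static <T> List<T> sample(final Iterable<T> stream, final int k, final Random random) {
        final List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        for (final T item : stream) {
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Pick a slot in [0, seen); the new item is kept with probability k / seen.
                final long slot = (long) (random.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(final String[] args) {
        final List<String> neighbours = new ArrayList<>();
        for (int i = 0; i < 500; i++) {
            neighbours.add("Y" + i);
        }
        System.out.println(sample(neighbours, 10, new Random()));
    }
}
```

Only k items are held at any time, however long the stream is, which is why such a sample can be stored cheaply as a property.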

ThetaSketch

The code for this example is ThetaSketchWalkthrough.

This example demonstrates how the com.yahoo.sketches.theta.Sketch sketch from the Data Sketches library can be used to maintain estimates of the cardinalities of sets. This sketch is similar to a HyperLogLogPlus sketch, but it can also be used to estimate the size of the intersections of sets. We give an example of how this can be used to monitor the changes to the number of edges in the graph over time.

Elements schema

This is our new elements schema. The edge and the entity both have properties called 'startDate' and 'endDate'. These will be set to the midnight before the time of the occurrence of the edge and to the midnight after it. The entity also has a 'size' property, which will be a theta Sketch. This property will be aggregated over the 'groupBy' properties of startDate and endDate.

{
  "entities": {
    "size": {
      "vertex": "vertex.string",
      "properties": {
        "startDate": "date.earliest",
        "endDate": "date.latest",
        "size": "thetasketch"
      },
      "groupBy": [
        "startDate",
        "endDate"
      ]
    }
  },
  "edges": {
    "red": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false",
      "properties": {
        "startDate": "date.earliest",
        "endDate": "date.latest",
        "count": "long.count"
      },
      "groupBy": [
        "startDate",
        "endDate"
      ]
    }
  }
}

Types schema

We have added a new type - 'thetasketch'. This is a com.yahoo.sketches.theta.Sketch object. We also added in the serialiser and aggregator for the Sketch object. Gaffer will automatically aggregate these sketches, using the provided aggregator, so they will keep up to date as new edges are added to the graph.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "date.earliest": {
      "class": "java.util.Date",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ],
      "aggregateFunction": {
        "class": "uk.gov.gchq.koryphe.impl.binaryoperator.Min"
      }
    },
    "date.latest": {
      "class": "java.util.Date",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ],
      "aggregateFunction": {
        "class": "uk.gov.gchq.koryphe.impl.binaryoperator.Max"
      }
    },
    "long.count": {
      "class": "java.lang.Long",
      "aggregateFunction": {
        "class": "uk.gov.gchq.koryphe.impl.binaryoperator.Sum"
      }
    },
    "thetasketch": {
      "class": "com.yahoo.sketches.theta.Sketch",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.theta.binaryoperator.SketchAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.theta.serialisation.SketchSerialiser"
      }
    },
    "false": {
      "class": "java.lang.Boolean",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.IsFalse"
        }
      ]
    }
  }
}

1000 different edges were added to the graph for the day 09/01/2017 (i.e. the startDate was the midnight at the start of the 9th, and the endDate was the midnight at the end of the 9th). For each edge, an Entity was created, with a vertex called "graph". This contained a theta Sketch object to which a string consisting of the source and destination was added. 500 edges were added to the graph for the day 10/01/2017. Of these, 250 were the same as edges that had been added on the previous day, but 250 were new. Again, for each edge, an Entity was created for the vertex called "graph".

Here are the Entities for the two days:

Entity[vertex=graph,group=size,properties=Properties[size=<com.yahoo.sketches.theta.DirectCompactOrderedSketch>
### DirectCompactOrderedSketch SUMMARY: 
   Estimate                : 1000.0
   Upper Bound, 95% conf   : 1000.0
   Lower Bound, 95% conf   : 1000.0
   Theta (double)          : 1.0
   Theta (long)            : 9223372036854775807
   Theta (long) hex        : 7fffffffffffffff
   EstMode?                : false
   Empty?                  : false
   Array Size Entries      : 1000
   Retained Entries        : 1000
   Seed Hash               : 93cc
### END SKETCH SUMMARY
,endDate=<java.util.Date>Tue Jan 10 00:00:00 GMT 2017,startDate=<java.util.Date>Mon Jan 09 00:00:00 GMT 2017]]
Entity[vertex=graph,group=size,properties=Properties[size=<com.yahoo.sketches.theta.DirectCompactOrderedSketch>
### DirectCompactOrderedSketch SUMMARY: 
   Estimate                : 500.0
   Upper Bound, 95% conf   : 500.0
   Lower Bound, 95% conf   : 500.0
   Theta (double)          : 1.0
   Theta (long)            : 9223372036854775807
   Theta (long) hex        : 7fffffffffffffff
   EstMode?                : false
   Empty?                  : false
   Array Size Entries      : 500
   Retained Entries        : 500
   Seed Hash               : 93cc
### END SKETCH SUMMARY
,endDate=<java.util.Date>Wed Jan 11 00:00:00 GMT 2017,startDate=<java.util.Date>Tue Jan 10 00:00:00 GMT 2017]]

This is not very illuminating, as it just shows the output of the default toString() method on the sketch. To get value from it we need to call a method on the Sketch object:

final GetAllElements getAllEntities2 = new GetAllElements.Builder()
        .view(new View.Builder()
                .entity("size")
                .build())
        .build();
final CloseableIterable<? extends Element> allEntities2 = graph.execute(getAllEntities2, user);
final CloseableIterator<? extends Element> it = allEntities2.iterator();
final Element entityDay1 = it.next();
final Sketch sketchDay1 = ((Sketch) entityDay1.getProperty("size"));
final Element entityDay2 = it.next();
final Sketch sketchDay2 = ((Sketch) entityDay2.getProperty("size"));
final double estimateDay1 = sketchDay1.getEstimate();
final double estimateDay2 = sketchDay2.getEstimate();

The result is:

1000.0
500.0

Now we can get an estimate for the number of edges in common across the two days:

final Intersection intersection = Sketches.setOperationBuilder().buildIntersection();
intersection.update(sketchDay1);
intersection.update(sketchDay2);
final double intersectionSizeEstimate = intersection.getResult().getEstimate();

The result is:

250.0

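We can now get an estimate for the total number of edges across the two days by simply aggregating over all the properties. Setting the group-by properties to 'none' in the view means the Entities for the two days are aggregated together: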

final GetAllElements getAllEntities = new GetAllElements.Builder()
        .view(new View.Builder()
                .entity("size", new ViewElementDefinition.Builder()
                        .groupBy() // set the group by properties to 'none'
                        .build())
                .build())
        .build();
final Element entity;
try (final CloseableIterable<? extends Element> allEntities = graph.execute(getAllEntities, user)) {
    entity = allEntities.iterator().next();
}
final double unionSizeEstimate = ((Sketch) entity.getProperty("size")).getEstimate();

The result is:

1250.0
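
The three estimates are consistent with each other: 1000 + 500 - 250 = 1250. The sketches above report "EstMode? : false", so in this small example the estimates are exact. The following is a minimal sketch that checks the same arithmetic with exact java.util.HashSet objects standing in for the theta sketches. It is not how the walkthrough generates its data; the class name, the helper methods and the edge strings are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: exact HashSets stand in for the theta sketches used in the walkthrough.
final class EdgeOverlapCheck {

    // Hypothetical helper: builds one string per edge for the edge numbers in [from, to).
    static Set<String> edgeKeys(final int from, final int to) {
        final Set<String> keys = new HashSet<>();
        for (int i = from; i < to; i++) {
            keys.add("A-B" + i); // made-up "source-destination" string
        }
        return keys;
    }

    // Number of edges present on both days.
    static int intersectionSize(final Set<String> a, final Set<String> b) {
        final Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // Number of distinct edges seen on either day.
    static int unionSize(final Set<String> a, final Set<String> b) {
        final Set<String> all = new HashSet<>(a);
        all.addAll(b);
        return all.size();
    }

    public static void main(final String[] args) {
        final Set<String> day1 = edgeKeys(0, 1000);   // 1000 edges on day 1
        final Set<String> day2 = edgeKeys(750, 1250); // 500 edges on day 2, 250 of them shared with day 1
        System.out.println(day1.size());                   // 1000
        System.out.println(day2.size());                   // 500
        System.out.println(intersectionSize(day1, day2));  // 250
        System.out.println(unionSize(day1, day2));         // 1250
    }
}
```

Unlike these exact sets, a theta sketch keeps only a bounded number of hashed entries, so for much larger inputs it would return approximate answers with the error bounds shown in the sketch summary.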

RBMBackedTimestampSet

The code for this example is TimestampSetWalkthrough.

This example demonstrates how the TimestampSet property can be used to maintain a set of the timestamps at which an element was seen active. In this example we record the timestamps to minute level accuracy, i.e. the seconds are ignored.

Elements schema

This is our new elements schema. The edge has a property called 'timestampSet'. This will store the TimestampSet object, which is actually an RBMBackedTimestampSet.

{
  "edges": {
    "red": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false",
      "properties": {
        "timestampSet": "timestamp.set"
      }
    }
  }
}

Types schema

We have added a new type - 'timestamp.set'. This is a uk.gov.gchq.gaffer.time.RBMBackedTimestampSet object. We also added in the serialiser and aggregator for the RBMBackedTimestampSet object. Gaffer will automatically aggregate these sets together to maintain a set of all the times the element was active.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "timestamp.set": {
      "class": "uk.gov.gchq.gaffer.time.RBMBackedTimestampSet",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.time.binaryoperator.RBMBackedTimestampSetAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.time.serialisation.RBMBackedTimestampSetSerialiser"
      }
    },
    "false": {
      "class": "java.lang.Boolean",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.IsFalse"
        }
      ]
    }
  }
}

Only one edge is in the graph. This was added 25 times, and each time it had the 'timestampSet' property containing a randomly generated timestamp from 2017. Here is the Edge:

Edge[source=A,destination=B,directed=false,group=red,properties=Properties[timestampSet=<uk.gov.gchq.gaffer.time.RBMBackedTimestampSet>RBMBackedTimestampSet[timeBucket=MINUTE,timestamps=2017-01-08T07:29:00Z,2017-01-18T10:41:00Z,2017-01-19T01:36:00Z,2017-01-31T16:16:00Z,2017-02-02T08:06:00Z,2017-02-12T14:21:00Z,2017-02-15T22:01:00Z,2017-03-06T09:03:00Z,2017-03-21T18:09:00Z,2017-05-08T15:34:00Z,2017-05-10T19:39:00Z,2017-05-16T10:44:00Z,2017-05-23T10:02:00Z,2017-05-28T01:52:00Z,2017-06-24T23:50:00Z,2017-07-27T09:34:00Z,2017-08-05T02:11:00Z,2017-09-07T07:35:00Z,2017-10-01T12:52:00Z,2017-10-23T22:02:00Z,2017-10-27T04:12:00Z,2017-11-01T02:45:00Z,2017-12-11T16:38:00Z,2017-12-22T14:40:00Z,2017-12-24T08:00:00Z]]]

You can see the list of timestamps on the edge. We can also get just the earliest, latest and total number of timestamps using methods on the TimestampSet object to get the following results:

Edge A-B was first seen at 2017-01-08T07:29:00Z, last seen at 2017-12-24T08:00:00Z, and there were 25 timestamps it was active.
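
To make the idea of minute-level bucketing and set aggregation concrete, here is a minimal sketch using only the Java standard library. It is not Gaffer's RBMBackedTimestampSet: the class and method names are hypothetical, and truncating each timestamp to the start of its minute is an assumption about what "the seconds are ignored" means.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.TreeSet;

// Illustrative only: a sorted set of timestamps bucketed to the minute.
final class MinuteBucketedTimestamps {
    private final TreeSet<Instant> timestamps = new TreeSet<>();

    // Drop the seconds (and smaller units) before storing, so timestamps
    // within the same minute collapse to a single entry.
    void add(final Instant instant) {
        timestamps.add(instant.truncatedTo(ChronoUnit.MINUTES));
    }

    // Aggregating two sets is a set union.
    void addAll(final MinuteBucketedTimestamps other) {
        timestamps.addAll(other.timestamps);
    }

    Instant earliest() {
        return timestamps.first();
    }

    Instant latest() {
        return timestamps.last();
    }

    int size() {
        return timestamps.size();
    }

    public static void main(final String[] args) {
        final MinuteBucketedTimestamps set = new MinuteBucketedTimestamps();
        set.add(Instant.parse("2017-01-08T07:29:45Z"));
        set.add(Instant.parse("2017-01-08T07:29:10Z")); // same minute as the previous timestamp
        set.add(Instant.parse("2017-12-24T08:00:30Z"));
        System.out.println("First seen " + set.earliest()
                + ", last seen " + set.latest()
                + ", " + set.size() + " distinct minutes");
    }
}
```

The sorted set makes the earliest and latest timestamps cheap to read, which mirrors the summary line printed above.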

BoundedTimestampSet

The code for this example is BoundedTimestampSetWalkthrough.

This example demonstrates how the BoundedTimestampSet property can be used to maintain a set of the timestamps at which an element was seen active. If this set becomes larger than a size specified by the user then a uniform random sample of the timestamps is maintained. In this example we record the timestamps to minute level accuracy, i.e. the seconds are ignored, and specify that at most 25 timestamps should be retained.

Elements schema

This is our new elements schema. The edge has a property called 'boundedTimestampSet'. This will store the BoundedTimestampSet object.

{
  "edges": {
    "red": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false",
      "properties": {
        "boundedTimestampSet": "bounded.timestamp.set"
      }
    }
  }
}

Types schema

We have added a new type - 'bounded.timestamp.set'. This is a uk.gov.gchq.gaffer.time.BoundedTimestampSet object. We have also added the serialiser and aggregator for the BoundedTimestampSet object. Gaffer will automatically aggregate these sets together to maintain a set of all the times the element was active. Once the size of the set becomes larger than 25, a uniform random sample of at most 25 of the timestamps is maintained.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "bounded.timestamp.set": {
      "class": "uk.gov.gchq.gaffer.time.BoundedTimestampSet",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.time.binaryoperator.BoundedTimestampSetAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.time.serialisation.BoundedTimestampSetSerialiser"
      }
    },
    "false": {
      "class": "java.lang.Boolean",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.IsFalse"
        }
      ]
    }
  }
}

There are two edges in the graph. Edge A-B was added 3 times, and each time it had the 'boundedTimestampSet' property containing a randomly generated timestamp from 2017. Edge A-C was added 1000 times, and each time it also had the 'boundedTimestampSet' property containing a randomly generated timestamp from 2017. Here are the edges:

Edge[source=A,destination=B,directed=false,group=red,properties=Properties[boundedTimestampSet=<uk.gov.gchq.gaffer.time.BoundedTimestampSet>BoundedTimestampSet[timeBucket=MINUTE,state=NOT_FULL,maxSize=25,timestamps=2017-02-12T14:21:00Z,2017-03-21T18:09:00Z,2017-12-24T08:00:00Z]]]
Edge[source=A,destination=C,directed=false,group=red,properties=Properties[boundedTimestampSet=<uk.gov.gchq.gaffer.time.BoundedTimestampSet>BoundedTimestampSet[timeBucket=MINUTE,state=SAMPLE,maxSize=25,timestamps=2017-01-01T23:00:00Z,2017-01-15T04:54:00Z,2017-01-19T14:48:00Z,2017-01-20T22:54:00Z,2017-02-03T14:58:00Z,2017-02-18T13:02:00Z,2017-02-28T11:57:00Z,2017-03-23T08:49:00Z,2017-03-28T19:44:00Z,2017-04-21T22:01:00Z,2017-04-24T19:45:00Z,2017-04-26T22:03:00Z,2017-07-20T12:47:00Z,2017-07-29T17:40:00Z,2017-08-12T04:03:00Z,2017-09-09T04:08:00Z,2017-09-27T07:08:00Z,2017-10-16T17:31:00Z,2017-10-29T22:33:00Z,2017-10-30T02:15:00Z,2017-11-20T00:54:00Z,2017-11-20T17:11:00Z,2017-11-25T13:04:00Z,2017-12-18T08:07:00Z,2017-12-26T21:40:00Z]]]

You can see that edge A-B has the full list of timestamps on the edge, but edge A-C has a sample of the timestamps.
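
The switch from keeping every timestamp to keeping a bounded uniform sample can be illustrated with classic reservoir sampling. The following is a minimal sketch under stated assumptions, not the BoundedTimestampSet implementation: the class and method names are hypothetical, timestamps are represented as plain long values, and duplicate values are not removed as they would be in a true set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative only: keeps every value until maxSize is exceeded,
// then maintains a uniform random sample of exactly maxSize values.
final class BoundedSample {
    private final int maxSize;
    private final List<Long> sample = new ArrayList<>();
    private final Random random;
    private long seen; // total number of values offered so far

    BoundedSample(final int maxSize, final long seed) {
        this.maxSize = maxSize;
        this.random = new Random(seed);
    }

    void add(final long value) {
        seen++;
        if (sample.size() < maxSize) {
            sample.add(value); // not yet full: keep everything
        } else {
            // Reservoir sampling (Algorithm R): pick an index uniformly in [0, seen)
            // and replace that slot only if the index falls inside the reservoir.
            final long j = (long) (random.nextDouble() * seen);
            if (j < maxSize) {
                sample.set((int) j, value);
            }
        }
    }

    // True once more values have been offered than can be retained.
    boolean isSampling() {
        return seen > maxSize;
    }

    List<Long> values() {
        return new ArrayList<>(sample);
    }

    public static void main(final String[] args) {
        final BoundedSample ab = new BoundedSample(25, 1L);
        for (long i = 0; i < 3; i++) {
            ab.add(i);
        }
        final BoundedSample ac = new BoundedSample(25, 2L);
        for (long i = 0; i < 1000; i++) {
            ac.add(i);
        }
        System.out.println(ab.values().size() + " sampling=" + ab.isSampling()); // 3 sampling=false
        System.out.println(ac.values().size() + " sampling=" + ac.isSampling()); // 25 sampling=true
    }
}
```

This mirrors the two edges above: the set that received 3 values retains all of them, while the set that received 1000 values retains a sample of 25.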

Predicates, aggregators and serialisers

String

Properties class: java.lang.String

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

Long

Properties class: java.lang.Long

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

Integer

Properties class: java.lang.Integer

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

Double

Properties class: java.lang.Double

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

Float

Properties class: java.lang.Float

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

Byte[]

Properties class: [Ljava.lang.Byte;

Predicates:

Aggregators:

To Bytes Serialisers:

Boolean

Properties class: java.lang.Boolean

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

Date

Properties class: java.util.Date

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

TypeValue

Properties class: uk.gov.gchq.gaffer.types.TypeValue

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

TypeSubTypeValue

Properties class: uk.gov.gchq.gaffer.types.TypeSubTypeValue

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

FreqMap

Properties class: uk.gov.gchq.gaffer.types.FreqMap

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

HashMap

Properties class: java.util.HashMap

Predicates:

Aggregators:

To Bytes Serialisers:

TreeSet

Properties class: java.util.TreeSet

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

HyperLogLogPlus

Properties class: com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

HllSketch

Properties class: com.yahoo.sketches.hll.HllSketch

Predicates:

Aggregators:

To Bytes Serialisers:

LongsSketch

Properties class: com.yahoo.sketches.frequencies.LongsSketch

Predicates:

Aggregators:

To Bytes Serialisers:

DoublesSketch

Properties class: com.yahoo.sketches.quantiles.DoublesSketch

Predicates:

Aggregators:

To Bytes Serialisers:

ReservoirItemsSketch

Properties class: com.yahoo.sketches.sampling.ReservoirItemsSketch

Predicates:

Aggregators:

To Bytes Serialisers:

Sketch

Properties class: com.yahoo.sketches.theta.Sketch

Predicates:

Aggregators:

To Bytes Serialisers:

RBMBackedTimestampSet

Properties class: uk.gov.gchq.gaffer.time.RBMBackedTimestampSet

Predicates:

Aggregators:

To Bytes Serialisers:

BoundedTimestampSet

Properties class: uk.gov.gchq.gaffer.time.BoundedTimestampSet

Predicates:

Aggregators:

To Bytes Serialisers:

p013570 commented 6 years ago

Merged into develop.