Update Properties Guide with recommended property types

p013570 commented 6 years ago

We have serialisers and aggregators for several different java classes. These should be listed in the Properties Guide alongside some simple examples of how to use them.

p013570 commented 6 years ago

Update Properties Guide:

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This page has been generated from code. To make any changes please update the walkthrough docs in the doc module, run it and replace the content of this page with the output.

Introduction
Walkthroughs
1. String
2. Long
3. Integer
4. Double
5. Float
6. Byte[]
7. Boolean
8. Date
9. TypeValue
10. TypeSubTypeValue
11. FreqMap
12. HashMap
13. TreeSet
14. HyperLogLogPlus
15. RoaringBitmap
16. DoublesUnion
17. LongsSketch
18. Union
19. ReservoirItemsUnion
20. RBMBackedTimestampSet
21. BoundedTimestampSet

Introduction

This properties documentation discusses some advanced property types that work nicely in Gaffer.

Running the Examples

The examples can be run in a similar way to the user and developer examples.

You can download the doc-jar-with-dependencies.jar from Maven Central. Select the latest version and download the jar-with-dependencies.jar file. Alternatively, you can compile the code yourself by running "mvn clean install -Pquick". The doc-jar-with-dependencies.jar file will then be located at doc/target/doc-jar-with-dependencies.jar.

# Replace <DoublesUnion> with your example name.
java -cp doc-jar-with-dependencies.jar uk.gov.gchq.gaffer.doc.properties.dev.walkthrough.DoublesUnion

Walkthroughs

String

Properties class: java.lang.String

Predicates:

Aggregators:

To Bytes Serialisers:

uk.gov.gchq.gaffer.serialisation.implementation.StringSerialiser

Other Serialisers:

Long

Properties class: java.lang.Long

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

uk.gov.gchq.gaffer.parquetstore.serialisation.impl.LongParquetSerialiser

Integer

Properties class: java.lang.Integer

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

uk.gov.gchq.gaffer.parquetstore.serialisation.impl.IntegerParquetSerialiser

Double

Properties class: java.lang.Double

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

uk.gov.gchq.gaffer.parquetstore.serialisation.impl.DoubleParquetSerialiser

Float

Properties class: java.lang.Float

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

uk.gov.gchq.gaffer.parquetstore.serialisation.impl.FloatParquetSerialiser

Byte[]

Properties class: [Ljava.lang.Byte;

Predicates:

Aggregators:

uk.gov.gchq.koryphe.impl.binaryoperator.First

To Bytes Serialisers:

Boolean

Properties class: java.lang.Boolean

Predicates:

Aggregators:

To Bytes Serialisers:

uk.gov.gchq.gaffer.serialisation.implementation.BooleanSerialiser

Other Serialisers:

uk.gov.gchq.gaffer.parquetstore.serialisation.impl.BooleanParquetSerialiser

Date

Properties class: java.util.Date

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

uk.gov.gchq.gaffer.parquetstore.serialisation.impl.DateParquetSerialiser

TypeValue

Properties class: uk.gov.gchq.gaffer.types.TypeValue

Predicates:

Aggregators:

To Bytes Serialisers:

uk.gov.gchq.gaffer.serialisation.TypeValueSerialiser

Other Serialisers:

uk.gov.gchq.gaffer.parquetstore.serialisation.impl.TypeValueParquetSerialiser

TypeSubTypeValue

Properties class: uk.gov.gchq.gaffer.types.TypeSubTypeValue

Predicates:

Aggregators:

To Bytes Serialisers:

uk.gov.gchq.gaffer.serialisation.TypeSubTypeValueSerialiser

FreqMap

Properties class: uk.gov.gchq.gaffer.types.FreqMap

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

uk.gov.gchq.gaffer.parquetstore.serialisation.impl.FreqMapParquetSerialiser

HashMap

Properties class: java.util.HashMap

Predicates:

Aggregators:

To Bytes Serialisers:

uk.gov.gchq.gaffer.serialisation.implementation.MapSerialiser

TreeSet

Properties class: java.util.TreeSet

Predicates:

Aggregators:

To Bytes Serialisers:

Other Serialisers:

uk.gov.gchq.gaffer.parquetstore.serialisation.impl.TreeSetStringParquetSerialiser

HyperLogLogPlus

Properties class: com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

Predicates:

Aggregators:

To Bytes Serialisers:

uk.gov.gchq.gaffer.sketches.serialisation.HyperLogLogPlusSerialiser

Other Serialisers:

RoaringBitmap

Properties class: org.roaringbitmap.RoaringBitmap

Predicates:

Aggregators:

To Bytes Serialisers:

uk.gov.gchq.gaffer.bitmap.serialisation.RoaringBitmapSerialiser

DoublesUnion

The code for this example is DoublesUnion.

This example demonstrates how the DoublesUnion sketch from the Data Sketches library can be used to maintain estimates of the quantiles of a distribution of doubles. Suppose that every time an edge is observed, there is a double value associated with it, for example a value between 0 and 1 giving the score of the edge. Instead of storing a property that contains all the doubles observed, we can store a DoublesUnion which will allow us to estimate the median double, the 99th percentile, etc.

Properties class: com.yahoo.sketches.quantiles.DoublesUnion

Predicates:

Aggregators:

To Bytes Serialisers:

uk.gov.gchq.gaffer.sketches.datasketches.quantiles.serialisation.DoublesUnionSerialiser

Elements schema

This is our new elements schema. The edge has a property called 'doublesUnion'. This will store the DoublesUnion object.

{
  "edges": {
    "red": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false",
      "properties": {
        "doublesUnion": "doubles.union"
      }
    }
  }
}

Types schema

We have added a new type - 'doubles.union'. This is a com.yahoo.sketches.quantiles.DoublesUnion object. We also added in the serialiser and aggregator for the DoublesUnion object. Gaffer will automatically aggregate these sketches, using the provided aggregator, so they will keep up to date as new edges are added to the graph.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "doubles.union": {
      "class": "com.yahoo.sketches.quantiles.DoublesUnion",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.quantiles.binaryoperator.DoublesUnionAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.quantiles.serialisation.DoublesUnionSerialiser"
      }
    },
    "false": {
      "class": "java.lang.Boolean",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.IsFalse"
        }
      ]
    }
  }
}

Only one edge is in the graph. Its 'doublesUnion' property has been updated with 1,000 double values (the N field in the summary below). Here is the Edge:

Edge[source=A,destination=B,directed=false,group=red,properties=Properties[doublesUnion=<com.yahoo.sketches.quantiles.DoublesUnionImpl>
### Quantiles DoublesUnionImpl
   maxK                         : 128
### Quantiles HeapUpdateDoublesSketch SUMMARY: 
   Empty                        : false
   Direct, Capacity bytes       : false, 
   Estimation Mode              : true
   K                            : 128
   N                            : 1,000
   Levels (Needed, Total, Valid): 2, 2, 2
   Level Bit Pattern            : 11
   BaseBufferCount              : 232
   Combined Buffer Capacity     : 512
   Retained Items               : 488
   Compact Storage Bytes        : 3,936
   Updatable Storage Bytes      : 4,128
   Normalized Rank Error        : 1.725%
   Min Value                    : -3.148
   Max Value                    : 3.112
### END SKETCH SUMMARY
]]

This is not very illuminating as this just shows the default toString() method on the sketch. To get value from it we need to call methods on the DoublesUnion object. We can get an estimate for the 25th, 50th and 75th percentiles on edge A-B using the following code:

final GetElements query = new GetElements.Builder()
        .input(new EdgeSeed("A", "B", DirectedType.UNDIRECTED))
        .build();
final CloseableIterable<? extends Element> edges = graph.execute(query, user);
final Element edge = edges.iterator().next();
final com.yahoo.sketches.quantiles.DoublesUnion doublesUnion = (com.yahoo.sketches.quantiles.DoublesUnion) edge.getProperty("doublesUnion");
final double[] quantiles = doublesUnion.getResult().getQuantiles(new double[]{0.25D, 0.5D, 0.75D});
final String quantilesEstimate = "Edge A-B with percentiles of double property - 25th percentile: " + quantiles[0]
        + ", 50th percentile: " + quantiles[1]
        + ", 75th percentile: " + quantiles[2];

The results are as follows. This means that 25% of all the doubles on edge A-B had a value less than -0.66, 50% had a value less than -0.01 and 75% had a value less than 0.64. The estimation is not deterministic, so there may be small differences between the values below and those just quoted.

Edge A-B with percentiles of double property - 25th percentile: -0.6630847714290219, 50th percentile: -0.0071624422787210824, 75th percentile: 0.6341803995604817
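
To make the meaning of these quantile estimates concrete, here is a minimal plain-Java sketch that computes exact quantiles of a small sample by sorting it. It does not use Gaffer or the Data Sketches library: the class ExactQuantiles, its quantile method and the sample scores are all hypothetical, and the indexing rule is just one simple convention. A DoublesUnion answers the same kind of question approximately, without retaining every value.

```java
import java.util.Arrays;

public final class ExactQuantiles {

    // Returns the element found at the given fraction (0.0 to 1.0) of the sorted
    // data, using the simple rule index = floor(fraction * (n - 1)).
    public static double quantile(final double[] values, final double fraction) {
        if (values.length == 0 || fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("need data and a fraction in [0, 1]");
        }
        final double[] sorted = values.clone();
        Arrays.sort(sorted);
        final int index = (int) Math.floor(fraction * (sorted.length - 1));
        return sorted[index];
    }

    public static void main(final String[] args) {
        // Hypothetical edge scores, not the data used in the walkthrough.
        final double[] scores = {0.9, 0.1, 0.5, 0.3, 0.7};
        System.out.println("25th percentile: " + quantile(scores, 0.25));
        System.out.println("50th percentile: " + quantile(scores, 0.5));
        System.out.println("75th percentile: " + quantile(scores, 0.75));
    }
}
```

The exact approach needs memory proportional to the number of values, which is the cost the sketch avoids in exchange for a bounded rank error.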

We can also get the cumulative distribution function (CDF) of the doubles, evaluated at a set of split points:

final GetElements query2 = new GetElements.Builder()
        .input(new EdgeSeed("A", "B", DirectedType.UNDIRECTED))
        .build();
final CloseableIterable<? extends Element> edges2 = graph.execute(query2, user);
final Element edge2 = edges2.iterator().next();
final DoublesSketch doublesSketch2 = ((com.yahoo.sketches.quantiles.DoublesUnion) edge2.getProperty("doublesUnion")).getResult();
final double[] cdf = doublesSketch2.getCDF(new double[]{0.0D, 1.0D, 2.0D});
final String cdfEstimate = "Edge A-B with CDF values at 0: " + cdf[0]
        + ", at 1: " + cdf[1]
        + ", at 2: " + cdf[2];

The results are:

Edge A-B with CDF values at 0: 0.506, at 1: 0.839, at 2: 0.983
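
As a rough illustration of what each CDF value means, the following plain-Java sketch computes the exact fraction of a sample that lies strictly below a split point. The class EmpiricalCdf, its method and the sample values are hypothetical and independent of the Data Sketches API; the sketch stored on the edge estimates the same fraction from a compact summary.

```java
public final class EmpiricalCdf {

    // Fraction of the values that are strictly less than the split point.
    public static double fractionBelow(final double[] values, final double splitPoint) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        int below = 0;
        for (final double value : values) {
            if (value < splitPoint) {
                below++;
            }
        }
        return (double) below / values.length;
    }

    public static void main(final String[] args) {
        // Hypothetical sample, not the data used in the walkthrough.
        final double[] sample = {-1.0, -0.5, 0.5, 1.5};
        for (final double splitPoint : new double[]{0.0, 1.0, 2.0}) {
            System.out.println("CDF at " + splitPoint + ": " + fractionBelow(sample, splitPoint));
        }
    }
}
```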

LongsSketch

The code for this example is LongsSketch.

This example demonstrates how the LongsSketch sketch from the Data Sketches library can be used to maintain estimates of the frequencies of longs stored on vertices and edges. For example, suppose every time an edge is observed there is a long value associated with it which specifies the size of the interaction. Storing all the different longs on the edge could be expensive in storage. Instead we can use a LongsSketch which will give us approximate counts of the number of times a particular long was observed.

Properties class: com.yahoo.sketches.frequencies.LongsSketch

Predicates:

Aggregators:

To Bytes Serialisers:

uk.gov.gchq.gaffer.sketches.datasketches.frequencies.serialisation.LongsSketchSerialiser

Elements schema

This is our new elements schema. The edge has a property called 'longsSketch'. This will store the LongsSketch object.

{
  "edges": {
    "red": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false",
      "properties": {
        "longsSketch": "longs.sketch"
      }
    }
  }
}

Types schema

We have added a new type - 'longs.sketch'. This is a com.yahoo.sketches.frequencies.LongsSketch object. We also added in the serialiser and aggregator for the LongsSketch object. Gaffer will automatically aggregate these sketches, using the provided aggregator, so they will keep up to date as new edges are added to the graph.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "longs.sketch": {
      "class": "com.yahoo.sketches.frequencies.LongsSketch",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.frequencies.binaryoperator.LongsSketchAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.frequencies.serialisation.LongsSketchSerialiser"
      }
    },
    "false": {
      "class": "java.lang.Boolean",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.IsFalse"
        }
      ]
    }
  }
}

Only one edge is in the graph. This was added 1000 times, and each time it had the 'longsSketch' property containing a randomly generated long between 0 and 9 (inclusive). The sketch does not retain every individual occurrence of these long values, but allows one to estimate the number of occurrences of the different values. Here is the Edge:

Edge[source=A,destination=B,directed=false,group=red,properties=Properties[longsSketch=<com.yahoo.sketches.frequencies.LongsSketch>FrequentLongsSketch:
  Stream Length    : 1000
  Max Error Offset : 0
ReversePurgeLongHashMap:
         Index:     States              Values Keys
             0:          1                 112 0
             3:          1                  96 6
             5:          1                  98 9
             6:          2                  92 4
             7:          3                 103 5
             8:          2                  91 2
             9:          3                  98 8
            12:          1                 106 1
            13:          1                  99 7
            14:          1                 105 3
]]

This is not very illuminating as this just shows the default toString() method on the sketch. To get value from it we need to call methods on the LongsSketch object. We can get estimates of the frequencies of the values 1 and 9 on edge A-B using the following code:

final GetElements query = new GetElements.Builder()
        .input(new EdgeSeed("A", "B", DirectedType.UNDIRECTED))
        .build();
final CloseableIterable<? extends Element> edges = graph.execute(query, user);
final Element edge = edges.iterator().next();
final com.yahoo.sketches.frequencies.LongsSketch longsSketch = (com.yahoo.sketches.frequencies.LongsSketch) edge.getProperty("longsSketch");
final String estimates = "Edge A-B: 1L seen approximately " + longsSketch.getEstimate(1L)
        + " times, 9L seen approximately " + longsSketch.getEstimate(9L) + " times.";

The results are as follows. As the edge was added 1000 times, each time with a long randomly sampled from 0 to 9, each value occurs approximately 100 times.

Edge A-B: 1L seen approximately 106 times, 9L seen approximately 98 times.
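
For comparison, here is a minimal plain-Java sketch of the exact computation that the LongsSketch approximates: counting how often each long occurs. It keeps one counter per distinct value, which is what becomes expensive when there are many distinct values. The class ExactLongFrequencies, its methods and the random seed are hypothetical and are not part of Gaffer or the Data Sketches library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public final class ExactLongFrequencies {

    // Counts the occurrences of every distinct value (exact, one entry per value).
    public static Map<Long, Long> countOccurrences(final long[] values) {
        final Map<Long, Long> counts = new HashMap<>();
        for (final long value : values) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    // Generates n longs drawn uniformly from 0 to 9 (inclusive).
    public static long[] randomLongs(final int n, final long seed) {
        final Random random = new Random(seed);
        final long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = random.nextInt(10);
        }
        return values;
    }

    public static void main(final String[] args) {
        final Map<Long, Long> counts = countOccurrences(randomLongs(1000, 42L));
        System.out.println("1L seen " + counts.getOrDefault(1L, 0L) + " times");
        System.out.println("9L seen " + counts.getOrDefault(9L, 0L) + " times");
    }
}
```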

Union

The code for this example is UnionSketch.

This example demonstrates how the Union sketch from the Data Sketches library can be used to maintain estimates of the cardinalities of sets. The Union sketch is similar to a HyperLogLogPlusPlus, but it can also be used to create the intersections of sets. We give an example of how this can be used to monitor the changes to the number of edges in the graph over time.

Properties class: com.yahoo.sketches.theta.Union

Predicates:

Aggregators:

To Bytes Serialisers:

uk.gov.gchq.gaffer.sketches.datasketches.theta.serialisation.UnionSerialiser

Elements schema

This is our new elements schema. The edge has properties called 'startDate' and 'endDate'. These will be set to the midnight before the time of the occurrence of the edge and to the midnight after it. The entity has the same two date properties, plus a 'size' property which will be a Union. Both elements are aggregated over the 'groupBy' properties startDate and endDate.

{
  "entities": {
    "size": {
      "vertex": "vertex.string",
      "properties": {
        "startDate": "date.earliest",
        "endDate": "date.latest",
        "size": "union"
      },
      "groupBy": [
        "startDate",
        "endDate"
      ]
    }
  },
  "edges": {
    "red": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false",
      "properties": {
        "startDate": "date.earliest",
        "endDate": "date.latest",
        "count": "long.count"
      },
      "groupBy": [
        "startDate",
        "endDate"
      ]
    }
  }
}

Types schema

We have added a new type - 'union'. This is a com.yahoo.sketches.theta.Union object. We also added in the serialiser and aggregator for the Union object. Gaffer will automatically aggregate these sketches, using the provided aggregator, so they will keep up to date as new edges are added to the graph.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "date.earliest": {
      "class": "java.util.Date",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ],
      "aggregateFunction": {
        "class": "uk.gov.gchq.koryphe.impl.binaryoperator.Min"
      }
    },
    "date.latest": {
      "class": "java.util.Date",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ],
      "aggregateFunction": {
        "class": "uk.gov.gchq.koryphe.impl.binaryoperator.Max"
      }
    },
    "long.count": {
      "class": "java.lang.Long",
      "aggregateFunction": {
        "class": "uk.gov.gchq.koryphe.impl.binaryoperator.Sum"
      }
    },
    "union": {
      "class": "com.yahoo.sketches.theta.Union",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.theta.binaryoperator.UnionAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.theta.serialisation.UnionSerialiser"
      }
    },
    "false": {
      "class": "java.lang.Boolean",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.IsFalse"
        }
      ]
    }
  }
}
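
The effect of the 'groupBy' properties together with an aggregator such as Sum can be pictured with a small plain-Java sketch: observations that share the same startDate and endDate are merged into one entry and their counts are added. This is only an illustration of the idea under simplified assumptions; the class GroupByAggregation, the Observation type and the sample dates are hypothetical, and Gaffer's real aggregation also takes the element group and vertices into account.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class GroupByAggregation {

    // One observed element: the two group-by dates and a count to be summed.
    public static final class Observation {
        final LocalDate startDate;
        final LocalDate endDate;
        final long count;

        public Observation(final LocalDate startDate, final LocalDate endDate, final long count) {
            this.startDate = startDate;
            this.endDate = endDate;
            this.count = count;
        }
    }

    // Merges observations with equal (startDate, endDate), summing their counts.
    public static Map<List<LocalDate>, Long> aggregateCounts(final List<Observation> observations) {
        final Map<List<LocalDate>, Long> totals = new LinkedHashMap<>();
        for (final Observation o : observations) {
            totals.merge(Arrays.asList(o.startDate, o.endDate), o.count, Long::sum);
        }
        return totals;
    }

    public static void main(final String[] args) {
        final LocalDate d9 = LocalDate.of(2017, 1, 9);
        final LocalDate d10 = LocalDate.of(2017, 1, 10);
        final LocalDate d11 = LocalDate.of(2017, 1, 11);
        final List<Observation> observations = Arrays.asList(
                new Observation(d9, d10, 1L),
                new Observation(d9, d10, 1L),
                new Observation(d10, d11, 1L));
        // Two groups remain: one per distinct (startDate, endDate) pair.
        System.out.println(aggregateCounts(observations));
    }
}
```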

1000 different edges were added to the graph for the day 9 January 2017 (i.e. the startDate was the midnight at the start of the 9th, and the endDate was the midnight at the end of the 9th). For each edge, an Entity was created, with a vertex called "graph". This contained a Union object to which a string consisting of the source and destination was added. 500 edges were added to the graph for the day 10 January 2017. Of these, 250 were the same as edges that had been added on the previous day, but 250 were new. Again, for each edge, an Entity was created for the vertex called "graph".

Here are the Entities for the two days:

Entity[vertex=graph,group=size,properties=Properties[size=<com.yahoo.sketches.theta.UnionImpl>com.yahoo.sketches.theta.UnionImpl@1d75e7af,endDate=<java.util.Date>Tue Jan 10 00:00:00 GMT 2017,startDate=<java.util.Date>Mon Jan 09 00:00:00 GMT 2017]]
Entity[vertex=graph,group=size,properties=Properties[size=<com.yahoo.sketches.theta.UnionImpl>com.yahoo.sketches.theta.UnionImpl@34b27915,endDate=<java.util.Date>Wed Jan 11 00:00:00 GMT 2017,startDate=<java.util.Date>Tue Jan 10 00:00:00 GMT 2017]]

This is not very illuminating as this just shows the default toString() method on the sketch. To get value from it we need to call a method on the Union object:

final GetAllElements getAllEntities2 = new GetAllElements.Builder()
        .view(new View.Builder()
                .entity("size")
                .build())
        .build();
final CloseableIterable<? extends Element> allEntities2 = graph.execute(getAllEntities2, user);
final CloseableIterator<? extends Element> it = allEntities2.iterator();
final Element entityDay1 = it.next();
final CompactSketch sketchDay1 = ((Union) entityDay1.getProperty("size")).getResult();
final Element entityDay2 = it.next();
final CompactSketch sketchDay2 = ((Union) entityDay2.getProperty("size")).getResult();
final double estimateDay1 = sketchDay1.getEstimate();
final double estimateDay2 = sketchDay2.getEstimate();

The result is:

1000.0
500.0

Now we can get an estimate for the number of edges in common across the two days:

final Intersection intersection = Sketches.setOperationBuilder().buildIntersection();
intersection.update(sketchDay1);
intersection.update(sketchDay2);
final double intersectionSizeEstimate = intersection.getResult().getEstimate();

The result is:

250.0

We now get an estimate for the number of edges in total across the two days, by simply aggregating over all the properties:

final GetAllElements getAllEntities = new GetAllElements.Builder()
        .view(new View.Builder()
                .entity("size", new ViewElementDefinition.Builder()
                        .groupBy() // set the group by properties to 'none'
                        .build())
                .build())
        .build();
final CloseableIterable<? extends Element> allEntities = graph.execute(getAllEntities, user);
final Element entity = allEntities.iterator().next();
final double unionSizeEstimate = ((Union) entity.getProperty("size")).getResult().getEstimate();

The result is:

1250.0
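
The three figures above (250 edges in common, 1250 in total, from days of 1000 and 500 edges) can be checked with exact sets. The following plain-Java sketch builds two hypothetical sets of edge identifiers with the same overlap and computes the exact intersection and union sizes. The class ExactEdgeSets and the identifier format are assumptions made for illustration; the theta sketches estimate these sizes without storing the identifiers.

```java
import java.util.HashSet;
import java.util.Set;

public final class ExactEdgeSets {

    // Builds hypothetical edge identifiers for indices in [from, to).
    public static Set<String> edges(final int from, final int to) {
        final Set<String> result = new HashSet<>();
        for (int i = from; i < to; i++) {
            result.add("A" + i + "-B" + i);
        }
        return result;
    }

    public static int intersectionSize(final Set<String> a, final Set<String> b) {
        final Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static int unionSize(final Set<String> a, final Set<String> b) {
        final Set<String> all = new HashSet<>(a);
        all.addAll(b);
        return all.size();
    }

    public static void main(final String[] args) {
        final Set<String> day1 = edges(0, 1000);    // 1000 edges on the first day
        final Set<String> day2 = edges(750, 1250);  // 500 edges, 250 already seen
        System.out.println("In common: " + intersectionSize(day1, day2));
        System.out.println("In total: " + unionSize(day1, day2));
    }
}
```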

ReservoirItemsUnion

The code for this example is ReservoirItemsUnion.

This example demonstrates how the ReservoirItemsUnion sketch from the Data Sketches library can be used to maintain estimates of properties on vertices and edges. The ReservoirItemsUnion sketch allows a sample of a set of strings to be maintained. We give two examples of this. The first is when each observation of an edge has a string property associated with it, and that string takes a lot of different values. We may not want to store all the different values of the string, but we may want to see a random sample of them. The second example is to store on an Entity a sketch which gives a sample of the vertices that are connected to the vertex. Even if we are storing all the edges, producing a random sample of the vertices attached to a vertex may not be quick. For example, if a vertex has degree 10,000 then producing a random sample of 10 neighbours would require scanning all the edges. Storing the sketch on the Entity means that the sample is precomputed and can be returned without scanning the edges.

Properties class: com.yahoo.sketches.sampling.ReservoirItemsUnion

Predicates:

Aggregators:

To Bytes Serialisers:

Elements schema

This is our new elements schema. The edge has a property called 'stringsSample'. This will store the ReservoirItemsUnion object. The entity has a property called 'neighboursSample'. This will also store a ReservoirItemsUnion object.

{
  "entities": {
    "blueEntity": {
      "vertex": "vertex.string",
      "properties": {
        "neighboursSample": "reservoir.strings.union"
      }
    }
  },
  "edges": {
    "red": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false",
      "properties": {
        "stringsSample": "reservoir.strings.union"
      }
    },
    "blue": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false"
    }
  }
}

Types schema

We have added a new type - 'reservoir.strings.union'. This is a com.yahoo.sketches.sampling.ReservoirItemsUnion object. We also added in the serialiser and aggregator for the ReservoirItemsUnion object. Gaffer will automatically aggregate these sketches, using the provided aggregator, so they will keep up to date as new edges are added to the graph.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "reservoir.strings.union": {
      "class": "com.yahoo.sketches.sampling.ReservoirItemsUnion",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.sampling.binaryoperator.ReservoirItemsUnionAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.sketches.datasketches.sampling.serialisation.ReservoirStringsUnionSerialiser"
      }
    },
    "false": {
      "class": "java.lang.Boolean",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.IsFalse"
        }
      ]
    }
  }
}

An edge A-B of group "red" was added to the graph 1000 times. Each time it had the stringsSample property containing a randomly generated string. Here is the edge:

Edge[source=A,destination=B,directed=false,group=red,properties=Properties[stringsSample=<com.yahoo.sketches.sampling.ReservoirItemsUnion>
### ReservoirItemsUnion SUMMARY: 
   Max k: 20
   Gadget summary: 
### ReservoirItemsSketch SUMMARY: 
   k            : 20
   n            : 1000
   Current size : 20
   Resize factor: X8
### END SKETCH SUMMARY
### END UNION SUMMARY
]]

This is not very illuminating as this just shows the default toString() method on the sketch. To get value from it we need to call a method on the ReservoirItemsUnion object:

final GetElements query = new GetElements.Builder()
        .input(new EdgeSeed("A", "B", DirectedType.UNDIRECTED))
        .build();
final CloseableIterable<? extends Element> edges = graph.execute(query, user);
final Element edge = edges.iterator().next();
final ReservoirItemsSketch<String> stringsSketch = ((com.yahoo.sketches.sampling.ReservoirItemsUnion) edge.getProperty("stringsSample"))
        .getResult();
final String[] samples = stringsSketch.getSamples();
final StringBuilder sb = new StringBuilder("10 samples: ");
for (int i = 0; i < 10 && i < samples.length; i++) {
    if (i > 0) {
        sb.append(", ");
    }
    sb.append(samples[i]);
}

The results contain a random sample of the strings added to the edge:

10 samples: BIBFBBIDCJ, JIACFBDHAH, DJJDEDAFDH, HEGGBJDBHG, FGJJDFEBAG, IHFIGAJHJI, BJICHHAFFE, JAIJDCFDHD, BJHBGHBGHH, ACHCDCJFGE
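
The idea behind keeping a fixed-size random sample of a stream can be sketched in plain Java with classic reservoir sampling (Algorithm R). This is an illustration of the general technique only, not the implementation used by ReservoirItemsUnion, which additionally supports merging sketches. The class ReservoirSample, its method and the seed are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public final class ReservoirSample {

    // Returns a uniform random sample of at most k items from the stream.
    public static List<String> sample(final Iterable<String> items, final int k, final Random random) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        final List<String> reservoir = new ArrayList<>(k);
        long seen = 0;
        for (final String item : items) {
            seen++;
            if (reservoir.size() < k) {
                // The first k items always fill the reservoir.
                reservoir.add(item);
            } else {
                // Keep the new item with probability k / seen, replacing a random slot.
                final long slot = (long) (random.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(final String[] args) {
        final List<String> neighbours = new ArrayList<>();
        for (int i = 0; i < 500; i++) {
            neighbours.add("Y" + i);
        }
        System.out.println(sample(neighbours, 10, new Random(7L)));
    }
}
```

Only k items are held in memory at any time, however long the stream is, which is why the sample can be kept on an element and returned without scanning the edges.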

500 edges of group "blue" were also added to the graph (edges X-Y0, X-Y1, ..., X-Y499). For each of these edges, an Entity was created for both the source and destination. Each Entity contained a 'neighboursSample' property that contains the vertex at the other end of the edge. We now get the Entity for the vertex X and display the sample of its neighbours:

final GetElements query2 = new GetElements.Builder()
        .input(new EntitySeed("X"))
        .build();
final CloseableIterable<? extends Element> entities = graph.execute(query2, user);
final Element entity = entities.iterator().next();
final ReservoirItemsSketch<String> neighboursSketch = ((com.yahoo.sketches.sampling.ReservoirItemsUnion) entity.getProperty("neighboursSample"))
        .getResult();
final String[] neighboursSample = neighboursSketch.getSamples();
sb.setLength(0);
sb.append("10 samples: ");
for (int i = 0; i < 10 && i < neighboursSample.length; i++) {
    if (i > 0) {
        sb.append(", ");
    }
    sb.append(neighboursSample[i]);
}

The results are:

10 samples: Y462, Y2, Y319, Y194, Y142, Y457, Y449, Y470, Y467, Y444

RBMBackedTimestampSet

The code for this example is TimestampSet.

This example demonstrates how the TimestampSet property can be used to maintain a set of the timestamps at which an element was seen active. In this example we record the timestamps to minute level accuracy, i.e. the seconds are ignored.

Properties class: uk.gov.gchq.gaffer.time.RBMBackedTimestampSet

Predicates:

Aggregators:

To Bytes Serialisers:

uk.gov.gchq.gaffer.time.serialisation.RBMBackedTimestampSetSerialiser

Elements schema

This is our new elements schema. The edge has a property called 'timestampSet'. This will store the TimestampSet object, which is actually a 'RBMBackedTimestampSet'.

{
  "edges": {
    "red": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false",
      "properties": {
        "timestampSet": "timestamp.set"
      }
    }
  }
}

Types schema

We have added a new type - 'timestamp.set'. This is a uk.gov.gchq.gaffer.time.RBMBackedTimestampSet object. We also added in the serialiser and aggregator for the RBMBackedTimestampSet object. Gaffer will automatically aggregate these sets together to maintain a set of all the times the element was active.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "timestamp.set": {
      "class": "uk.gov.gchq.gaffer.time.RBMBackedTimestampSet",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.time.binaryoperator.RBMBackedTimestampSetAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.time.serialisation.RBMBackedTimestampSetSerialiser"
      }
    },
    "false": {
      "class": "java.lang.Boolean",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.IsFalse"
        }
      ]
    }
  }
}

Only one edge is in the graph. This was added 25 times, and each time it had the 'timestampSet' property containing a randomly generated timestamp from 2017. Here is the Edge:

Edge[source=A,destination=B,directed=false,group=red,properties=Properties[timestampSet=<uk.gov.gchq.gaffer.time.RBMBackedTimestampSet>RBMBackedTimestampSet[timeBucket=MINUTE,timestamps=2017-01-08T07:29:00Z,2017-01-18T10:41:00Z,2017-01-19T01:36:00Z,2017-01-31T16:16:00Z,2017-02-02T08:06:00Z,2017-02-12T14:21:00Z,2017-02-15T22:01:00Z,2017-03-06T09:03:00Z,2017-03-21T18:09:00Z,2017-05-08T15:34:00Z,2017-05-10T19:39:00Z,2017-05-16T10:44:00Z,2017-05-23T10:02:00Z,2017-05-28T01:52:00Z,2017-06-24T23:50:00Z,2017-07-27T09:34:00Z,2017-08-05T02:11:00Z,2017-09-07T07:35:00Z,2017-10-01T12:52:00Z,2017-10-23T22:02:00Z,2017-10-27T04:12:00Z,2017-11-01T02:45:00Z,2017-12-11T16:38:00Z,2017-12-22T14:40:00Z,2017-12-24T08:00:00Z]]]

You can see the list of timestamps on the edge. We can also get just the earliest, latest and total number of timestamps using methods on the TimestampSet object to get the following results:

Edge A-B was first seen at 2017-01-08T07:29:00Z, last seen at 2017-12-24T08:00:00Z, and there were 25 timestamps it was active.
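
As a rough model of what such a set does, the following plain-Java sketch truncates each timestamp to the minute, stores the distinct results in sorted order, and reports the earliest, latest and number of timestamps. It is not Gaffer's RBMBackedTimestampSet: the class MinuteTimestampSet and its method names are hypothetical and chosen only for this illustration.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.SortedSet;
import java.util.TreeSet;

public final class MinuteTimestampSet {

    private final SortedSet<Instant> timestamps = new TreeSet<>();

    // Adds a timestamp, ignoring seconds and smaller units.
    public void add(final Instant instant) {
        timestamps.add(instant.truncatedTo(ChronoUnit.MINUTES));
    }

    public Instant getEarliest() {
        return timestamps.first();
    }

    public Instant getLatest() {
        return timestamps.last();
    }

    public int size() {
        return timestamps.size();
    }

    public static void main(final String[] args) {
        final MinuteTimestampSet set = new MinuteTimestampSet();
        set.add(Instant.parse("2017-01-08T07:29:45Z"));
        set.add(Instant.parse("2017-01-08T07:29:10Z")); // same minute as above
        set.add(Instant.parse("2017-12-24T08:00:30Z"));
        System.out.println("First seen at " + set.getEarliest()
                + ", last seen at " + set.getLatest()
                + ", " + set.size() + " distinct timestamps");
    }
}
```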

BoundedTimestampSet

The code for this example is BoundedTimestampSet.

This example demonstrates how the BoundedTimestampSet property can be used to maintain a set of the timestamps at which an element was seen active. If this set becomes larger than a size specified by the user then a uniform random sample of the timestamps is maintained. In this example we record the timestamps to minute level accuracy, i.e. the seconds are ignored, and specify that at most 25 timestamps should be retained.

Properties class: uk.gov.gchq.gaffer.time.BoundedTimestampSet

Predicates:

Aggregators:

To Bytes Serialisers:

uk.gov.gchq.gaffer.time.serialisation.BoundedTimestampSetSerialiser

Elements schema

This is our new elements schema. The edge has a property called 'boundedTimestampSet'. This will store the BoundedTimestampSet object.

{
  "edges": {
    "red": {
      "source": "vertex.string",
      "destination": "vertex.string",
      "directed": "false",
      "properties": {
        "boundedTimestampSet": "bounded.timestamp.set"
      }
    }
  }
}

Types schema

We have added a new type - 'bounded.timestamp.set'. This is a uk.gov.gchq.gaffer.time.BoundedTimestampSet object. We have added in the serialiser and aggregator for the BoundedTimestampSet object. Gaffer will automatically aggregate these sets together to maintain a set of all the times the element was active. Once the size of the set becomes larger than 25 then a uniform random sample of size at most 25 of the timestamps is maintained.

{
  "types": {
    "vertex.string": {
      "class": "java.lang.String",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.Exists"
        }
      ]
    },
    "bounded.timestamp.set": {
      "class": "uk.gov.gchq.gaffer.time.BoundedTimestampSet",
      "aggregateFunction": {
        "class": "uk.gov.gchq.gaffer.time.binaryoperator.BoundedTimestampSetAggregator"
      },
      "serialiser": {
        "class": "uk.gov.gchq.gaffer.time.serialisation.BoundedTimestampSetSerialiser"
      }
    },
    "false": {
      "class": "java.lang.Boolean",
      "validateFunctions": [
        {
          "class": "uk.gov.gchq.koryphe.impl.predicate.IsFalse"
        }
      ]
    }
  }
}

There are two edges in the graph. Edge A-B was added 3 times, and each time it had the 'boundedTimestampSet' property containing a randomly generated timestamp from 2017. Edge A-C was added 1000 times, and each time it also had the 'boundedTimestampSet' property containing a randomly generated timestamp from 2017. Here are the edges:

Edge[source=A,destination=B,directed=false,group=red,properties=Properties[boundedTimestampSet=<uk.gov.gchq.gaffer.time.BoundedTimestampSet>BoundedTimestampSet[timeBucket=MINUTE,state=NOT_FULL,maxSize=25,timestamps=2017-02-12T14:21:00Z,2017-03-21T18:09:00Z,2017-12-24T08:00:00Z]]]
Edge[source=A,destination=C,directed=false,group=red,properties=Properties[boundedTimestampSet=<uk.gov.gchq.gaffer.time.BoundedTimestampSet>BoundedTimestampSet[timeBucket=MINUTE,state=SAMPLE,maxSize=25,timestamps=2017-03-12T05:27:00Z,2017-03-12T19:14:00Z,2017-03-20T06:52:00Z,2017-04-06T13:29:00Z,2017-04-20T15:20:00Z,2017-04-22T18:37:00Z,2017-04-28T23:45:00Z,2017-05-02T03:42:00Z,2017-05-25T04:20:00Z,2017-05-25T19:45:00Z,2017-06-22T17:04:00Z,2017-06-27T06:10:00Z,2017-07-20T20:25:00Z,2017-07-26T10:39:00Z,2017-08-01T18:58:00Z,2017-08-28T08:08:00Z,2017-09-01T01:42:00Z,2017-10-14T12:54:00Z,2017-11-13T03:42:00Z,2017-11-30T23:18:00Z,2017-12-01T08:23:00Z,2017-12-09T17:50:00Z,2017-12-10T17:37:00Z,2017-12-13T12:03:00Z,2017-12-26T12:14:00Z]]]

You can see that edge A-B has the full list of timestamps on the edge, but edge A-C has a sample of the timestamps.
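
The switch from a full set to a sample can be modelled in plain Java as follows: timestamps are kept exactly until a new one would exceed the maximum size, after which reservoir sampling decides whether each new timestamp replaces a retained one. This is a simplified sketch under stated assumptions, not Gaffer's BoundedTimestampSet. The class BoundedMinuteTimestamps and its methods are hypothetical, and once sampling starts this version does not de-duplicate repeated timestamps.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.TreeSet;

public final class BoundedMinuteTimestamps {

    private final int maxSize;
    private final Random random;
    private final TreeSet<Instant> exact = new TreeSet<>();
    private final List<Instant> sample = new ArrayList<>();
    private boolean sampling;
    private long seen;

    public BoundedMinuteTimestamps(final int maxSize, final Random random) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be at least 1");
        }
        this.maxSize = maxSize;
        this.random = random;
    }

    public void add(final Instant instant) {
        final Instant bucket = instant.truncatedTo(ChronoUnit.MINUTES);
        if (!sampling) {
            if (exact.contains(bucket) || exact.size() < maxSize) {
                exact.add(bucket);
                return;
            }
            // The set is full and this timestamp is new: switch to sampling.
            sampling = true;
            sample.addAll(exact);
            seen = exact.size();
            exact.clear();
        }
        // Reservoir step: keep the new timestamp with probability maxSize / seen.
        seen++;
        final long slot = (long) (random.nextDouble() * seen);
        if (slot < maxSize) {
            sample.set((int) slot, bucket);
        }
    }

    public boolean isSample() {
        return sampling;
    }

    public int size() {
        return sampling ? sample.size() : exact.size();
    }

    public static void main(final String[] args) {
        final BoundedMinuteTimestamps set = new BoundedMinuteTimestamps(25, new Random(1L));
        final Instant start = Instant.parse("2017-01-01T00:00:00Z");
        for (int i = 0; i < 1000; i++) {
            set.add(start.plusSeconds(60L * i));
        }
        System.out.println("sample=" + set.isSample() + ", size=" + set.size());
    }
}
```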

p013570 commented 6 years ago

Merged into develop.

gchq / Gaffer