O2-Czech-Republic / proxima-platform

The Proxima platform.
Apache License 2.0
19 stars 7 forks source link
analytical-platform apache-beam apache-flink apache-spark batch-processing data-mesh iot-platform stream-processing unified-data-processing

Build Status sonar sonar Maven Version

The Proxima platform

The platform is a generic data ingestion, manipulation and retrieval framework. High level can be described by following scheme:

high-level scheme

Design document

High level design document can be found here.

Incomplete (under construction) documentation

Not yet complete documentation can be found here. The documentation should grow over over time to cover all the aspects of the platform. PRs welcome!

Scheme definition

First, let's introduce some glossary:

Example scheme definition

The scheme definition uses HOCON. As a short example we will show definition of data processing of a hypothetic e-commerce site. The site has some goods, some users and generates some events which describe how users interact with the goods. We will use protocol buffers for serialization.

First, let's define our data model. We will model the system which processes events coming from some source in given format and based on these events creates a model of user preferences.

 entities {
   # user entity, let's make this really simple
   user {
     attributes {

       # some details of user - e.g. name, email, ...
       details { scheme: "proto:cz.o2.proxima.example.Example.UserDetails" }

       # model of preferences based on events
       preferences { scheme: "proto:cz.o2.proxima.example.Example.UserPreferences" }

       # selected events are stored to user's history
       "event.*" { scheme: "proto:cz.o2.proxima.example.Example.BaseEvent" }

     }
   }
   # entity describing a single good we want to sell
   product {
     # note: we have to split to separate attributes each attribute that we want to be able
     # to update *independently*
     attributes {

       # price, with some possible additional information, like VAT and other stuff
       price { scheme: "proto:cz.o2.proxima.example.Example.Price" }

       # some general details of the product
       details { scheme: "proto:cz.o2.proxima.example.Example.ProductDetails" }

       # list of associated categories
       "category.*" { scheme: "proto:cz.o2.proxima.example.Example.ProductCategory" }

     }
   }

   # the events which link users to goods
   event {
     attributes {

       # the event is atomic entity with just a single attribute
       data { scheme: "proto:cz.o2.proxima.example.Example.BaseEvent" }

     }
   }

 }

Next, after defining our data model, we need to specify attribute families for our entities. This definition is highly dependent on the access pattern to the data. Mostly, we have to worry about how are we going to read our data. Relevant questions are:

Platform's data model

Generally, data are modelled as unbounded stream of updates to attributes of entities. Each update consists of the following:

Compiling scheme definition to access classes

The platform contains maven compiler of scheme specification to java access classes as follows:

      <plugin>
        <groupId>cz.o2.proxima</groupId>
        <artifactId>proxima-compiler-java-maven-plugin</artifactId>
        <version>0.14.0</version>
        <configuration>
          <outputDir>${project.build.directory}/generated-sources/model</outputDir>
          <javaPackage>cz.o2.proxima.testing.model</javaPackage>
          <className>Model</className>
          <config>${basedir}/src/main/resources/test-readme.conf</config>
        </configuration>
        <executions>
          <execution>
            <phase>generate-sources</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
        <dependencies>
          <!--
            Use direct data operator access, see later
          -->
          <dependency>
            <groupId>${project.groupId}</groupId>
            <artifactId>proxima-direct-compiler-plugin</artifactId>
            <version>0.14.0</version>
          </dependency>
          <!--
            The following dependencies define additional
            dependencies for this example
          -->
          <dependency>
            <groupId>${project.groupId}</groupId>
            <artifactId>proxima-core</artifactId>
            <version>${project.version}</version>
            <classifier>tests</classifier>
          </dependency>
          <dependency>
            <groupId>${project.groupId}</groupId>
            <artifactId>proxima-scheme-proto</artifactId>
            <version>0.14.0</version>
          </dependency>
          <dependency>
            <groupId>${project.groupId}</groupId>
            <artifactId>proxima-scheme-proto-testing</artifactId>
            <version>0.14.0</version>
          </dependency>
        </dependencies>
      </plugin>

This plugin then generates class cz.o2.proxima.testing.model.Model into target/generated-sources/model. The class can be instantiated via

   Model model = Model.of(ConfigFactory.defaultApplication());

or (in case of tests, where some validations and initializations are skipped)

   Model model = Model.ofTest(ConfigFactory.defaultApplication());

Platform's DataOperators

The platform offers various modes of access to data. As of version 0.14.0, these types are:

Apache Beam access to data

First, create BeamDataOperator as follows:

   BeamDataOperator operator = model.getRepo().getOrCreateOperator(BeamDataOperator.class);

Next, use this operator to create PCollection from Model.

   // some imports omitted, including these for clarity
   import org.apache.beam.sdk.Pipeline;
   import org.apache.beam.sdk.transforms.Count;
   import org.apache.beam.sdk.transforms.WithKeys;
   import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
   import org.apache.beam.sdk.transforms.windowing.FixedWindows;
   import org.apache.beam.sdk.transforms.windowing.Window;
   import org.apache.beam.sdk.values.KV;
   import org.apache.beam.sdk.values.PCollection;
   import org.joda.time.Duration;

   Pipeline pipeline = Pipeline.create();
   PCollection<StreamElement> input = operator.getStream(
       pipeline, Position.OLDEST, false, true,
       model.getEvent().getDataDescriptor());
   PCollection<KV<String, Long>> counted =
       input
           .apply(
               Window.<StreamElement>into(FixedWindows.of(Duration.standardMinutes(1)))
                   .triggering(AfterWatermark.pastEndOfWindow())
                   .discardingFiredPanes())
           .apply(
               WithKeys.of(
                   el ->
                       model
                           .getEvent()
                           .getDataDescriptor()
                           .valueOf(el)
                           .map(BaseEvent::getProductId)
                           .orElse("")))
           .apply(Count.perKey());

   // do something with the output

Online Java docs

Build notes

CI is run only against changed modules (and its dependents) in pull requests. To completely rebuild the whole project in a PR push a commit with commit message 'rebuild'. After the build, you can squash and remove the commit.