eclipse-edc / Technology-Gcp

Apache License 2.0
3 stars 6 forks source link

feat: BigQuery data plane #178

Closed man8pr closed 2 months ago

man8pr commented 4 months ago

What this PR changes/adds

With this PR, data exchange with BigQuery is possible, using the provider extension already available, and the present data plane extension.

Linked Issue(s)

Closes #160

man8pr commented 3 months ago

So far I've only looked at the BigQuerySinkServiceImpl where I found a significant amount of issues already, so I didn't see a point in continuing the review. Please find an overview below and inline in code.

I think the first step should be a written architectural design proposal document that outlines the key classes and objects as well as their functionality, etc. This could later serve as documentation, too. Once that is done and approved the implementation can begin.

General BigQuerySinkServiceImpl remarks:

  • don't use org.json.JSONArray, use existing EDC infrastructure, i.e. jakarta.JsonArray instead.
  • pls no second JSON implementation (GSON). I realize Google has its own, but we're using Jackson, so pls stick to that if possible.
  • general formatting issues: constructors go first, then come public methods, then protected, then private. Nested classes go last. Also, please don't cut off code at weird positions. Pls check out the EDC code style for IntelliJ
  • individual parts can be materialized in memory. If the structure of a JSON document is relevant, then (real) streaming is not possible anyway.
  • interfaces of injectable services should go in an spi package.
  • the Builder is not needed, since it's a service-type object, and has a small number of collaborator objects (i.e. CTor params). also, the Builder breaks our parameterless newInstance() paradigm.
  • do not use hardcoded numbers and magic strings, e.g. for timeouts, max repeats etc. Those should come from a configuration.
  • the pattern for reading config values is:

    • the extension class reads values using ServiceExtensionContext#getConfig()
    • if there are many config values, they should be bundled in a BigQueryClientConfiguration object
    • either the raw values, or the config object should be passed into the service class
  • there is virtually no error handling, especially error reporting to higher layers. The Sink only reports an error if an exception occurs, but transmission errors, or other SDK-originated errors are only logged to the monitor.
  • why is the BigQuerySinkService needed? it seems, it and the BigQueryDataSink do mostly the same thing. This is a hint at a flawed design. A better approach could be to have the sink delegate each part to the BigQuerySinkService and handle each resulting error individually
  • there is no unit test for the BigQuerySinkServiceImpl

I try to consolidate the source- and sink-related class to simplify the design. Re org.json.JSONArray , its use is required, because Google APIs are using it. While GSON can be avoided, I suggest to consider in future a different degree of freedom when using working in a tech repo, compared to the shared repos: it could be still reasonable to choose a different solution for a specific provider, when a tech-specific solution is available.