apache / incubator-graphar

An open source, standard data file format for graph data storage and retrieval.
https://graphar.apache.org/
Apache License 2.0
195 stars 40 forks

[POC]feat(format): Add format protocol definitions files with google protocol buffers #475

Closed acezen closed 3 weeks ago

acezen commented 1 month ago

Reason for this PR

As issue #73 describes, the C++, Java, and Scala libraries currently implement the format independently, and they may have some mismatches between the implementations. It would be better to standardize the format with protobuf and have the libraries rely on that single definition.

A standardized format definition would bring many benefits.

However, the changes may introduce breaking changes to the current public APIs if we adapt the implementations to the format protocol.
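As a rough illustration of the idea, a protobuf definition of the format metadata might look like the sketch below. The message and field names here are hypothetical, chosen for illustration only; they are not the actual GraphAr definitions proposed in this PR.

```protobuf
// Hypothetical sketch only -- message and field names are
// illustrative, not the actual GraphAr format definition.
syntax = "proto3";

package graphar.format;

// Supported physical file types for chunk storage.
enum FileType {
  FILE_TYPE_UNSPECIFIED = 0;
  FILE_TYPE_CSV = 1;
  FILE_TYPE_PARQUET = 2;
  FILE_TYPE_ORC = 3;
}

// A single property attached to a vertex or edge.
message Property {
  string name = 1;
  string data_type = 2;  // e.g. "int64", "string"
  bool is_primary = 3;
}

// Metadata describing one vertex type.
message VertexInfo {
  string type = 1;
  int64 chunk_size = 2;
  string prefix = 3;
  repeated Property properties = 4;
  FileType file_type = 5;
}
```

With a single `.proto` source of truth like this, the C++, Java, and Scala libraries would all generate their metadata classes from the same definition instead of re-implementing it by hand.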

What changes are included in this PR?

Are these changes tested?

yes

Are there any user-facing changes?

not yet

SemyonSinchenko commented 1 month ago

Based on my experience with Apache Spark itself, the protobuf way may be a painful story:

  • you need to incorporate it into CI;
  • the generated code is huge (for Java it may be thousands of LOC);
  • the generated code is unreadable;
  • debugging is very hard;
  • we need to decide whether we are going to store the generated code in git or not (there are pros and cons).

Btw, my first question is: are we able to use buf (https://buf.build/docs/introduction)? Apache Spark itself uses buf because it significantly simplifies the process, and I guess there shouldn't be a conflict with ASF rules. If we can, I would recommend creating buf.work.yaml and buf.gen.yaml. You may check the examples in Apache Spark (https://github.com/apache/spark/tree/master/connector/connect/common/src/main), or I can do it, since I have some experience with it.
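For reference, a minimal pair of buf config files modeled on Spark's layout might look like the sketch below. The directory names and generator targets are assumptions for illustration, not a tested setup for this repository.

```yaml
# buf.work.yaml -- workspace root, pointing at the proto directory
# (the "format" directory name is an assumption for illustration)
version: v1
directories:
  - format
---
# buf.gen.yaml -- code generation targets for each language binding
version: v1
plugins:
  - plugin: buf.build/protocolbuffers/cpp
    out: cpp/src/generated
  - plugin: buf.build/protocolbuffers/java
    out: java/src/main/java
  - plugin: buf.build/protocolbuffers/python
    out: python/graphar/proto
```

Running `buf generate` at the workspace root would then regenerate all bindings in one step, which is the simplification Spark gets from buf.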

My second question: are we going to store the generated code in git or not? Apache Spark uses a mixed approach, where JVM classes are generated on the fly (via the protobuf Maven plugin) but the generated Python files are stored in git. For Python they have a cool solution: https://github.com/apache/spark/blob/master/dev/connect-gen-protos.sh, an auto-formatter for the generated code that keeps the diffs in git PRs readable.
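One piece of such a post-processing script can be sketched in plain Python: if generated files are stored in git, each regeneration should produce a stable, reviewable diff, so the script normalizes every generated file the same way on every run. The function below is a hypothetical example of one such normalization step (prepending the ASF license header when it is missing); it is not Spark's actual script, which also runs a code formatter.

```python
# Hypothetical post-processing step for generated Python files:
# prepend the ASF license header when it is missing, so that
# regenerating the code yields stable, reviewable diffs in git.

ASF_HEADER = (
    "# Licensed to the Apache Software Foundation (ASF) under one\n"
    "# or more contributor license agreements.  See the NOTICE file\n"
    "# distributed with this work for additional information.\n"
)


def add_license_header(source: str, header: str = ASF_HEADER) -> str:
    """Return `source` with `header` prepended unless already present."""
    if source.startswith(header):
        return source  # idempotent: re-running the script changes nothing
    return header + "\n" + source


if __name__ == "__main__":
    generated = "from google.protobuf import descriptor as _descriptor\n"
    once = add_license_header(generated)
    assert once.startswith(ASF_HEADER)
    assert add_license_header(once) == once  # stable under re-runs
    print("ok")
```

Because the step is idempotent, running the generation script twice in a row produces no diff, which is exactly the property that makes committed generated code tolerable to review.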

Based on my experience, Data Science / Network Analysis people and Data Engineers, who are our target audience, are not familiar with protobuf, so we should also extend the documentation itself.

P.S. It is nice that I was stuck in my other work and did not write a lot of code for the new Python API, because otherwise I would need to rewrite it all again with proto :)

SemyonSinchenko commented 1 month ago

@acezen Since this is a very big change, should we discuss it first on the mailing list, perhaps with a vote? For a change this big, the pros and cons should be known to everyone.

acezen commented 1 month ago

@acezen Since this is a very big change, should we discuss it first on the mailing list, perhaps with a vote? For a change this big, the pros and cons should be known to everyone.

Hi @SemyonSinchenko, sorry for the confusion; this PR is just an example, a showcase of code for the format definition. As you said, it should be discussed, and I will open a discussion about the proposal and the changes (it may take some time to list the pros and cons of protobuf). The goal I want to achieve is that we have a strategy to standardize the format definition that is easy to maintain and evolve, and cross-language compatible.

acezen commented 1 month ago

Based on my experience with Apache Spark itself, the protobuf way may be a painful story:

  • you need to incorporate it into CI;
  • the generated code is huge (for Java it may be thousands of LOC);
  • the generated code is unreadable;
  • debugging is very hard;
  • we need to decide whether we are going to store the generated code in git or not (there are pros and cons).

Btw, my first question is: are we able to use buf (https://buf.build/docs/introduction)? Apache Spark itself uses buf because it significantly simplifies the process, and I guess there shouldn't be a conflict with ASF rules. If we can, I would recommend creating buf.work.yaml and buf.gen.yaml. You may check the examples in Apache Spark (https://github.com/apache/spark/tree/master/connector/connect/common/src/main), or I can do it, since I have some experience with it.

My second question: are we going to store the generated code in git or not? Apache Spark uses a mixed approach, where JVM classes are generated on the fly (via the protobuf Maven plugin) but the generated Python files are stored in git. For Python they have a cool solution: https://github.com/apache/spark/blob/master/dev/connect-gen-protos.sh, an auto-formatter for the generated code that keeps the diffs in git PRs readable.

Based on my experience, Data Science / Network Analysis people and Data Engineers, who are our target audience, are not familiar with protobuf, so we should also extend the documentation itself.

P.S. It is nice that I was stuck in my other work and did not write a lot of code for the new Python API, because otherwise I would need to rewrite it all again with proto :)

Thanks, Sem, the comments and advice are very helpful and insightful. buf is one possible solution, and we do need to discuss it!

SemyonSinchenko commented 1 month ago

Could we add buf to the build system? It is under Apache 2.0, and we actually only need the binaries. https://github.com/bufbuild/buf

acezen commented 1 month ago

Could we add buf to the build system? It is under Apache 2.0, and we actually only need the binaries. https://github.com/bufbuild/buf

Good idea, I will give it a try.

SemyonSinchenko commented 3 weeks ago

@acezen Could we merge it into a separate branch? I want to try to add buf and some additional scripts, especially for the Python part.

acezen commented 3 weeks ago

@acezen Could we merge it into a separate branch? I want to try to add buf and some additional scripts, especially for the Python part.

Sure, I can create a branch for the format and merge this PR into it.

acezen commented 3 weeks ago

@acezen Could we merge it into a separate branch? I want to try to add buf and some additional scripts, especially for the Python part.

Hi Sem, I have created a format-definition-dev branch and merged the format definition into it. You can develop based on that branch.

acezen commented 3 weeks ago

The change has been merged into the format-definition-dev branch for continued development, so I'm closing this PR.