apache / incubator-graphar

An open source, standard data file format for graph data storage and retrieval.
https://graphar.apache.org/
Apache License 2.0

weird source code structure (maybe bug?) #593

Closed yecol closed 3 months ago

yecol commented 3 months ago

Describe the bug, including details regarding any error messages, version, and platform.

[Image: Screenshot 2024-08-14 at 4 29 59 PM]

I just checked out the latest code and found something weird. As shown in the screenshot above, some code seems to live under the directories datasources-34 and datasources-35?

Component(s)

Spark

SemyonSinchenko commented 3 months ago

@yecol Our goal is to support multiple versions of Apache Spark. Unfortunately, the DataSource API of Apache Spark is a Developer API and changes dramatically from one Spark version to another. Sometimes the changes are so big that reflection is not enough.

We decided to split the datasource implementation into separate Maven submodules, one per supported Spark version.

We have the following Maven profiles:

    <profiles>
        <profile>
            <id>datasources-32</id>
            <properties>
                <sbt.project.name>graphar</sbt.project.name>
                <spark.version>3.2.4</spark.version>
            </properties>
            <modules>
                <module>graphar</module>
                <module>datasources-32</module>
            </modules>
        </profile>
        <profile>
            <id>datasources-33</id>
            <properties>
                <sbt.project.name>graphar</sbt.project.name>
                <spark.version>3.3.4</spark.version>
            </properties>
            <modules>
                <module>graphar</module>
                <module>datasources-33</module>
            </modules>
        </profile>
        <profile>
            <id>datasources-34</id>
            <properties>
                <sbt.project.name>graphar</sbt.project.name>
                <spark.version>3.4.3</spark.version>
            </properties>
            <modules>
                <module>graphar</module>
                <module>datasources-34</module>
            </modules>
        </profile>
        <profile>
            <id>datasources-35</id>
            <properties>
                <sbt.project.name>graphar</sbt.project.name>
                <spark.version>3.5.1</spark.version>
            </properties>
            <modules>
                <module>graphar</module>
                <module>datasources-35</module>
            </modules>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
        </profile>
    </profiles>

Each of the subfolders is actually a subproject in Maven.

So, with this approach, we are able to build GraphAr Spark against different versions of Spark itself.

At the moment, this approach is used in our CI, where we run the tests for all the supported Maven profiles.
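
To make the profile selection concrete, here is a minimal sketch of how one could drive a build per profile. The profile ids and Spark versions are copied from the `<profiles>` block above, and `-P` is Maven's standard flag for activating a profile. The helper name `maven_command` and the `clean test` goals are assumptions for illustration only; the actual CI workflow is not shown in this thread and may well differ.

```python
import shlex

# Profile ids and Spark versions copied from the <profiles> block above.
SPARK_VERSION_BY_PROFILE = {
    "datasources-32": "3.2.4",
    "datasources-33": "3.3.4",
    "datasources-34": "3.4.3",
    "datasources-35": "3.5.1",  # marked activeByDefault in the pom
}


def maven_command(profile, goals=("clean", "test")):
    """Build the argv that runs the given Maven goals with one profile activated.

    Hypothetical helper: the goals are an assumption, not the project's CI script.
    """
    if profile not in SPARK_VERSION_BY_PROFILE:
        raise ValueError(f"unknown profile: {profile}")
    return ["mvn", "-P", profile, *goals]


if __name__ == "__main__":
    # Print one command per supported profile instead of executing it.
    for profile, spark_version in SPARK_VERSION_BY_PROFILE.items():
        print(f"# Spark {spark_version}")
        print(shlex.join(maven_command(profile)))
```

Because `datasources-35` is active by default, a plain `mvn` invocation without `-P` would build against Spark 3.5.1. Passing `-P datasources-34` explicitly selects that profile instead, since Maven deactivates `activeByDefault` profiles when another profile from the same POM is activated.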

SemyonSinchenko commented 3 months ago

An alternative way to provide this support is to use tags/branches. But for me it is better to have Maven sub-projects. At any given moment, about 4-5 versions of Spark are maintained, so I don't think the amount of duplicated code will grow indefinitely: spark-3.2 will reach EoL soon, for example, so we can drop it, etc.

yecol commented 3 months ago

I see. It makes sense! I wasn't aware that the datasource API diverges across Spark versions. Thanks for your kind and detailed response!