apache / incubator-graphar

An open source, standard data file format for graph data storage and retrieval.
https://graphar.apache.org/
Apache License 2.0

bug(c++,spark): clean the output directory when generating data in unit test #584

Open acezen opened 3 months ago

acezen commented 3 months ago

Describe the bug, including details regarding any error messages, version, and platform.

In the C++ and Spark unit tests, we usually write the GraphAr data to the /tmp directory and then check the number of generated files with an assertion. But the C++ and Spark unit tests may leave stale files behind for each other and make the assertion fail.

I suggest we clean the output directory before writing out the files in the C++ and Spark unit tests.

Solution

As Sem suggested, we can make the clean operation part of the top-level make clean of the C++/Spark libraries.
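A minimal sketch of the idea, assuming hypothetical subproject directory names (cpp/, maven/, pyspark/), each with its own Makefile providing a clean target:

```makefile
# Top-level Makefile sketch: `make clean` just delegates to each subproject.
# Directory names are assumptions, not the actual repo layout.
.PHONY: clean
clean:
	$(MAKE) -C cpp clean
	$(MAKE) -C maven clean
	$(MAKE) -C pyspark clean
```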

Component(s)

C++, Spark

SemyonSinchenko commented 3 months ago

What do you think about making it a part of the top-level make clean command?

acezen commented 3 months ago

What do you think about making it a part of the top-level make clean command?

Good advice, and that applies to C++ too!

SumitkumarSatpute commented 3 months ago

I came across this issue and would love to help out. Is there any additional information or context I should be aware of before I get started?

Looking forward to contributing!

acezen commented 3 months ago

I came across this issue and would love to help out. Is there any additional information or context I should be aware of before I get started?

Looking forward to contributing!

Hi @SumitkumarSatpute, thanks for your interest in GraphAr. The temporary data is generated by the write unit tests:

So I think you need to find the output directories the unit tests write to, such as /tmp/vertex, /tmp/edge, and /tmp/ldbc, and clean them via the top-level make clean of the libraries.
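For illustration, a minimal clean rule along those lines might look like this (the directory list is whatever the write tests actually produce; the target name is hypothetical):

```makefile
# Sketch of a per-library rule that removes the GraphAr test output
# mentioned above, hooked into the existing `clean` target.
.PHONY: clean-test-output
clean-test-output:
	rm -rf /tmp/vertex /tmp/edge /tmp/ldbc

clean: clean-test-output
```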

Feel free to ask if you have any questions, and enjoy the trip :)

SemyonSinchenko commented 3 months ago

I see it in the following way:

acezen commented 3 months ago

I see it in the following way:

  • we have a Makefile in each subproject (cpp, maven, and pyspark already have one);
  • each subproject Makefile contains a clean command that deletes all the created temporary data, all the generated code, all the compiled classes, etc. For Maven it should be something like mvn clean plus deleting the corresponding tmp folder and downloaded artifacts such as the Spark binaries (see the sketch after this comment);
  • we have a top-level Makefile with a clean command that simply runs clean in each subproject, one by one

Good addition, thanks Sem.
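A minimal sketch of what the Maven subproject clean could look like under that scheme (the /tmp paths are taken from this thread; the Spark binary name is an assumption):

```makefile
# Maven subproject Makefile sketch: run `mvn clean`, then remove the
# temporary test output and downloaded artifacts.
.PHONY: clean
clean:
	mvn clean
	rm -rf /tmp/vertex /tmp/edge /tmp/ldbc
	rm -rf spark-*-bin-hadoop*  # downloaded Spark binaries; exact name is an assumption
```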

SumitkumarSatpute commented 2 months ago

Please let me know how to reproduce this scenario for C++, Spark, or the other components.

SemyonSinchenko commented 2 months ago

For Maven it is enough to run the tests the same way they run in CI: mvn test