Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0
210 stars 19 forks source link

Issue 288: Add log messages #289

Closed fpj closed 5 months ago

fpj commented 6 months ago

Description

It adds log messages to be able to trace activity when executing such as the following:

spark_data.write.format("qbeast")
            .mode("append")
            .options(**qbeast_options)
            .save(output_path)

The code at the moment is very scarce on log messages, in factr, I couldn't find any in the part of the code I was inspecting. It is going to take multiple PRs to populate the code base with logging messages appropriately.

Fixes #288

Type of change

It is adding log messages using the spark internal logging trait. As such, we have to add a line to specific classes to have them using org.apache.spark.internal.Logging.

In the log messages I added so far, I have prefixed it with Qbeast: to make it easier to locate the messages among all INFO messages that Spark generated. That's not ideal, and I'd much rather have the logs printing the fully-qualified class name. We might want to change it before merging.

Checklist:

Here is the list of things you should do before submitting this pull request:

How Has This Been Tested? (Optional)

I have validated locally with spark-shell that the messages are printing.

Test Configuration:

codecov[bot] commented 6 months ago

Codecov Report

Attention: Patch coverage is 95.83333% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 90.77%. Comparing base (277b170) to head (f3b476f). Report is 1 commits behind head on main.

Files Patch % Lines
...ala/io/qbeast/spark/delta/writer/BlockWriter.scala 66.66% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #289 +/- ## ========================================== - Coverage 90.78% 90.77% -0.01% ========================================== Files 98 98 Lines 2604 2623 +19 Branches 346 336 -10 ========================================== + Hits 2364 2381 +17 - Misses 240 242 +2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

fpj commented 6 months ago

With the recent commit, the output looks like this:

24/03/26 12:29:53 INFO QbeastDataSource: Getting Qbeast table /path/nyc-qbeast-table
24/03/26 12:29:53 TRACE QbeastDataSource: Start: Create Qbeast relation /path/nyc-qbeast-table
24/03/26 12:29:53 TRACE IndexedTableImpl: Start: save table /path/nyc-qbeast-table
24/03/26 12:29:53 TRACE IndexedTableImpl: Begin: Writing data to table /path/nyc-qbeast-table
24/03/26 12:29:53 DEBUG IndexedTableImpl: Writing data to table /path/nyc-qbeast-table, with revision 0
24/03/26 12:29:53 TRACE IndexedTableImpl: Begin: Writing data to table /path/nyc-qbeast-table
24/03/26 12:29:53 TRACE SparkOTreeManager: Begin: index with revision Revision(0,1711452593440,/path/nyc-qbeast-table,50000,Vector(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@1a4589e9)),Vector())
24/03/26 12:29:53 TRACE DoublePassOTreeDataAnalyzer: Begin: analyze for index Revision(0,1711452593440,/path/nyc-qbeast-table,50000,Vector(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@1a4589e9)),Vector())
24/03/26 12:29:54 DEBUG DoublePassOTreeDataAnalyzer: Computing partition-level cube domains for index Revision(0,1711452593440,/path/nyc-qbeast-table,50000,Vector(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@1a4589e9)),Vector()), Revision(1,1711452594393,/path/nyc-qbeast-table,50000,Vector(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@1a4589e9)),Vector(LinearTransformation(-750.0,386983.63,198550.18258925626,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformation(0.0,185250.45,46861.16844627305,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformation(0,9,7,io.qbeast.core.model.LongDataType$@1a4589e9))), false
24/03/26 12:29:54 DEBUG DoublePassOTreeDataAnalyzer: Computing global cube domains for index Revision(0,1711452593440,/path/nyc-qbeast-table,50000,Vector(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@1a4589e9)),Vector())
24/03/26 12:30:05 DEBUG DoublePassOTreeDataAnalyzer: Merging global cube domains for index Revision(0,1711452593440,/path/nyc-qbeast-table,50000,Vector(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@1a4589e9)),Vector())
24/03/26 12:30:05 DEBUG DoublePassOTreeDataAnalyzer: Estimating cube weights for index Revision(0,1711452593440,/path/nyc-qbeast-table,50000,Vector(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@1a4589e9)),Vector()), false
24/03/26 12:30:05 TRACE DoublePassOTreeDataAnalyzer: End: analyze for index Revision(0,1711452593440,/path/nyc-qbeast-table,50000,Vector(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@1a4589e9)),Vector())
24/03/26 12:30:05 TRACE SparkOTreeManager: End: index with revision Revision(0,1711452593440,/path/nyc-qbeast-table,50000,Vector(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6821f91b), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@1a4589e9)),Vector())
24/03/26 12:30:20 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/03/26 12:30:21 TRACE IndexedTableImpl: End: Writing data to table /path/nyc-qbeast-table
24/03/26 12:30:21 TRACE IndexedTableImpl: End: Done writing data to table /path/nyc-qbeast-table
24/03/26 12:30:21 TRACE IndexedTableImpl: Done: save table /path/nyc-qbeast-table
24/03/26 12:30:21 TRACE QbeastDataSource: End: Create Qbeast relation /path/nyc-qbeast-table

Note that I have configured log4j to only output Qbeast log messages with severity level TRACE, the severity level for Spark remains as WARN.

fpj commented 5 months ago

On the questions you asked earlier:

I want to understand a bit more the criteria to print the message in INFO, DEBUG or TRACE.

Roughly how I use them:

We should consider having more formal definitions with examples as part of contribution guidelines.

Should the logging be tested?

It is a good point. If the code path where the log message lies is being executed, then the corresponding log method (logDebug, logInfo) is also being executed. But, it is accurate that a test case is not necessarily going to spot whether a message is not useful for a human being or even some automation software that processes logs. If there are downstream systems consuming such logs, then we could consider having integration tests to validate the messages.

osopardo1 commented 5 months ago

On the questions you asked earlier:

I want to understand a bit more the criteria to print the message in INFO, DEBUG or TRACE.

Roughly how I use them:

  • WARN level is supposed to be critical and actionable. If the user sees a WARN, then something bad happened and it might require user intervention.
  • INFO level provides information about the execution, but not necessarily actionable and it avoids being verbose. It is not uncommon to see INFO level on in production, so it is expected to be lightweight with respect to the volume of messages generated.
  • DEBUG provides debug level info when debugging the code. It can be verbose as it is not expected to be on in production.
  • TRACE provides further detail to DEBUG on execution paths, and in particular, it indicates critical the methods that been executed. I like the idea of having begin and end messages so that we see when a method begins and ends.

We should consider having more formal definitions with examples as part of contribution guidelines.

Should the logging be tested?

It is a good point. If the code path where the log message lies is being executed, then the corresponding log method (logDebug, logInfo) is also being executed. But, it is accurate that a test case is not necessarily going to spot whether a message is not useful for a human being or even some automation software that processes logs. If there are downstream systems consuming such logs, then we could consider having integration tests to validate the messages.

Got it! Let me write the first point into an issue.

The test reasoning seems good to me. But now seems like Codecov is not executing those lines... I don't know what is the problem exactly, I am not able to detect a pattern (if it's not printing trace or debug...) and the other parts of the code are being tested as well. Do you know what can be?

osopardo1 commented 5 months ago

Aside from Codecov complains, the PR LGTM 🙌

fpj commented 5 months ago

@osopardo1 here is a sample output:

24/04/04 13:24:32 INFO QbeastDataSource: Getting Qbeast table /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch
24/04/04 13:24:32 DEBUG QbeastDataSource: Table /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch properties: Map(cubeSize -> 50000, path -> /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch, columnsToIndex -> fare_amount,trip_distance,passenger_count)
24/04/04 13:24:32 TRACE QbeastDataSource: Begin: Create Qbeast relation /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch
24/04/04 13:24:32 TRACE IndexedTableImpl: Begin: save table /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch
24/04/04 13:24:32 DEBUG IndexedTableImpl: Appending data to table /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch with revision=4
24/04/04 13:24:32 DEBUG IndexedTableImpl: Loading latest revision for table /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch with revision=4
24/04/04 13:24:33 TRACE IndexedTableImpl: Begin: Writing data to table /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch
24/04/04 13:24:33 DEBUG IndexedTableImpl: Writing data to table /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch with revision 4
24/04/04 13:24:33 TRACE IndexedTableImpl: Begin: Writing data to table /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch
24/04/04 13:24:33 TRACE SparkOTreeManager: Begin: index with revision Revision(4,1711452971649,/Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch,50000,List(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@39262d61)),List(LinearTransformation(-1.33391414E8,401092.32,-8.769787799728292E7,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0.0,389678.46,56776.69812021143,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0,9,7,io.qbeast.core.model.LongDataType$@39262d61)))
24/04/04 13:24:33 TRACE DoublePassOTreeDataAnalyzer: Begin: analyze for index with revision=Revision(4,1711452971649,/Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch,50000,List(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@39262d61)),List(LinearTransformation(-1.33391414E8,401092.32,-8.769787799728292E7,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0.0,389678.46,56776.69812021143,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0,9,7,io.qbeast.core.model.LongDataType$@39262d61))) and isReplication=false
24/04/04 13:24:33 DEBUG DoublePassOTreeDataAnalyzer: newRevision=Revision(4,1711452971649,/Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch,50000,List(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@39262d61)),List(LinearTransformation(-1.33391414E8,401092.32,-8.769787799728292E7,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0.0,389678.46,56776.69812021143,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0,9,7,io.qbeast.core.model.LongDataType$@39262d61)))
24/04/04 13:24:33 DEBUG DoublePassOTreeDataAnalyzer: Computing global cube domains for index revision Revision(4,1711452971649,/Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch,50000,List(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@39262d61)),List(LinearTransformation(-1.33391414E8,401092.32,-8.769787799728292E7,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0.0,389678.46,56776.69812021143,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0,9,7,io.qbeast.core.model.LongDataType$@39262d61)))
24/04/04 13:24:33 WARN DAGScheduler: Broadcasting large task binary with size 1505.5 KiB
24/04/04 13:25:05 WARN DAGScheduler: Broadcasting large task binary with size 1521.6 KiB
24/04/04 13:25:05 DEBUG DoublePassOTreeDataAnalyzer: Merging global cube domains for index revision Revision(4,1711452971649,/Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch,50000,List(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@39262d61)),List(LinearTransformation(-1.33391414E8,401092.32,-8.769787799728292E7,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0.0,389678.46,56776.69812021143,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0,9,7,io.qbeast.core.model.LongDataType$@39262d61)))
24/04/04 13:25:05 DEBUG DoublePassOTreeDataAnalyzer: Estimating cube weights for index revision Revision(4,1711452971649,/Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch,50000,List(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@39262d61)),List(LinearTransformation(-1.33391414E8,401092.32,-8.769787799728292E7,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0.0,389678.46,56776.69812021143,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0,9,7,io.qbeast.core.model.LongDataType$@39262d61)))
24/04/04 13:25:05 TRACE DoublePassOTreeDataAnalyzer: End: analyze for index with revision Revision(4,1711452971649,/Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch,50000,List(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@39262d61)),List(LinearTransformation(-1.33391414E8,401092.32,-8.769787799728292E7,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0.0,389678.46,56776.69812021143,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0,9,7,io.qbeast.core.model.LongDataType$@39262d61)))
24/04/04 13:25:05 TRACE SparkOTreeManager: End: index with revision Revision(4,1711452971649,/Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch,50000,List(LinearTransformer(fare_amount,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(trip_distance,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformer(passenger_count,io.qbeast.core.model.LongDataType$@39262d61)),List(LinearTransformation(-1.33391414E8,401092.32,-8.769787799728292E7,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0.0,389678.46,56776.69812021143,io.qbeast.core.model.DoubleDataType$@6f5866ec), LinearTransformation(0,9,7,io.qbeast.core.model.LongDataType$@39262d61)))
24/04/04 13:25:41 TRACE IndexedTableImpl: End: Writing data to table /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch
24/04/04 13:25:41 TRACE IndexedTableImpl: End: Done writing data to table /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch
24/04/04 13:25:41 TRACE IndexedTableImpl: End: save table /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch
24/04/04 13:25:41 TRACE QbeastDataSource: End: Create Qbeast relation /Users/fpj/tmp/test-data/nyc-qbeast-table-2nd-batch

For this command:

dfCasted.write.format("qbeast").mode("append").
                                        option("columnsToIndex", "fare_amount,trip_distance,passenger_count").
                                        option("cubeSize", "50000").
                                        save("/path/nyc-qbeast-table")