parthosa opened 1 week ago
The metadata properties `RequiredDataFilters` and `DictionaryFilters` are present only in Photon event logs. This metadata is saved in the `data_source_information.csv` file.
Current schema of `data_source_information.csv`:
```
root
 |-- appIndex: integer (nullable = true)
 |-- sqlID: integer (nullable = true)
 |-- sql_plan_version: integer (nullable = true)
 |-- nodeId: integer (nullable = true)
 |-- format: string (nullable = true)
 |-- buffer_time: integer (nullable = true)
 |-- scan_time: integer (nullable = true)
 |-- data_size: long (nullable = true)
 |-- decode_time: integer (nullable = true)
 |-- location: string (nullable = true)
 |-- pushedFilters: string (nullable = true)
 |-- schema: string (nullable = true)
 |-- data_filters: string (nullable = true)
 |-- partition_filters: string (nullable = true)
 |-- from_final_plan: boolean (nullable = true)
```
After adding these, the schema will contain two new columns:
```
root
 |-- appIndex: integer (nullable = true)
 ..
 |-- data_filters: string (nullable = true)
 |-- partition_filters: string (nullable = true)
 |-- required_data_filters: string (nullable = true)
 |-- dictionary_filters: string (nullable = true)
```
`required_data_filters` and `dictionary_filters` will always be empty for non-Photon event logs. Should we use a dynamic schema that excludes these columns for non-Photon event logs, or keep them as empty fields?

cc: @amahussein @tgravescs @mattahrens
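To make the fixed-schema option concrete, here is a minimal sketch of what keeping the columns as empty fields would look like. The column subset, row values, and writer function are all illustrative, not the tool's actual CSV writer:

```python
import csv
import io

# Illustrative subset of the data_source_information.csv columns,
# including the two proposed Photon-only columns.
COLUMNS = [
    "appIndex", "sqlID", "nodeId", "format",
    "data_filters", "partition_filters",
    "required_data_filters", "dictionary_filters",
]

def write_rows(rows):
    """Write rows with a fixed schema; keys missing from a row
    (e.g. Photon-only filters on a non-Photon row) become empty fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()

# A Photon row carries the new filter columns; a Spark row omits them.
photon_row = {"appIndex": 1, "sqlID": 0, "nodeId": 3, "format": "parquet",
              "data_filters": "[]", "partition_filters": "[]",
              "required_data_filters": "[(id#0L > 1)]",
              "dictionary_filters": "[]"}
spark_row = {"appIndex": 1, "sqlID": 1, "nodeId": 5, "format": "parquet",
             "data_filters": "[]", "partition_filters": "[]"}
print(write_rows([photon_row, spark_row]))
```

The trade-off: consumers of the CSV see a stable header regardless of event-log type, at the cost of two always-empty columns for non-Photon logs.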
Photon Scan nodes contain additional metadata fields not present in Spark Scan nodes, such as `RequiredDataFilters` and `DictionaryFilters`.
Example of a complete JSON representation of a PhotonScan node:
We store this metadata in `data_source_information.csv` and should ensure these additional fields are captured when parsing the PhotonScan node.
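One way to parse the node defensively is to look up the Photon-only keys with a default, so Spark Scan nodes (which lack them) still produce a complete row. A sketch under assumptions: the `metadata` dict shape and the `extract_scan_filters` helper are hypothetical, with key names taken from the property names above:

```python
def extract_scan_filters(metadata):
    """Pull filter metadata from a scan node's metadata map.

    RequiredDataFilters and DictionaryFilters exist only on PhotonScan
    nodes; .get() with an empty-string default keeps Spark Scan nodes
    parsing cleanly (hypothetical helper, not the tool's parser).
    """
    return {
        "data_filters": metadata.get("DataFilters", ""),
        "partition_filters": metadata.get("PartitionFilters", ""),
        "required_data_filters": metadata.get("RequiredDataFilters", ""),
        "dictionary_filters": metadata.get("DictionaryFilters", ""),
    }

# Photon node metadata carries the extra keys; Spark node metadata does not.
photon_meta = {"DataFilters": "[]", "PartitionFilters": "[]",
               "RequiredDataFilters": "[(id#0L > 1)]",
               "DictionaryFilters": "[]"}
spark_meta = {"DataFilters": "[]", "PartitionFilters": "[]"}

print(extract_scan_filters(photon_meta)["required_data_filters"])
print(extract_scan_filters(spark_meta)["required_data_filters"])  # empty string
```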