AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0

Add an option to store casting errors in a separate field #723

Open yruslan opened 1 month ago

yruslan commented 1 month ago

Background

Currently, if EBCDIC data fails to cast to the target type, for example when invalid bytes are provided for COMP-3 decoding, Cobrix silently returns null.

It would be great if such casting errors were gathered in a special column of the returned dataset.

spark-csv adds a '_corrupt_record' column when it can't parse a CSV record.

In Cobrix's case, the column name can be chosen by the user, and the column should contain an array of issues.

Feature

Add an option to store casting errors in a separate field.

Example

.option("decode_error_column", "errors")

Which might return something like:

{ 
   /*...*/
   "errors": [
      "Decoding error for COMP-3, bytes: 0x01231A",
      "Decoding error for COMP, 4 digits, overflow, number=12345, bytes: 0x011223"
   ]
}
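
To make the first message above concrete, here is a minimal sketch in plain Python (not Cobrix code) of a COMP-3 decoder that returns an error message instead of silently yielding null. The function name `decode_comp3` is hypothetical, and the sketch assumes that only the sign nibbles 0xC, 0xD and 0xF are accepted; a real decoder may accept others.

```python
from typing import Optional, Tuple

# Sign nibbles accepted by this sketch (an assumption, not Cobrix behavior).
_SIGNS = {0xC: 1, 0xD: -1, 0xF: 1}


def decode_comp3(data: bytes) -> Tuple[Optional[int], Optional[str]]:
    """Decode a packed-decimal (COMP-3) field.

    Returns (value, None) on success or (None, message) on failure.
    """
    if not data:
        return None, "Decoding error for COMP-3, empty input"
    # Split every byte into its high and low nibble.
    nibbles = [n for b in data for n in (b >> 4, b & 0x0F)]
    *digits, sign = nibbles
    # All nibbles except the last must be decimal digits; the last is the sign.
    if sign not in _SIGNS or any(d > 9 for d in digits):
        return None, f"Decoding error for COMP-3, bytes: 0x{data.hex().upper()}"
    value = 0
    for d in digits:
        value = value * 10 + d
    return _SIGNS[sign] * value, None


print(decode_comp3(bytes.fromhex("12345C")))
print(decode_comp3(bytes.fromhex("01231A")))
```

Returning the message alongside the value lets the caller decide whether to keep it, which is what an opt-in error column would need.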

Proposed Solution

Add errors only if the setting is enabled, since collecting them might impact performance and output size.
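
A rough sketch of the opt-in behavior, again in plain Python rather than Cobrix internals. The names `decode_comp` and `decode_record` are hypothetical, and the COMP field is assumed to be a big-endian signed binary limited to a given number of decimal digits. The errors array is added only when a column name is supplied; otherwise the record looks the same as today, with null for fields that failed to decode.

```python
from typing import Dict, List, Optional, Tuple


def decode_comp(data: bytes, digits: int) -> Tuple[Optional[int], Optional[str]]:
    """Decode a big-endian signed binary (COMP) field limited to `digits` digits."""
    value = int.from_bytes(data, byteorder="big", signed=True)
    if abs(value) >= 10 ** digits:
        return None, (
            f"Decoding error for COMP, {digits} digits, overflow, "
            f"number={value}, bytes: 0x{data.hex().upper()}"
        )
    return value, None


def decode_record(raw: Dict[str, bytes], digits: int,
                  error_column: Optional[str] = None) -> Dict[str, object]:
    """Decode all fields; collect errors only when error_column is set."""
    record: Dict[str, object] = {}
    errors: List[str] = []
    for name, data in raw.items():
        value, error = decode_comp(data, digits)
        record[name] = value  # None when decoding failed
        if error is not None and error_column is not None:
            errors.append(error)
    if error_column is not None:
        # The column is always an array, empty when nothing went wrong.
        record[error_column] = errors
    return record


raw = {"A": (1234).to_bytes(2, "big"), "B": (12345).to_bytes(2, "big")}
print(decode_record(raw, 4))
print(decode_record(raw, 4, error_column="errors"))
```

In this sketch the message string is still built when the option is off. A real implementation would probably skip that work entirely to avoid the performance cost.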