I have a copybook where a field is defined as `PIC S9(10)V USAGE COMP-3`. When we read the file with Cobrix, this field is created as Long instead of Decimal(10,0). Is it expected this way because it is a Spark property, or can it be created as Decimal(10,0)?
Yes, Cobrix prefers `short`, `int`, and `long` for numbers that fit these types and have no scale. You can convert them to decimal using a cast: `.withColumn("my_field", col("my_field").cast("decimal(10,0)"))`.
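For example, a minimal sketch of this cast, assuming `df` is the DataFrame returned by Cobrix (the field name and precision are placeholders):

```scala
import org.apache.spark.sql.functions.col

// Cast an integral column to a decimal with an explicit precision.
// "my_field" and decimal(10,0) are illustrative placeholders.
val dfDecimal = df.withColumn("my_field", col("my_field").cast("decimal(10,0)"))
dfDecimal.printSchema()
```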
Sure, we can do that, but then we would have to maintain a per-file mapping of which fields to convert, which is why I was asking.
One way to automate this is to use extended metadata, which adds more information from the copybook to each field:

```
.option("metadata", "extended")
```
You can view which metadata fields are available like this:

```scala
df.schema.fields.foreach { field =>
  val metadata = field.metadata
  println(s"Field: ${field.name}, metadata: ${metadata.json}")
}
// Returns: {"originalName":"MY_FIELD","precision":4,"usage":"COMP-3","signed":true,"offset":17,"byte_size":2,"sign_separate":false,"pic":"S9(4)","level":2}
```
The `precision` field is the one we need.
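For a quick check of a single field, something like this works (the field name is an assumption for illustration):

```scala
// Look up one field's copybook precision; "MY_FIELD" is a placeholder name.
val field = df.schema("MY_FIELD")
if (field.metadata.contains("precision")) {
  println(s"Copybook precision: ${field.metadata.getLong("precision")}")
}
```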
You can apply the casting logic to the whole schema something like this:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, IntegerType, LongType, ShortType}

val columns = df.schema.fields.map { field =>
  val metadata = field.metadata
  if (metadata.contains("precision")) {
    val precision = metadata.getLong("precision").toInt
    if (field.dataType == LongType || field.dataType == IntegerType || field.dataType == ShortType) {
      println(s"Cast ${field.name} to decimal($precision,0)")
      col(field.name).cast(DecimalType(precision, 0)).as(field.name)
    } else {
      col(field.name)
    }
  } else {
    col(field.name)
  }
}

val dfNew = df.select(columns: _*)
dfNew.printSchema()
```
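If it works as intended, integral fields that carry a copybook `precision` should show up as `decimal(p,0)` in the printed schema, while everything else stays unchanged.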
An experimental method `SparkUtils.covertIntegralToDecimal()` has been added in 2.7.1. You can use it to post-process the DataFrame returned by Cobrix to map the schema the way you want:
```scala
import za.co.absa.cobrix.spark.cobol.utils.SparkUtils

val df = spark.read.format("cobol")
  .option("metadata", "extended")
  .load("/path/to/file")

val df2 = SparkUtils.covertIntegralToDecimal(df)
```
This is something we are going to implement as a new feature. Something like:

```
.option("strict_integral_precision", "true")
```

(not final)
The changes are available in the `master` branch. @pinakigit, please test if you can. The behavior is turned on when using this option:

```
.option("strict_integral_precision", "true")
```
The feature is available in 2.7.3: https://github.com/AbsaOSS/cobrix?tab=readme-ov-file#273-released-17-jule-2024