AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0

PIC S9(10)V USAGE COMP-3 is converted to long instead of Decimal(10,0) #678

Closed: pinakigit closed this issue 4 months ago

pinakigit commented 6 months ago

I have a copybook where the field is defined as PIC S9(10)V USAGE COMP-3. When we read the file with Cobrix, this field is created as Long instead of Decimal(10,0). Is this expected behavior because of a Spark property, or can it be created as Decimal(10,0)?

yruslan commented 6 months ago

Yes, Cobrix prefers short, int and long for numbers that fit those types and have no scale. You can convert such a field to decimal using a cast: .withColumn("my_field", col("my_field").cast("decimal(10,0)")).
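
For illustration, here is a minimal sketch of that cast applied end to end; the copybook path, data path, and the field name my_field are placeholders:

import org.apache.spark.sql.functions.col

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cob")  // placeholder copybook path
  .load("/path/to/data")                        // placeholder data path

// Cast the integral column produced for PIC S9(10)V COMP-3 to a decimal
val dfDecimal = df.withColumn("my_field", col("my_field").cast("decimal(10,0)"))

dfDecimal.printSchema()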

pinakigit commented 6 months ago

Sure, we can do that, but then we would have to maintain a mapping for each file of which fields to convert, hence the question.

yruslan commented 6 months ago

One way to automate this is to use extended metadata. It adds more info from the copybook for each field.

.option("metadata", "extended")

You can view which metadata fields are available like this:

df.schema.fields.foreach{ field =>
  val metadata = field.metadata
  println(s"Field: ${field.name}, metadata: ${metadata.json} ")
}

// Returns: {"originalName":"MY_FIELD","precision":4,"usage":"COMP-3","signed":true,"offset":17,"byte_size":2,"sign_separate":false,"pic":"S9(4)","level":2} 

The precision is the one we need.

You can apply the casting logic along these lines:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, IntegerType, LongType, ShortType}

val columns = df.schema.fields.map { field =>
  val metadata = field.metadata
  if (metadata.contains("precision")) {
    val precision = metadata.getLong("precision").toInt
    if (field.dataType == LongType || field.dataType == IntegerType || field.dataType == ShortType) {
      // Integral column produced from a copybook number without a scale, so cast it to decimal
      println(s"Cast ${field.name} to decimal($precision,0)")
      col(field.name).cast(DecimalType(precision, 0)).as(field.name)
    } else {
      col(field.name)
    }
  } else {
    col(field.name)
  }
}

val dfNew = df.select(columns: _*)

dfNew.printSchema
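
For the PIC S9(10)V USAGE COMP-3 field from this issue, shown here under the placeholder name MY_FIELD from the metadata example above, the printed schema would then report decimal(10,0) instead of long:

root
 |-- MY_FIELD: decimal(10,0) (nullable = true)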

yruslan commented 5 months ago

An experimental method SparkUtils.covertIntegralToDecimal() has been added in 2.7.1. You can use it to post-process the DataFrame returned by Cobrix and map the schema the way you want:

import za.co.absa.cobrix.spark.cobol.utils.SparkUtils

val df = spark.read.format("cobol")
  .option("metadata", "extended")
  .load("/path/to/file")

val df2 = SparkUtils.covertIntegralToDecimal(df)
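
To verify the effect, you could compare the schemas before and after the post-processing:

df.printSchema()   // the PIC S9(10)V COMP-3 field shows as long here
df2.printSchema()  // after post-processing it shows as decimal(10,0)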

yruslan commented 4 months ago

This is something we are going to implement as a new feature.

Something like:

.option("strict_integral_precision", "true")

(not final)

yruslan commented 4 months ago

The changes are available in the master branch. @pinakigit, please test if you can. The behavior is turned on when using this option:

.option("strict_integral_precision", "true")

yruslan commented 4 months ago

The feature is available in 2.7.3: https://github.com/AbsaOSS/cobrix?tab=readme-ov-file#273-released-17-jule-2024