AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

Is it possible to read a nested Binary Field? #658

Open Il-Pela opened 9 months ago

Il-Pela commented 9 months ago

Background

Let's say I'm reading a "normal" Avro file using Spark. One of the fields in the Avro schema is a binary field encoded as EBCDIC that should be decoded using a COBOL copybook referenced by another field within the same schema. Each record can potentially have its own copybook (so the binary field may have a different layout for each record), and the goal is to produce a JSON version of the binary field to store somewhere else.

The DF looks something like this:

| ID | SCHEMA_ID | BINARY_FIELD | FIELD1 | FIELD2 | ... |
|----|-----------|--------------|--------|--------|-----|
| 1  | 001       | M1B1N4R11    | valueX | valueZ | ... |
| 2  | 010       | M1B1N4R12    | valueY | valueW | ... |

And in the copycobol/ folder I have the corresponding copybooks:

Question

Is it possible to leverage the library to decode a single field instead of a file? Or do I have to save the binary field to a temporary file and decode it from there?

Thank you for any suggestion! :)

yruslan commented 9 months ago

Hi, thanks for your interest in the library. Yes, it is possible to use Cobrix in this case, but it can be quite involved. You can't use the spark-cobol Spark data source to decode the data; you have to do it manually, like this:

  1. You need to parse each copybook to get an AST:
    val copybookForField1 = CopybookParser.parseSimple(copyBookContents)
  2. Then, you can decode each value by applying the copybook to the binary field:
    val row = RecordExtractors.extractRecord(copybookForField1.ast, field1Bytes, 0, handler = handler)
    val record = handler.create(row.toArray, copybookForField1.ast)

    The resulting record will be an Array[Any], and each subfield can be cast to the corresponding Java data type.

  3. If you want decoding to happen in parallel, handled by Spark SQL, you can write a UDF per field. Each UDF can hold a pre-parsed copybook and just apply extractRecord() and handler.create() to each value. The resulting output can be a JSON string; a sketch follows below. See how Jackson can be used to convert each record to JSON: https://github.com/AbsaOSS/cobrix/blob/68f7362ed55db66a51293de207c4ca0d83af0c83/cobol-converters/src/test/scala/za/co/absa/cobrix/cobol/converters/extra/SerializersSpec.scala#L161
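
Putting those steps together, here is a minimal sketch of such a UDF. It reuses the calls above (CopybookParser.parseSimple, RecordExtractors.extractRecord, handler.create) plus Jackson for the JSON output; the package paths, the Map[String, Any] record type for the handler, and the placeholder names (makeDecoderUdf, copybook001Text, mapHandler) are assumptions of this sketch rather than confirmed Cobrix API, so adjust them to your Cobrix version and handler implementation:

```scala
// Sketch only: package paths follow the Cobrix 2.x layout and may differ in your version.
// The Map[String, Any] record type and the way the RecordHandler instance is obtained
// are assumptions, not confirmed API.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}
import za.co.absa.cobrix.cobol.parser.CopybookParser
import za.co.absa.cobrix.cobol.reader.extractors.record.{RecordExtractors, RecordHandler}

// Builds a UDF that decodes one EBCDIC binary field laid out according to `copybookContents`
// and returns it as a JSON string. The copybook is parsed once and captured in the UDF
// closure, so both it and the handler must be serializable to be shipped to executors.
def makeDecoderUdf(copybookContents: String,
                   handler: RecordHandler[Map[String, Any]]): UserDefinedFunction = {
  val copybook = CopybookParser.parseSimple(copybookContents)
  val mapper   = new ObjectMapper().registerModule(DefaultScalaModule)
  udf { (bytes: Array[Byte]) =>
    if (bytes == null) {
      null
    } else {
      val values = RecordExtractors.extractRecord(copybook.ast, bytes, 0, handler = handler)
      val record = handler.create(values.toArray, copybook.ast)
      mapper.writeValueAsString(record) // JSON conversion, as in the linked SerializersSpec test
    }
  }
}

// Usage sketch (copybook001Text and mapHandler are placeholders):
// val decode001 = makeDecoderUdf(copybook001Text, mapHandler)
// val decoded   = df.filter(col("SCHEMA_ID") === "001")
//                   .withColumn("BINARY_JSON", decode001(col("BINARY_FIELD")))
```

Since SCHEMA_ID determines which copybook applies, you would build one such UDF per copybook and apply it only to the matching subset of rows, so each UDF holds exactly one pre-parsed copybook.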

Let me know if you decide to go this route and run into any issues.