AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

Is it possible to read a nested Binary Field? #658

Open Il-Pela opened 9 months ago

Il-Pela commented 9 months ago

Background

Let's say I'm reading a "normal" Avro file using Spark. One of the fields in the Avro schema is a binary field encoded as EBCDIC that should be decoded using a COBOL copybook referenced by another field within the same schema. Each record can potentially have its own copybook (so the binary field may have a different layout for each record), and the goal is to produce a JSON version of the binary field to store somewhere else.

The DF looks something like this:

| ID | SCHEMA_ID | BINARY_FIELD | FIELD1 | FIELD2 | ... |
|----|-----------|--------------|--------|--------|-----|
| 1  | 001       | M1B1N4R11    | valueX | valueZ | ... |
| 2  | 010       | M1B1N4R12    | valueY | valueW | ... |

And in the copycobol/ folder I have the corresponding copybooks:

Question

Is it possible to leverage the library to decode a single field instead of a file? Or do I have to save the binary field to a temporary file and decode it from there?

Thank you for any suggestion! :)

yruslan commented 9 months ago

Hi, thanks for your interest in the library. Yes, it is possible to use Cobrix in this case, but it can be quite involved. You can't use the spark-cobol Spark data source to decode the data; you have to do it manually, like this:

  1. You need to parse each copybook to get an AST:
    val copybookForField1 = CopybookParser.parseSimple(copyBookContents)
  2. Then, you can decode each value by applying the copybook to the binary field:
    val row = RecordExtractors.extractRecord(copybookForField1.ast, field1Bytes, 0, handler = handler)
    val record = handler.create(row.toArray, copybookForField1.ast)

    The resulting record will be an Array[Any], and each subfield can be cast to the corresponding Java data type.

  3. If you want decoding to happen in parallel, handled by Spark SQL, you can write a UDF per field. Each UDF can hold a pre-parsed copybook and just apply extractRecord() and handler.create() to each value. The resulting output can be a JSON string; a sketch follows below. See how Jackson can be used to convert each record to JSON: https://github.com/AbsaOSS/cobrix/blob/68f7362ed55db66a51293de207c4ca0d83af0c83/cobol-converters/src/test/scala/za/co/absa/cobrix/cobol/converters/extra/SerializersSpec.scala#L161
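
Putting those steps together, here is a minimal sketch of such a UDF. It reuses the calls above (CopybookParser.parseSimple, RecordExtractors.extractRecord, handler.create) plus Jackson for the JSON output; the package paths, the Map[String, Any] record type for the handler, and the placeholder names (makeDecoderUdf, copybook001Text, mapHandler) are assumptions of this sketch rather than confirmed Cobrix API, so adjust them to your Cobrix version and handler implementation:

```scala
// Sketch only: package paths follow the Cobrix 2.x layout and may differ in your version.
// The Map[String, Any] record type and the way the RecordHandler instance is obtained
// are assumptions, not confirmed API.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}
import za.co.absa.cobrix.cobol.parser.CopybookParser
import za.co.absa.cobrix.cobol.reader.extractors.record.{RecordExtractors, RecordHandler}

// Builds a UDF that decodes one EBCDIC binary field laid out according to `copybookContents`
// and returns it as a JSON string. The copybook is parsed once and captured in the UDF
// closure, so both it and the handler must be serializable to be shipped to executors.
def makeDecoderUdf(copybookContents: String,
                   handler: RecordHandler[Map[String, Any]]): UserDefinedFunction = {
  val copybook = CopybookParser.parseSimple(copybookContents)
  val mapper   = new ObjectMapper().registerModule(DefaultScalaModule)
  udf { (bytes: Array[Byte]) =>
    if (bytes == null) {
      null
    } else {
      val values = RecordExtractors.extractRecord(copybook.ast, bytes, 0, handler = handler)
      val record = handler.create(values.toArray, copybook.ast)
      mapper.writeValueAsString(record) // JSON conversion, as in the linked SerializersSpec test
    }
  }
}

// Usage sketch (copybook001Text and mapHandler are placeholders):
// val decode001 = makeDecoderUdf(copybook001Text, mapHandler)
// val decoded   = df.filter(col("SCHEMA_ID") === "001")
//                   .withColumn("BINARY_JSON", decode001(col("BINARY_FIELD")))
```

Since SCHEMA_ID determines which copybook applies, you would build one such UDF per copybook and apply it only to the matching subset of rows, so each UDF holds exactly one pre-parsed copybook.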

Let me know if you decide to go this route and run into any issues.