AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0

copybook meta data for RDBMS #634

Closed sree018 closed 1 year ago

sree018 commented 1 year ago

Background

Currently, copybook metadata is exposed as a Spark schema, but we need an RDBMS-style schema.

Example [Optional]

'''
01 MASTER-RECORD.
   02 RDT-TLF-MTHD-NM PIC X(08).
   02 RDT-ADJ-ORGN-TRAN-DT PIC 9(06).
   02 FILLER PIC X(03).
   02 RDT-ADDL-DATA-GROUP.
      05 RDT-ADDL-DATA OCCURS 0 TO 2 TIMES DEPENDING ON RDT-ADDL-SEGS-NO.
         10 RDT-ADDL-SEG-KEY.
            15 RDT-ADDL-SEG-KEY-PROD PIC X(02).
            15 RDT-ADDL-SEG-KEY-TYPE PIC S9(15)V99 COMP-3.
'''

Current schema:

root
|-- RDT-TLF-MTHD-NM String
|-- RDT-ADJ-ORGN-TRAN-DT integer
|-- RDT-ADDL-DATA-GROUP
|-- RDT-ADDL-SEG-KEY
|-- RDT-ADDL-SEG-KEY-PROD String
|-- RDT-ADDL-SEG-KEY-TYPE DECIMAL (15,2)

Expected output:

|-- RDT-TLF-MTHD-NM VARCHAR(08)
|-- RDT-ADJ-ORGN-TRAN-DT integer (06)
|-- RDT-ADDL-DATA-GROUP
|-- RDT-ADDL-SEG-KEY
|-- RDT-ADDL-SEG-KEY-PROD VARCHAR(08)
|-- RDT-ADDL-SEG-KEY-TYPE DECIMAL (15,2)

We are able to get element lengths only at the parent level, before flattening:

df.schema.fields(0).metadata.getLong("maxLength")

Is there any option to get the expected schema?
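For reference, the per-field lookup above can be extended to nested fields by walking the schema recursively. The following is only a sketch: `NestedLengths` and `collect` are hypothetical names, and it assumes the `maxLength` metadata key is also attached to string fields inside nested structs and arrays, which the thread does not confirm.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not part of Cobrix or Spark.
object NestedLengths {
  // Walks structs and arrays of structs, returning (dotted path, maxLength)
  // for every field that carries the "maxLength" metadata key.
  def collect(schema: StructType, prefix: String = ""): Seq[(String, Long)] =
    schema.fields.toSeq.flatMap { f =>
      val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      // Length of the field itself, if the metadata key is present.
      val own =
        if (f.metadata.contains("maxLength")) Seq(path -> f.metadata.getLong("maxLength"))
        else Seq.empty
      // Descend into nested structs and arrays of structs.
      val nested = f.dataType match {
        case s: StructType               => collect(s, path)
        case ArrayType(s: StructType, _) => collect(s, path)
        case _                           => Seq.empty
      }
      own ++ nested
    }
}
```

Calling `NestedLengths.collect(df.schema)` would then list every field that has a length, not only `fields(0)`.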

yruslan commented 1 year ago

Spark does not have varchar() or integer(6) data types, only string and integer, so the expected output you specified is not possible.

However, it could be possible to retain metadata after schema flattening. How do you flatten the schema?

sree018 commented 1 year ago

SparkUtils.flattenSchema(df, useShortFieldNames = false)

yruslan commented 1 year ago

I've tested if retaining the metadata is possible, and it is.

This PR makes SparkUtils.flattenSchema() retain metadata: https://github.com/AbsaOSS/cobrix/pull/635

It is already merged into master. Please test it if you can, and let me know if it works for you.
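Since Spark itself has no length-qualified string type, RDBMS-style type names would have to be derived outside Spark from the retained metadata. A minimal sketch of that idea, assuming the flattened schema carries `maxLength` on string columns; `RdbmsDdlSketch`, `columnType` and `describe` are hypothetical names, and the output strings are illustrative rather than valid DDL for any particular database:

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not part of Cobrix or Spark.
object RdbmsDdlSketch {
  // Maps one Spark field to an RDBMS-style type name.
  def columnType(field: StructField): String = field.dataType match {
    // Strings get a length only when the "maxLength" metadata is present.
    case StringType if field.metadata.contains("maxLength") =>
      s"VARCHAR(${field.metadata.getLong("maxLength")})"
    case d: DecimalType => s"DECIMAL(${d.precision},${d.scale})"
    // Everything else falls back to Spark's own SQL type name.
    case other => other.sql
  }

  // One "name type" line per top-level column of the (flattened) schema.
  def describe(schema: StructType): Seq[String] =
    schema.fields.toSeq.map(f => s"${f.name} ${columnType(f)}")
}
```

With the change from the PR, this could be applied as `RdbmsDdlSketch.describe(SparkUtils.flattenSchema(df, useShortFieldNames = false).schema)`.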

sree018 commented 1 year ago

@yruslan

The new feature is working.

Thanks for the feature.

yruslan commented 1 year ago

Awesome! Thanks for letting me know.