AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0

Data fetched in dataframe is blank or null #312

Open kanika167 opened 4 years ago

kanika167 commented 4 years ago

Background [Optional]

I have a copybook file like -

01 XXXXXX
   04 AAAAA PIC X(10).
   04 BAAAA PIC X(4).
   04 CAAAA PIC X(4).
   04 DAAAA PIC XX.

The data file (.txt) contains fixed-length fields matching the copybook. When I read it into a DataFrame, I get only one column, named XXXXXX, and each row holds the actual columns as a nested list. Even there, the values are either null or blank:

XXXXXX

[,,,null,]
[,,,null,]
[,,,null,]
[,,,null,]
[,,,null,]

Question

What am I doing wrong above?

yruslan commented 4 years ago

By default, Cobrix retains the root GROUP by putting all columns under the corresponding struct field. You can use a different schema retention policy to get your columns at the root level: .option("schema_retention_policy", "collapse_root")

Also, judging by the sample output, the data hasn't been decoded properly. Use .option("debug", "true") to investigate what is being decoded.
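A minimal sketch of the two suggestions above, assuming the spark-cobol JAR is on the classpath and reusing the `test.cob` and `/user/data` paths from this thread. The short format name `cobol` is assumed to be registered by the Cobrix version in use, and the application name is purely illustrative.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .appName("cobrix-collapse-root-sketch") // hypothetical name, for illustration only
  .getOrCreate()

val df = spark.read
  .format("cobol")                                    // assumed short name for the Cobrix data source
  .option("copybook", "test.cob")                     // copybook path from the thread
  .option("schema_retention_policy", "collapse_root") // put the fields at the root level
  .option("debug", "true")                            // request debugging output for decoded fields
  .load("/user/data")                                 // data path from the thread

df.printSchema()              // the root group XXXXXX should no longer wrap the fields
df.show(5, truncate = false)  // inspect a few records without truncating values
```

If the fields still come back null or blank with `debug` enabled, the raw values it exposes are the place to check whether the file layout and encoding match what the copybook describes.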

kanika167 commented 4 years ago

I am trying to run the following command on spark shell

val df = spark.read.format("za.co.cobrix.spark.cobol.source").option("copybook","test.cob").load("/user/data")

I have passed the required JARs: spark-cobol, cobol-parser, scodec-bits/core, and antlr4-runtime-4.8-1 (without the last one I was getting a NoClassDefFoundError for org/antlr/v4/runtime/CharStreams),

but now I am getting the exception below:

java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).

For security reasons I can't share the actual copybook and data files. Spark version: 2.2.0 Cloudera4. Also, where can I find the documentation for this API?

Update: never mind this issue. Using an uber JAR resolved it.

kanika167 commented 4 years ago

But I am already using option("schema_retention_option","collapse_root"), and it made no difference to the schema structure.

yruslan commented 4 years ago

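The option("schema_retention_policy","collapse_root") should make a difference. Note that the option name is schema_retention_policy, not schema_retention_option. Try comparing the outputs of df.printSchema with and without it.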
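One way to run that comparison is sketched below, under the same assumptions as the rest of this thread: the spark-cobol JAR is available, the `cobol` short format name is registered, and `test.cob` and `/user/data` are the paths used earlier. The value names `retained` and `collapsed` are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().getOrCreate()

// Default behaviour: the root group is kept as a struct column.
val retained = spark.read
  .format("cobol")                // assumed short name for the Cobrix data source
  .option("copybook", "test.cob")
  .load("/user/data")

// Same read with the root group collapsed.
val collapsed = spark.read
  .format("cobol")
  .option("copybook", "test.cob")
  .option("schema_retention_policy", "collapse_root")
  .load("/user/data")

retained.printSchema()
collapsed.printSchema()

// If the option took effect, the two schemas should differ.
println(s"Schemas identical: ${retained.schema == collapsed.schema}")
```

If the two schemas print identically, the option is most likely not being picked up, and the option name is the first thing to check.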