AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

record_format VB file fails with length of BDW block is too big #642

Open murali2812 opened 1 year ago

murali2812 commented 1 year ago

When converting a Variable Block (VB) format EBCDIC file, I got the error "The length of BDW block is too big". I tried the following options but am still getting the same error.

    dataframe = spark.read.format("cobol") \
        .option("copybook", util_params["copybook_path"]) \
        .option("encoding", "ebcdic") \
        .option("schema_retention_policy", "collapse_root") \
        .option("record_format", "VB") \
        .option("is_bdw_big_endian", "true") \
        .option("is_rdw_big_endian", "true") \
        .option("bdw_adjustment", -4) \
        .option("rdw_adjustment", -4) \
        .option("generate_record_id", True) \
        .load(file_path)

Error:

    WARN BlockManager: Putting block rdd_1_0 failed due to exception java.lang.IllegalStateException: The length of BDW block is too big. Got 1223880942. Header: 200,242,240,242, offset: 0.
    WARN BlockManager: Block rdd_1_0 could not be removed as it was not found on disk or in memory
    ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalStateException: The length of BDW block is too big. Got 1223880942. Header: 200,242,240,242, offset: 0.

Please suggest a way to fix this issue. Could you also share an example where you have tested the VB scenario with an EBCDIC file and a copybook, for reference?

yruslan commented 1 year ago

Hi,

Here is an example of processing of VB (BDW+RDW) files with a simple copybook: https://github.com/AbsaOSS/cobrix/blob/master/spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test36RdwBdwSimpleSpec.scala

murali2812 commented 1 year ago

Hi @yruslan

Thanks for the reply. We have already gone through the code above but didn't find a solution to our problem. As mentioned, whenever we try to use the VB option with the adjustments, we get the "The length of BDW block is too big" error.

    dataframe = spark.read.format("cobol") \
        .option("copybook", util_params["copybook_path"]) \
        .option("encoding", "ebcdic") \
        .option("schema_retention_policy", "collapse_root") \
        .option("record_format", "VB") \
        .option("is_bdw_big_endian", "true") \
        .option("is_rdw_big_endian", "true") \
        .option("bdw_adjustment", -4) \
        .option("rdw_adjustment", -4) \
        .option("generate_record_id", True) \
        .load(file_path)

    WARN BlockManager: Putting block rdd_1_0 failed due to exception java.lang.IllegalStateException: The length of BDW block is too big. Got 1223880942. Header: 200,242,240,242, offset: 0.
    WARN BlockManager: Block rdd_1_0 could not be removed as it was not found on disk or in memory
    ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalStateException: The length of BDW block is too big. Got 1223880942. Header: 200,242,240,242, offset: 0.

Can you please provide a solution for the above issue?

yruslan commented 1 year ago

The generic approach is to simulate the record header parser manually in a hex editor in order to understand the headers of your file.

The error message indicates that the record extractor encountered an invalid BDW block. This can happen when there is no BDW header at the specified offset.

I've also noticed that the error happens at offset 0. Are you sure your file has BDW+RDW headers?

What are the first 8 bytes of your file?
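
For illustration, here is a minimal Python sketch of this manual check (outside of Spark; the file path is a placeholder), assuming big-endian BDW+RDW headers:

    import struct

    # Read the first 8 bytes of the file (placeholder path).
    with open("/path/to/ebcdic_file.dat", "rb") as f:
        header = f.read(8)

    print("First 8 bytes:", ", ".join(str(b) for b in header))

    # A big-endian BDW is a 2-byte block length followed by two zero bytes;
    # the RDW that should follow it has the same layout for the record length.
    bdw_len, bdw_zero = struct.unpack(">HH", header[:4])
    rdw_len, rdw_zero = struct.unpack(">HH", header[4:8])
    print("Candidate BDW length:", bdw_len, "trailing half-word:", bdw_zero)
    print("Candidate RDW length:", rdw_len, "trailing half-word:", rdw_zero)

    # Implausibly large lengths or non-zero trailing half-words suggest the
    # file does not actually start with BDW+RDW headers.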

jaysara commented 9 months ago

I am having a similar error. My file is located at https://raw.githubusercontent.com/jaysara/spark-cobol-jay/main/data/ebcdic_bdwrdw.dat

This is how this file is parsed outside of Spark:

    Record -1:  BDW = 430   RDW = 426   HEADER        Record Type = 422
    Record -2:  BDW = 968   RDW = 964   BASE-SEGMENT  Record Type = 960
    Record -3:  BDW = 768   RDW = 764   BASE-SEGMENT  Record Type = 760
    Record -4:  BDW = 1034  RDW = 1030  BASE-SEGMENT  Record Type = 1026
    .......
    Record -12: BDW = 430   RDW = 426   TRAILER       Record Size = 420

(The last record is the TRAILER.)

This file has a total of 12 records (including the header and trailer). I am using the following:

        Dataset<Row> df1 = spark.read()
                .format("za.co.absa.cobrix.spark.cobol.source")
                .option("copybook_contents", copybook)
                .option("encoding", "ebcdic")
                .option("record_format", "VB") // variable-length records with BDW+RDW
                .option("is_rdw_big_endian", "true")
                .option("is_bdw_big_endian", "true")
                .option("schema_retention_policy", "collapse_root")
                .option("bdw_adjustment", -4)
                .option("rdw_adjustment", -4)
                .load(filePath); // filePath: placeholder for the data file path

This is my copybook contents

copybook =

        01  RECORD.
            05  BASE-SEGMENT                   PIC X(123).

The above file has a BDW for each single record. That may not be the typical case; more commonly, we will have one BDW covering multiple records, e.g.:

    Record -2: BDW = 1732  RDW = 964  BASE-SEGMENT  Record size = 960
    Record -3:             RDW = 764  BASE-SEGMENT  Record size = 760

What else should I specify? Here is the error that I get:

Caused by: java.lang.IllegalStateException: The length of BDW block is too big. Got 1895101420. Header: 240,244,243,240, offset: 0.
    at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderCommon.reportTooLargeBlockLength(RecordHeaderDecoderCommon.scala:53)
    at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderBdw.validateBlockLength(RecordHeaderDecoderBdw.scala:86)
    at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderBdw.getRecordLength(RecordHeaderDecoderBdw.scala:48)
yruslan commented 9 months ago

Hi,

The example file starts with 0xC3 0xB0 0xC3 0xB4 (which is the same as reported by the error message: Header: 240,244,243,240).

Please clarify how you parsed the file to get BDW=430 and RDW=426. Which bytes of the file did you use?

jaysara commented 9 months ago

I apologize, I made an error in uploading the file. The EBCDIC file with BDW and RDW headers is at https://raw.githubusercontent.com/jaysara/spark-cobol-jay/main/data/bdw-rdw-sample-ebcdic.dat and the ASCII equivalent of this file is at https://raw.githubusercontent.com/jaysara/spark-cobol-jay/main/data/bdw-rdw-sample.txt

Here are the read options that I use:

Dataset<Row> df1 =  spark.read()
                .format("za.co.absa.cobrix.spark.cobol.source")
                .option("copybook_contents", copybook)
                .option("encoding", "ebcdic")
                .option("record_format", "VB") // Variable length records
                .option("is_rdw_big_endian", "false")
                .option("is_bdw_big_endian", "false")
                .option("schema_retention_policy", "collapse_root")
                .option("bdw_adjustment", -4)
                .option("rdw_adjustment", -4)
                .load("/Users/jsaraiy/Sandbox/spark-cobol-jay/data/ebcdic-bdw-rdw.dat");

I get the following error:

Caused by: java.lang.IllegalStateException: The length of BDW block is too big. Got 1961947628. Header: 240,241,240,244, offset: 0.
    at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderCommon.reportTooLargeBlockLength(RecordHeaderDecoderCommon.scala:53)
    at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderBdw.validateBlockLength(RecordHeaderDecoderBdw.scala:86)

If I change the record_format from VB to V (.option("record_format", "V")), the program runs without error; however, it does not parse out the segments correctly. It all comes out as one row, like below:

    +--------------------+
    |             SEGMENT|
    +--------------------+
    |0100HEADER 3 NOT ...|
    +--------------------+

yruslan commented 9 months ago

Hi, the corrected files still have neither BDW nor RDW headers. BDW and RDW headers are binary fields, while your file contains only text fields.

More on BDW headers: https://www.ibm.com/docs/en/zos/2.1.0?topic=records-block-descriptor-word-bdw
More on RDW headers: https://www.ibm.com/docs/en/zos/2.1.0?topic=records-record-descriptor-word-rdw
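
To make the binary layout concrete, here is a minimal Python sketch (hypothetical file name and payloads) that builds a small VB block following the IBM layout linked above, where each header is 4 bytes: a big-endian 2-byte length that includes the header itself, followed by two zero bytes:

    import struct

    # Hypothetical record payloads, EBCDIC-encoded with the cp037 codec.
    records = [s.encode("cp037") for s in ("HELLO", "WORLD")]

    block_body = b""
    for payload in records:
        # RDW: 2-byte big-endian record length that includes the 4-byte RDW
        # itself, followed by two zero bytes.
        block_body += struct.pack(">H", len(payload) + 4) + b"\x00\x00" + payload

    # BDW: 2-byte big-endian block length that includes the 4-byte BDW itself,
    # followed by two zero bytes.
    bdw = struct.pack(">H", len(block_body) + 4) + b"\x00\x00"

    with open("vb_sample.dat", "wb") as f:  # hypothetical output file
        f.write(bdw + block_body)

Since the stored lengths include the 4-byte headers, a file built this way corresponds to the bdw_adjustment = -4 and rdw_adjustment = -4 options used earlier in this thread.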

If the file has variable-length records, there are a few options available.

From my experience, quite often the team that handles copying of data from the mainframe can adjust conversion options to include RDW headers. This is the most reliable way of getting the data as accurate as possible.
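
For example, if the extract includes RDW headers (but no BDW), the file would be read with record_format = V instead of VB. A PySpark sketch with placeholder paths, assuming an existing SparkSession and RDW values that include the 4-byte header as in the examples above:

    # Sketch: reading an extract that has RDW headers only (no BDW).
    df = spark.read.format("cobol") \
        .option("copybook", "/path/to/copybook.cpy") \
        .option("encoding", "ebcdic") \
        .option("record_format", "V") \
        .option("is_rdw_big_endian", "true") \
        .option("rdw_adjustment", -4) \
        .load("/path/to/data.dat")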