AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

record_format VB file fails with length of BDW block is too big #642

Open murali2812 opened 1 year ago

murali2812 commented 1 year ago

When converting a Variable Block (VB) format EBCDIC file, I got the error "The length of BDW block is too big". I tried the following options but am still getting the same error.

    dataframe = spark.read.format("cobol") \
        .option("copybook", util_params["copybook_path"]) \
        .option("encoding", "ebcdic") \
        .option("schema_retention_policy", "collapse_root") \
        .option("record_format", "VB") \
        .option("is_bdw_big_endian", "true") \
        .option("is_rdw_big_endian", "true") \
        .option("bdw_adjustment", -4) \
        .option("rdw_adjustment", -4) \
        .option("generate_record_id", True) \
        .load(file_path)

Error:

    WARN BlockManager: Putting block rdd_1_0 failed due to exception java.lang.IllegalStateException: The length of BDW block is too big. Got 1223880942. Header: 200,242,240,242, offset: 0.
    WARN BlockManager: Block rdd_1_0 could not be removed as it was not found on disk or in memory
    ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalStateException: The length of BDW block is too big. Got 1223880942. Header: 200,242,240,242, offset: 0.

Please suggest a way to fix this issue. Could you also share an example where you have tested the VB scenario with an EBCDIC file and a copybook, for reference?

yruslan commented 1 year ago

Hi,

Here is an example of processing of VB (BDW+RDW) files with a simple copybook: https://github.com/AbsaOSS/cobrix/blob/master/spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test36RdwBdwSimpleSpec.scala

murali2812 commented 1 year ago

Hi @yruslan

Thanks for the reply. We have already gone through the code above but didn't find a solution to our problem. As mentioned, whenever we try to use the VB option with the adjustments, we get the "The length of BDW block is too big" error.

    dataframe = spark.read.format("cobol") \
        .option("copybook", util_params["copybook_path"]) \
        .option("encoding", "ebcdic") \
        .option("schema_retention_policy", "collapse_root") \
        .option("record_format", "VB") \
        .option("is_bdw_big_endian", "true") \
        .option("is_rdw_big_endian", "true") \
        .option("bdw_adjustment", -4) \
        .option("rdw_adjustment", -4) \
        .option("generate_record_id", True) \
        .load(file_path)

    WARN BlockManager: Putting block rdd_1_0 failed due to exception java.lang.IllegalStateException: The length of BDW block is too big. Got 1223880942. Header: 200,242,240,242, offset: 0.
    WARN BlockManager: Block rdd_1_0 could not be removed as it was not found on disk or in memory
    ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalStateException: The length of BDW block is too big. Got 1223880942. Header: 200,242,240,242, offset: 0.

Can you please provide a solution for the above issue?

yruslan commented 1 year ago

The generic approach is to simulate the record header parser manually in a hex editor in order to understand the headers of your file.

The error message indicates that the record extractor encountered an invalid BDW block. This can happen when there is no BDW header at the specified offset.

I've also noticed that the error happens at offset 0. Are you sure your file has BDW+RDW headers?

What are the first 8 bytes of your file?
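
For illustration, here is a minimal Python sketch of this manual check (outside of Spark; the file path is a placeholder), assuming big-endian BDW+RDW headers:

    import struct

    # Read the first 8 bytes of the file (placeholder path).
    with open("/path/to/ebcdic_file.dat", "rb") as f:
        header = f.read(8)

    print("First 8 bytes:", ", ".join(str(b) for b in header))

    # A big-endian BDW is a 2-byte block length followed by two zero bytes;
    # the RDW that should follow it has the same layout for the record length.
    bdw_len, bdw_zero = struct.unpack(">HH", header[:4])
    rdw_len, rdw_zero = struct.unpack(">HH", header[4:8])
    print("Candidate BDW length:", bdw_len, "trailing half-word:", bdw_zero)
    print("Candidate RDW length:", rdw_len, "trailing half-word:", rdw_zero)

    # Implausibly large lengths or non-zero trailing half-words suggest the
    # file does not actually start with BDW+RDW headers.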

jaysara commented 9 months ago

I am having a similar error. My file is located at https://raw.githubusercontent.com/jaysara/spark-cobol-jay/main/data/ebcdic_bdwrdw.dat

This is how this file is parsed outside of Spark:

    Record -1:  BDW = 430   RDW = 426   HEADER        Record Type = 422
    Record -2:  BDW = 968   RDW = 964   BASE-SEGMENT  Record Type = 960
    Record -3:  BDW = 768   RDW = 764   BASE-SEGMENT  Record Type = 760
    Record -4:  BDW = 1034  RDW = 1030  BASE-SEGMENT  Record Type = 1026
    .......
    Record -12: BDW = 430   RDW = 426   TRAILER       Record Size = 420

(The last record is the TRAILER.)

This file has a total of 12 records (including the header and trailer). I am using the following:

        Dataset<Row> df1 = spark.read()
                .format("za.co.absa.cobrix.spark.cobol.source")
                .option("copybook_contents", copybook)
                .option("encoding", "ebcdic")
                .option("record_format", "VB") // variable-length records with BDW+RDW
                .option("is_rdw_big_endian", "true")
                .option("is_bdw_big_endian", "true")
                .option("schema_retention_policy", "collapse_root")
                .option("bdw_adjustment", -4)
                .option("rdw_adjustment", -4)
                .load(filePath); // filePath: placeholder for the data file path

This is my copybook contents

copybook =

        01  RECORD.
            05  BASE-SEGMENT                   PIC X(123).

The above file has a BDW for each single record. That may not be the typical case; more commonly, we will have one BDW covering multiple records, e.g.:

    Record -2: BDW = 1732  RDW = 964  BASE-SEGMENT  Record size = 960
    Record -3:             RDW = 764  BASE-SEGMENT  Record size = 760

What else should I specify? Here is the error that I get:

Caused by: java.lang.IllegalStateException: The length of BDW block is too big. Got 1895101420. Header: 240,244,243,240, offset: 0.
    at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderCommon.reportTooLargeBlockLength(RecordHeaderDecoderCommon.scala:53)
    at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderBdw.validateBlockLength(RecordHeaderDecoderBdw.scala:86)
    at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderBdw.getRecordLength(RecordHeaderDecoderBdw.scala:48)
yruslan commented 9 months ago

Hi,

The example file starts with 0xC3 0xB0 0xC3 0xB4 (which is the same as reported by the error message: Header: 240,244,243,240).

Please clarify how you parsed the file to get BDW=430 and RDW=426. Which bytes of the file did you use?

jaysara commented 9 months ago

I apologize, I made an error in uploading the file. The EBCDIC file with BDW and RDW headers is at https://raw.githubusercontent.com/jaysara/spark-cobol-jay/main/data/bdw-rdw-sample-ebcdic.dat and the ASCII equivalent of this file is at https://raw.githubusercontent.com/jaysara/spark-cobol-jay/main/data/bdw-rdw-sample.txt

Here are the read options that I use:

Dataset<Row> df1 =  spark.read()
                .format("za.co.absa.cobrix.spark.cobol.source")
                .option("copybook_contents", copybook)
                .option("encoding", "ebcdic")
                .option("record_format", "VB") // Variable length records
                .option("is_rdw_big_endian", "false")
                .option("is_bdw_big_endian", "false")
                .option("schema_retention_policy", "collapse_root")
                .option("bdw_adjustment", -4)
                .option("rdw_adjustment", -4)
                .load("/Users/jsaraiy/Sandbox/spark-cobol-jay/data/ebcdic-bdw-rdw.dat");

I get the following error:

Caused by: java.lang.IllegalStateException: The length of BDW block is too big. Got 1961947628. Header: 240,241,240,244, offset: 0.
    at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderCommon.reportTooLargeBlockLength(RecordHeaderDecoderCommon.scala:53)
    at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderBdw.validateBlockLength(RecordHeaderDecoderBdw.scala:86)

If I change the record_format from VB to V (.option("record_format", "V")), the program runs without error; however, it does not parse out the segments correctly. It all comes out as one row, like below:

    +--------------------+
    |             SEGMENT|
    +--------------------+
    |0100HEADER 3 NOT ...|
    +--------------------+

yruslan commented 9 months ago

Hi, the corrected files still have neither BDW nor RDW headers. BDW and RDW headers are binary fields, while your file contains only text fields.

More on BDW headers: https://www.ibm.com/docs/en/zos/2.1.0?topic=records-block-descriptor-word-bdw
More on RDW headers: https://www.ibm.com/docs/en/zos/2.1.0?topic=records-record-descriptor-word-rdw
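
To make the binary layout concrete, here is a minimal Python sketch (hypothetical file name and payloads) that builds a small VB block following the IBM layout linked above, where each header is 4 bytes: a big-endian 2-byte length that includes the header itself, followed by two zero bytes:

    import struct

    # Hypothetical record payloads, EBCDIC-encoded with the cp037 codec.
    records = [s.encode("cp037") for s in ("HELLO", "WORLD")]

    block_body = b""
    for payload in records:
        # RDW: 2-byte big-endian record length that includes the 4-byte RDW
        # itself, followed by two zero bytes.
        block_body += struct.pack(">H", len(payload) + 4) + b"\x00\x00" + payload

    # BDW: 2-byte big-endian block length that includes the 4-byte BDW itself,
    # followed by two zero bytes.
    bdw = struct.pack(">H", len(block_body) + 4) + b"\x00\x00"

    with open("vb_sample.dat", "wb") as f:  # hypothetical output file
        f.write(bdw + block_body)

Since the stored lengths include the 4-byte headers, a file built this way corresponds to the bdw_adjustment = -4 and rdw_adjustment = -4 options used earlier in this thread.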

If the file has variable-length records, there are a few options available.

From my experience, quite often the team that handles copying of data from the mainframe can adjust conversion options to include RDW headers. This is the most reliable way of getting the data as accurate as possible.
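
For example, if the extract includes RDW headers (but no BDW), the file would be read with record_format = V instead of VB. A PySpark sketch with placeholder paths, assuming an existing SparkSession and RDW values that include the 4-byte header as in the examples above:

    # Sketch: reading an extract that has RDW headers only (no BDW).
    df = spark.read.format("cobol") \
        .option("copybook", "/path/to/copybook.cpy") \
        .option("encoding", "ebcdic") \
        .option("record_format", "V") \
        .option("is_rdw_big_endian", "true") \
        .option("rdw_adjustment", -4) \
        .load("/path/to/data.dat")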