AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0

Variable Length file processing error #566

Open baskarangit opened 1 year ago

baskarangit commented 1 year ago

Describe the bug

Getting an exception while processing a variable-length source file with Cobrix:

BDW headers contain non-zero values where zeros are expected (check 'rdw_big_endian' flag. Header: 196,197,0,12, offset: 0.

Code snippet that caused the issue

df = spark.read.format("cobol").option("copybook", "/cobrix/copybook/copybook.txt")\
        .option("record_format", "VB").option("is_rdw_big_endian", "false")\
        .load("/cobrix/data/datapath")
df.show()

Expected behavior

Parse the copybook and the variable-length source file and display the results.

Context

Copybook (if possible)


            01  REPORT-TAPE-DETAIL-RECORD.                                   
           02  VXT-REC-CODE-BYTES.                                      00000130
               03  VXT-REC-CODE-KEY              PIC X.                 00000140
               03  VXT-REC-TYPE-CONTROL          PIC X.                 00000150
01.021         03  VXT-XFER-KEY.                                        00000160
01.021             04  VXT-NO-POST-REASON        PIC S9(3)V      COMP-3.00000170
           02  VXT-SPECIAL-ACCT-BYTES.                                  00000180
               03  VXT-SC-1                      PIC X.                 00000190
               03  VXT-SC-2                      PIC X.                 00000200
               03  VXT-SC-3                      PIC X.                 00000210
               03  VXT-SC-4                      PIC X.                 00000220
               03  VXT-SC-5                      PIC X.                 00000230
               03  VXT-SC-6                      PIC X.                 00000240
               03  VXT-SC-7                      PIC X.                 00000250
               03  VXT-SC-8                      PIC X.                 00000260
           02  VXT-RPT-FULL-ACCT-NO.                                    00000270
               03  VXT-RPT-SYSTEM-BANK.                                 00000280
                   04  VXT-RPT-SYSTEM-NO         PIC XXXX.              00000290
                   04  VXT-RPT-BANK-NO.                                 00000300
                       05  VXT-RPT-PRIN-BANK     PIC XXXX.              00000310
                       05  VXT-RPT-AGENT-BANK    PIC XXXX.              00000320
               03  VXT-RPT-RECEIPT-NUMBER        PIC X(16).             00000330
           02  VXT-RECEIPT-CODE              PIC S9(3)V      COMP-3.00000340
           02  VXT-MRCH-FULL-ACCT-NO.                                   00000350

Source file:
![image](https://user-images.githubusercontent.com/59495372/211696338-20165219-816c-433c-8db6-8088987d66ea.png)
sree018 commented 1 year ago

Hi @baskarangit,

Please use the options below for your file:

    record_format=VB
    is_rdw_big_endian=true
    rdw_adjustment=-4
    bdw_adjustment=-4
    is_bdw_big_endian=true
    variable_size_occurs=true

baskarangit commented 1 year ago

Hi @sree018

I tried as you suggested but am getting the same error below. Kindly review and help me with this, please.

Error message:

    java.lang.IllegalStateException: The length of BDW block is too big. Got 1153761292. Header: 196,197,0,12, offset: 0.
        at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderCommon.reportTooLargeBlockLength(RecordHeaderDecoderCommon.scala:53)
        at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderBdw.validateBlockLength(RecordHeaderDecoderBdw.scala:86)

Current code:

    df = spark.read.format("cobol").option("copybook", "/cobrix/copybook/copybook.txt")\
        .option("record_format", "VB").option("is_rdw_big_endian", "true")\
        .option("rdw_adjustment", "-4").option("bdw_adjustment", "-4")\
        .option("is_bdw_big_endian", "true").option("variable_size_occurs", "true")\
        .load("/cobrix/data/datapat").show()


yruslan commented 1 year ago

The error message says that BDW and RDW headers should contain 2 zero bytes either at the beginning or at the end. This is not happening in your case. Are you sure your file has the record format of VB? Can you post HEX of the first 10 bytes of your file?
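
For illustration, the standard z/OS convention is a 4-byte header: a 2-byte block/record length (which includes the 4-byte header itself) and 2 zero bytes, with the position of the zeros depending on endianness. Using a hypothetical length of 100 (0x64):

    00 64 00 00   big-endian header: length in bytes 0-1, zeros in bytes 2-3
    00 00 64 00   little-endian header: zeros in bytes 0-1, length in bytes 2-3

The header from the error above, 196,197,0,12 (hex C4 C5 00 0C), has non-zero bytes where zeros are expected under either interpretation, which is what triggers the exception.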

baskarangit commented 1 year ago

Hi @yruslan ,

The tech document mentions the record length as "2868", so I tried the fixed-width option below, but it gives me an error.

.option("record_format", "F").option("record_length", "2868")

Error: file:/cobrix/data/c1b_dmon/sourcedatafile_2022_09_27.TXT size (1932928) IS NOT divisible by 2868.

So I then tried the variable record length options, but even that didn't help and gives an error:

    df = spark.read.format("cobol").option("copybook", "/cobrix/copybook/copybook.txt")\
        .option("record_format", "VB").option("is_rdw_big_endian", "true")\
        .option("rdw_adjustment", "-4").option("bdw_adjustment", "-4")\
        .option("is_bdw_big_endian", "true").option("variable_size_occurs", "true")\
        .load("/cobrix/data/datapat").show()

As you requested, I have provided an image of the source file below. Kindly take a look and let me know if I am missing something. Thanks in advance.

Screenshot: [image attachment]

yruslan commented 1 year ago

I can recognize neither RDW nor BDW blocks in the screenshot. So either your file is in the fixed record length format (F), or it uses some other encoding for record sizes. There is also a possibility that the file has a header and a footer that need to be removed before treating it as fixed record length. You can use the 'file_start_offset' and 'file_end_offset' options to do that.

You can also check whether your data is decoded correctly using '.option("debug_ignore_file_size", "true")'. It allows parsing only a limited number of records from the beginning of the file (using .show(false), for instance), but it will fail if you try to process the full file.
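
For example, here is a sketch (in Scala) combining these options with the fixed record length attempt from above; the offset values are hypothetical placeholders that would need to match your file's actual header and footer sizes:

    val df = spark.read
      .format("cobol")
      .option("copybook", "/cobrix/copybook/copybook.txt")
      .option("record_format", "F")
      .option("record_length", "2868")
      .option("file_start_offset", "100")       // hypothetical header size in bytes
      .option("file_end_offset", "100")         // hypothetical footer size in bytes
      .option("debug_ignore_file_size", "true") // debugging only; remove for full runs
      .load("/cobrix/data/datapath")

    df.show(3, false)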

Unfortunately, I can't give you more specific advice. Before Cobrix can decode the data, it needs a way of splitting the input file into records, and if the file is in a non-standard format, you really need to understand how to do that. You can use a custom record extractor if the standard record formats do not work for you.

baskarangit commented 1 year ago

Hi @yruslan ,

Thanks for your response. I will try your suggestion and get back on this.

Could you help me with some examples of a custom record extractor? I have the link below that I plan to review:

cobrix/examples/examples-collection/src/main/scala/com/example/spark/cobol/examples/parser/generators/

Thanks in advance.

yruslan commented 1 year ago

Here is an example of a record extractor: https://github.com/AbsaOSS/cobrix/blob/a62c136266d1e46b7ebb17e1de1b1f10d9a2d878/spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/mocks/CustomRecordExtractorMock.scala#L40-L40

Basically, it just implements the 'next()' method. You have input raw data as a simple stream of bytes, and your 'next()' method returns the next record as an array of bytes.

Input (byte stream):
abcdabcdabcdabcdabcdabcd

Output (records):
(abcd)(abcd)(abcd)(abcd)(abcd)(abcd)
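
A minimal sketch of such an extractor for the 4-byte example above (assuming the RawRecordExtractor trait and the RawRecordContext fields used by the linked mock; the class and package names here are hypothetical):

    package com.example

    import za.co.absa.cobrix.cobol.reader.extractors.raw.{RawRecordContext, RawRecordExtractor}

    // Splits the raw byte stream into fixed 4-byte records, like the "abcd" example.
    class FourByteRecordExtractor(ctx: RawRecordContext)
      extends Serializable with RawRecordExtractor {

      // Current byte position in the input stream
      override def offset: Long = ctx.inputStream.offset

      // A record is available while the stream still has bytes left
      override def hasNext: Boolean = !ctx.inputStream.isEndOfStream

      // Returns the next record: here, simply the next 4 bytes
      override def next(): Array[Byte] = ctx.inputStream.next(4)
    }

It would then be wired in with .option("record_extractor", "com.example.FourByteRecordExtractor") on the read.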
baskarangit commented 1 year ago

Hi @yruslan ,

I tried using option("debug_ignore_file_size", "true"). It is able to parse the files, but the dataframe does not have the proper structure (records are not loaded correctly into the dataframe). I have used the syntax below:

    df = spark.read.format("cobol").option("copybook", "/cobrix/copybook/copybook.txt")\
        .option("debug_ignore_file_size", "true")\
        .load("/cobrix/data/datapath").show(3)

Also, I came to know that the file record length is not fixed; it varies based on the value of one of the columns.

I have a scenario where the file record size is variable. A file record consists of a base segment (a constant 583 bytes) plus a variable segment (based on the RDT-NO-POST-REASON value in the base segment).

In more detail: the base segment is 583 bytes (bytes 0 to 583). The additional segments span bytes 584 to 5636; they repeat 0-99 times, 50 bytes each, based on the value of RDT-NO-POST-REASON. RDT-NO-POST-REASON is an integer column in the base segment.

For example:

Record #1: base segment 583 bytes; RDT-NO-POST-REASON = 0, so there are no additional segments. Total record length: 583 + 0 = 583 bytes.

Record #2: base segment 583 bytes; RDT-NO-POST-REASON = 1, so there is 1 additional segment (50 bytes each). Total record length: 583 + 50 = 633 bytes.

Record #3: base segment 583 bytes; RDT-NO-POST-REASON = 10, so there are 10 additional segments (50 bytes * 10 = 500 bytes). Total record length: 583 + 500 = 1083 bytes.

So, based on the additional segments, the record byte length can range from 583 to 5636 bytes. Kindly help me: can such a file be handled by Cobrix? If so, kindly share how this can be done and any suggestions on this.

Kindly review and let me know. Thanks in advance.

yruslan commented 1 year ago

You can try adding the variable segment field to the end of the copybook with something like:

      02 SEGMENT-GROUP OCCURS 0 TO 100 TIMES DEPENDING ON RDT-NO-POST-REASON.
         03 PAYLOAD PIC X(50).

and add this option:

.option("variable_size_occurs", "true")
yruslan commented 1 year ago

But your use case inspired an idea: https://github.com/AbsaOSS/cobrix/issues/569

Maybe in the future it will make parsing these kinds of files easier.

sree018 commented 1 year ago

@baskarangit

I received a similar file from a Fiserv system and was able to parse it in our systems.

File characteristics:

    record_format=VB
    is_rdw_big_endian=true
    rdw_adjustment=-4
    bdw_adjustment=-4
    is_bdw_big_endian=true
    variable_size_occurs=true

I found a copybook that is similar to your copybook description. Please see the copybook in issue #259.

If you have any questions regarding your file, please reach me at sdama018@gmail.com.

baskarangit commented 1 year ago

Hi @yruslan, thanks for adding my scenario as a new idea on your board.

Hi @yruslan / @sree018 ,

I have updated my copybook with the DEPENDING ON keyword as below, but it gave me a different error.

Existing copybook:

    1.034A     02  VXT-ADDL-DATA-GROUP.                               00002510
    1.047B         03  VXT-ADDL-DATA OCCURS 99 TIMES.                 00002520
    1.034A             05  VXT-ADDL-SEG-KEY.                          00002530
    1.038A                 10  VXT-ADDL-SEG-KEY-PROD  PIC X(02).      00002540
    1.038A                 10  VXT-ADDL-SEG-KEY-TYPE  PIC X(01).      00002550
    1.038A             05  FILLER                     PIC X(47).      00002560

Updated copybook:

    1.034A     02  VXT-ADDL-DATA-GROUP.                               00002510
    1.047B         03  VXT-ADDL-DATA OCCURS 0 TO 99 TIMES
                       DEPENDING ON VXT-ADDL-SEGS-NO.                 00002520
    1.034A             05  VXT-ADDL-SEG-KEY.                          00002530
    1.038A                 10  VXT-ADDL-SEG-KEY-PROD  PIC X(02).      00002540
    1.038A                 10  VXT-ADDL-SEG-KEY-TYPE  PIC X(01).      00002550
    1.038A             05  FILLER                     PIC X(47).      00002560

Error message: [image attachment]

Also, I noticed that my copybook has multiple record types, as shown below. Kindly guide me on how this file can be parsed. Thanks in advance.

    01  REPORT-TAPE-DETAIL-RECORD.
        02  VXT-REC-CODE-BYTES.                                  00000130
    01  VXT-RECORD-TYPE-STMT        REDEFINES                    00002670
                                REPORT-TAPE-DETAIL-RECORD.       00002680
        02  VXT-BASE-REC                      PIC X(583).        00002690
        02  FILLER                            PIC X(5053).       00002700
    01  VXT-RECORD-TYPE-DCX         REDEFINES                    00002710
                                REPORT-TAPE-DETAIL-RECORD.       00002720
yruslan commented 1 year ago