Open baskarangit opened 1 year ago
Hi @baskarangit,
Please use the options below for your file:
record_format=VB is_rdw_big_endian=true rdw_adjustment=-4 bdw_adjustment=-4 is_bdw_big_endian=true variable_size_occurs=true
Hi @sree018
I tried as you suggested but am still getting the same error, shown below. Kindly review and help me with this, please.
Error message:
java.lang.IllegalStateException: The length of BDW block is too big. Got 1153761292. Header: 196,197,0,12, offset: 0.
  at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderCommon.reportTooLargeBlockLength(RecordHeaderDecoderCommon.scala:53)
  at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderBdw.validateBlockLength(RecordHeaderDecoderBdw.scala:86)
Current code:
df = spark.read.format("cobol") \
    .option("copybook", "/cobrix/copybook/copybook.txt") \
    .option("record_format", "VB") \
    .option("is_rdw_big_endian", "true") \
    .option("rdw_adjustment", "-4") \
    .option("bdw_adjustment", "-4") \
    .option("is_bdw_big_endian", "true") \
    .option("variable_size_occurs", "true") \
    .load("/cobrix/data/datapat")
df.show()
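For reference, the number in the error above can be reproduced by decoding the four header bytes as a big-endian integer. A small sketch (the helper name here is made up, not Cobrix's actual code; the exact value reported matches the decoder also clearing the top bit, which is used as a flag in the BDW format):

```python
import struct

def decode_bdw_big_endian(header: bytes) -> int:
    """Decode a 4-byte big-endian BDW, clearing the top flag bit."""
    value = struct.unpack(">I", header)[0]
    return value & 0x7FFFFFFF

hdr = bytes([196, 197, 0, 12])     # the header bytes from the error message
print(decode_bdw_big_endian(hdr))  # 1153761292, the "too big" length reported
```

Notably, 0xC4 0xC5 are EBCDIC 'D' and 'E', which hints that the file may begin with character data rather than a block descriptor word.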
The error message says that BDW and RDW headers should contain 2 zero bytes either at the beginning or at the end. This is not happening in your case. Are you sure your file has the record format of VB? Can you post HEX of the first 10 bytes of your file?
Hi @yruslan ,
The tech document mentions a record length of 2868, so I tried the fixed-width options below, but they give me an error.
.option("record_format", "F").option("record_length", "2868")
Error : file:/cobrix/data/c1b_dmon/sourcedatafile_2022_09_27.TXT size (1932928) IS NOT divisible by 2868.
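That error is just arithmetic: the file size leaves a remainder when divided by the record length, so the file cannot consist purely of 2868-byte fixed-length records. A quick check:

```python
size, reclen = 1932928, 2868           # figures from the error message
full, leftover = divmod(size, reclen)
print(full, leftover)                  # 673 full records, 2764 bytes left over
```

The 2764 leftover bytes could be a file header/footer or a sign of a different record length; if it is a header or footer, the 'file_start_offset' / 'file_end_offset' options mentioned later in this thread can skip it.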
So I then decided to try the variable record length options, but that didn't help either and gives the error shown earlier:
df = spark.read.format("cobol") \
    .option("copybook", "/cobrix/copybook/copybook.txt") \
    .option("record_format", "VB") \
    .option("is_rdw_big_endian", "true") \
    .option("rdw_adjustment", "-4") \
    .option("bdw_adjustment", "-4") \
    .option("is_bdw_big_endian", "true") \
    .option("variable_size_occurs", "true") \
    .load("/cobrix/data/datapat")
df.show()
As you requested, I have provided an image of the source file below. Kindly take a look and let me know if I am missing something. Thanks in advance.
Screenshot :
I can't recognize either RDW or BDW blocks in the screenshot. So either your file is in the fixed record length format (F), or it uses some other encoding for record sizes. There is also a possibility that the file has a header and footer that need to be removed before treating it as fixed record length. You can use the 'file_start_offset' and 'file_end_offset' options to do that.
You can also check whether your data is decoded correctly using '.option("debug_ignore_file_size", "true")'. It allows parsing only a limited number of records from the beginning of the file (using .show(false), for instance), but will fail if you try to process the full file.
Unfortunately, I can't give you more specific advice. Before Cobrix can decode the data it needs a way of splitting the input file by records. And if it is in a non-standard format, you need to really understand how to do it. You can use custom record extractors if standard record formats do not work for you.
Hi @yruslan ,
Thanks for your response. I will try your suggestion and get back to you.
Could you help me with some examples of a custom record extractor? Meanwhile, I plan to review the following:
cobrix/examples/examples-collection/src/main/scala/com/example/spark/cobol/examples/parser/generators/
Thanks in advance.
Here is an example of a record extractor: https://github.com/AbsaOSS/cobrix/blob/a62c136266d1e46b7ebb17e1de1b1f10d9a2d878/spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/mocks/CustomRecordExtractorMock.scala#L40-L40
Basically, it just implements the 'next()' method. You have the input raw data as a simple stream of bytes, and your 'next()' method returns the next record as an array of bytes.
Input (byte stream):
abcdabcdabcdabcdabcdabcd
Output (records):
(abcd)(abcd)(abcd)(abcd)(abcd)(abcd)
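The same contract can be sketched in plain Python (for illustration only; a real Cobrix extractor is a Scala class implementing next() against Cobrix's raw record extractor interface):

```python
from io import BytesIO

def extract_records(stream, record_length=4):
    """Yield fixed-length records from a raw byte stream, mimicking next()."""
    while True:
        record = stream.read(record_length)
        if len(record) < record_length:
            break  # end of stream; drop any trailing partial record
        yield record

records = list(extract_records(BytesIO(b"abcdabcdabcdabcdabcdabcd")))
print(records)  # six b'abcd' records
```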
Hi @yruslan ,
I tried using option("debug_ignore_file_size", "true"). It is able to parse the files, but the dataframe does not have the proper structure (records are not loaded correctly into the dataframe). I used the syntax below:
df = spark.read.format("cobol") \
    .option("copybook", "/cobrix/copybook/copybook.txt") \
    .option("debug_ignore_file_size", "true") \
    .load("/cobrix/data/datapath")
df.show(3)
I also found out that the file record length is not fixed; it varies based on the value of one of the columns.
I have a scenario where the file record size is variable. A file record has a base segment (a constant 583 bytes) plus a variable segment whose size depends on the RDT-NO-POST-REASON value in the base segment.
In more detail: the base record is 583 bytes (bytes 0 to 583). The additional segment spans bytes 584 to 5636 and repeats 0-99 times, 50 bytes each, based on the value of RDT-NO-POST-REASON. RDT-NO-POST-REASON is an integer column in the base segment.
For example:
Record #1: base segment 583 bytes; RDT-NO-POST-REASON = 0, so there are no additional segments. Total record length: 583 + 0 = 583 bytes.
Record #2: base segment 583 bytes; RDT-NO-POST-REASON = 1, so there is 1 additional segment (50 bytes each). Total record length: 583 + 50 = 633 bytes.
Record #3: base segment 583 bytes; RDT-NO-POST-REASON = 10, so there are 10 additional segments (50 bytes * 10 = 500 bytes). Total record length: 583 + 500 = 1083 bytes.
So, depending on the additional segments, the record byte length can range from 583 to 5636 bytes. Can such a file be processed by Cobrix? If so, kindly share how this can be handled, along with any suggestions. Kindly review and let me know. Thanks in advance.
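The length rule described above boils down to a simple formula; a sketch in plain Python for illustration, using the sizes stated in the post:

```python
BASE_LEN = 583  # fixed base segment size, in bytes
SEG_LEN = 50    # size of each additional segment, in bytes

def record_length(rdt_no_post_reason: int) -> int:
    """Total record length for a given RDT-NO-POST-REASON occurrence count."""
    return BASE_LEN + SEG_LEN * rdt_no_post_reason

print(record_length(0), record_length(1), record_length(10))  # 583 633 1083
```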
You can try adding the variable segment field to the end of the copybook, with something like:
02 SEGMENT GROUP OCCURS 0 TO 100 TIMES DEPENDING ON RDT-NO-POST-REASON.
03 PAYLOAD PIC X(50).
and add this option:
.option("variable_size_occurs", "true")
But your use case inspired an idea: https://github.com/AbsaOSS/cobrix/issues/569
Maybe in the future it will make parsing these kinds of files easier.
@baskarangit
I received a similar file from a Fiserv system and was able to parse it in our systems.
File characteristics:
record_format=VB is_rdw_big_endian=true rdw_adjustment=-4 bdw_adjustment=-4 is_bdw_big_endian=true variable_size_occurs=true
I found a copybook similar to your copybook description. Please see the copybook in issue #259.
If you have any questions regarding your file, please reach me at sdama018@gmail.com.
Hi @yruslan, thanks for adding my scenario as a new idea on your board.
Hi @yruslan / @sree018 ,
I have updated my copybook with the DEPENDING ON keyword as below, but it gave me a different error.
Existing copybook:
02 VXT-ADDL-DATA-GROUP.
   03 VXT-ADDL-DATA OCCURS 99 TIMES.
      05 VXT-ADDL-SEG-KEY.
         10 VXT-ADDL-SEG-KEY-PROD PIC X(02).
         10 VXT-ADDL-SEG-KEY-TYPE PIC X(01).
      05 FILLER PIC X(47).
Updated copybook:
02 VXT-ADDL-DATA-GROUP.
   03 VXT-ADDL-DATA OCCURS 0 TO 99 TIMES DEPENDING ON VXT-ADDL-SEGS-NO.
      05 VXT-ADDL-SEG-KEY.
         10 VXT-ADDL-SEG-KEY-PROD PIC X(02).
         10 VXT-ADDL-SEG-KEY-TYPE PIC X(01).
      05 FILLER PIC X(47).
Error message :
Also, I noticed that my copybook has multiple record types, as below. Kindly guide me on how this file can be parsed. Thanks in advance.
01 REPORT-TAPE-DETAIL-RECORD.
   02 VXT-REC-CODE-BYTES.
01 VXT-RECORD-TYPE-STMT REDEFINES REPORT-TAPE-DETAIL-RECORD.
   02 VXT-BASE-REC PIC X(583).
   02 FILLER PIC X(5053).
01 VXT-RECORD-TYPE-DCX REDEFINES REPORT-TAPE-DETAIL-RECORD.
Describe the bug
Getting an exception while processing a variable-length source file with Cobrix.
BDW headers contain non-zero values where zeros are expected (check 'rdw_big_endian' flag). Header: 196,197,0,12, offset: 0.
Code snippet that caused the issue
Expected behavior
Parse the copybook and variable-length source files and display the results.
Context
Copybook (if possible)