File start/end offset issue for VB file

AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

Apache License 2.0


File start/end offset issue for VB file #647

Open D3v3sh5ingh opened 10 months ago

D3v3sh5ingh commented 10 months ago

Hi @yruslan

Related issue: 643

The file_start_offset and file_end_offset options do not work for VB files and throw the same error as reported in issue 643. I have a file with both RDW and BDW (record format VB). The file also has a header and a footer. I want to skip the first few bytes (the header) and the last few bytes (the footer). I am using the file_start_offset and file_end_offset options for that, but I get an error similar to the one in issue 643.

yruslan commented 10 months ago

Hi @D3v3sh5ingh, what's your high-level offset layout?

For example:
0 - 19       Headers (to be ignored)
20 - 23      BDW
24 - 27      RDW
28 - 99      Payload
100 - 193    RDW
...
32000        Payload
32093        Footer (to be ignored)

D3v3sh5ingh commented 10 months ago

Hi @yruslan, my high-level layout looks like this:
BDW { RDW 45 bytes, RDW 1000 bytes, RDW 1000 bytes, RDW 1000 bytes ... }
BDW { RDW 1000 bytes ... }
...
BDW { RDW 1000 bytes ..., RDW 45 bytes }

The 45-byte header and trailer are inside the BDW blocks, as shown above. We want to remove this 45-byte header and trailer from the file.

yruslan commented 10 months ago

file_start_offset and file_end_offset work at the file level, e.g. in cases like:
HEADER {45 bytes} BDW { RDW 1000 bytes, RDW 1000 bytes, RDW 1000 bytes, RDW 1000 bytes ... }

Since your 45-byte header is part of a record payload, you can't remove it using these options. What you can do instead is add the header as a redefine segment in your copybook and then filter it out after you get the dataframe.

The copybook will look like this:

01  RECORD.
    05  HEADER.
        10  CONTENT  PIC X(45).
    05  PAYLOAD REDEFINES HEADER.
        ... your payload goes at level 10 here

D3v3sh5ingh commented 10 months ago

Hi, the screenshot below shows sample output for my file. The 45 bytes that I want to skip are only at the start and at the end of the file, not in each record. If I don't use file_start_offset and file_end_offset, I get the dataframe shown in the screenshot, but with two extra records (the header and the trailer). If I use these options with 45 bytes, I get an error ("length of BDW block is too big").

(Attached screenshot: IMG-20231130-WA0007)

yruslan commented 10 months ago

Options 'file_start_offset' and 'file_end_offset' only drop raw bytes from the beginning or the end of files, before any BDW or RDW is parsed. They do not drop bytes from record payloads. This is the expected behavior.
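To illustrate why a 45-byte start offset fails on this layout, here is a minimal pure-Python sketch. It does not reproduce Cobrix's parser. It assumes a common VB convention where both BDW and RDW are 4 bytes, with a 2-byte big-endian length that includes the descriptor itself; real files may differ. The helper names (`with_length_prefix`, `make_block`, `read_records`) are hypothetical and exist only for this example.

```python
import struct
from typing import List

DESCRIPTOR_LEN = 4  # assumption: BDW and RDW are both 4 bytes long


def with_length_prefix(body: bytes) -> bytes:
    """Prepend a 4-byte descriptor: big-endian length (descriptor included) + 2 zero bytes."""
    return struct.pack(">HH", len(body) + DESCRIPTOR_LEN, 0) + body


def make_block(payloads: List[bytes]) -> bytes:
    """Wrap every payload in an RDW, then wrap all records in a single BDW."""
    return with_length_prefix(b"".join(with_length_prefix(p) for p in payloads))


def read_records(data: bytes, start_offset: int = 0) -> List[bytes]:
    """Walk BDW blocks and the RDW records inside them, starting at start_offset."""
    records = []
    pos = start_offset
    while pos < len(data):
        (block_len,) = struct.unpack_from(">H", data, pos)
        block_end = pos + block_len
        if block_len < DESCRIPTOR_LEN or block_end > len(data):
            raise ValueError(f"BDW at offset {pos} declares invalid block length {block_len}")
        pos += DESCRIPTOR_LEN
        while pos < block_end:
            (rec_len,) = struct.unpack_from(">H", data, pos)
            if rec_len < DESCRIPTOR_LEN or pos + rec_len > block_end:
                raise ValueError(f"RDW at offset {pos} declares invalid record length {rec_len}")
            records.append(data[pos + DESCRIPTOR_LEN:pos + rec_len])
            pos += rec_len
    return records


# Synthetic file shaped like the layout in this thread: the 45-byte header and
# trailer are ordinary records wrapped in RDWs inside the first and last block.
header, trailer, row = b"H" * 45, b"T" * 45, b"D" * 1000
data = make_block([header, row, row]) + make_block([row, trailer])

records = read_records(data)
print([len(r) for r in records])  # [45, 1000, 1000, 1000, 45]

# Skipping 45 raw bytes lands inside the header payload, so the next bytes
# are misread as a BDW and the declared block length is nonsense.
try:
    read_records(data, start_offset=45)
except ValueError as err:
    print("start offset 45 fails:", err)
```

With no offset the header and trailer come back as two extra records, which matches the behavior described in the thread. With an offset of 45 the walker starts in the middle of the header payload, which is the same kind of failure as the BDW length error reported above.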

There are no options that allow dropping bytes from inside records, so possible solutions are:

1. If you need to keep these special 45-byte records, you can use the modified copybook solution above.
2. (Probably your case) If you want to ignore these special 45-byte records, just remove them in post-processing, e.g. df.filter(col("COL1").isNotNull)
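The post-processing filter suggested above could look like this in PySpark. This is a sketch under assumptions: the function name and its parameters are hypothetical, `COL1` stands for any field that is populated in real data records but decodes to null for the 45-byte header and trailer, and the spark-cobol package is assumed to be available to the Spark session. Note that in Python `isNotNull` is a method call.

```python
def read_vb_without_header_trailer(spark, copybook_path, data_path, key_column="COL1"):
    """Read a VB (BDW + RDW) file with Cobrix and drop rows where key_column is null.

    key_column is an assumption: choose a field that is always populated in
    real data records and null in the header and trailer records.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql.functions import col

    df = (
        spark.read.format("cobol")
        .option("copybook", copybook_path)
        .option("record_format", "VB")  # no file_start_offset / file_end_offset here
        .load(data_path)
    )
    # The header and trailer arrive as ordinary rows; filter them out afterwards.
    return df.filter(col(key_column).isNotNull())
```

Because the header and trailer are valid RDW records, the file is read without any offset options and the unwanted rows are removed from the resulting dataframe.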