AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0

Can I get the raw record bytes from ebcdic file w/out parsing #656

Open jaysara opened 7 months ago

jaysara commented 7 months ago

I am trying to parse an EBCDIC file for which I do not have a copybook. I do know whether it has an RDW and/or a BDW. It is an old legacy-format file, and we have written our own program that knows how to parse an individual record.

Is there a way I can use the Cobrix library only to get each individual record as EBCDIC bytes? Once I get those bytes in an RDD, I can write my own 'map' function to parse the individual segments. I have used the Cobrix library to get an individual record with the following setup.

I have defined my copybook in a simple structure like the one below:

String copybook =
                "        01  RECORD.\n" +
                        "           05  SEGMENT                   PIC X(1064).\n" ;

Dataset<Row> df1 =  spark.read()
                .format("za.co.absa.cobrix.spark.cobol.source")
                .option("copybook_contents", copybook)
                .option("encoding", "ebcdic")
                .option("record_format", "V") // Variable length records
                .option("is_rdw_big_endian", "true")
                .option("rdw_adjustment", -4)
                .option("schema_retention_policy", "collapse_root")
                .load("data/Samples/sample_packed_variable_ebcdic_bigendian_rdw.dat");

I am able to parse the records based on the RDW value correctly. I get a Row object with only one element in it (as specified in my copybook with the name SEGMENT). This SEGMENT comes back as a string. I convert this string to the "ibm500" character set (back to EBCDIC) and parse it with the parsing program that we have written. Our program can parse the record based on byte positions. However, we are not able to parse the packed decimals properly. It seems that the conversion from string to/from EBCDIC bytes loses the positions and format. Is there a way for us to get the original raw bytes, exactly as they appear in the file, as part of the dataset we get out? In short, can the SEGMENT field in my example represent the actual raw bytes of the entire record in the file?
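For illustration (this snippet is not part of the original question), a minimal sketch of the round trip described above, assuming the df1 dataset from the earlier read; the variable names and the use of a typed map are just one way to express it:

    import java.nio.charset.Charset;

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;

    // Re-encode each SEGMENT string back to EBCDIC bytes for the in-house parser.
    Dataset<byte[]> segmentBytes = df1.map(
            (MapFunction<Row, byte[]>) row ->
                    row.<String>getAs("SEGMENT").getBytes(Charset.forName("IBM500")),
            Encoders.BINARY());

    // Character data survives this round trip, but packed-decimal (COMP-3) bytes
    // generally do not: the record has already been decoded as text by the reader,
    // so non-character byte values are changed before this point.

The re-encoding itself is straightforward; the data loss happens earlier, when the whole record is decoded as text.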

Question

Is the above method an acceptable way to use this library? I like Cobrix's approach to parallel processing when reading large EBCDIC files; all I want from this library is to parse the RDW/BDW values and return each entire record as raw bytes that I can feed to my own parsing logic, since I do not have a proper copybook for the byte segment.

yruslan commented 7 months ago

Yes, exactly. PIC X(1064) is going to be converted from EBCDIC to ASCII, including the packed decimals, which will be corrupted because of that.

Note: you can use the maximum possible size in the PIC clause. The field will be automatically truncated for each record if the declared size is bigger than the record size.

If packed decimals are at fixed positions, I'd recommend splitting the segment:

      01  RECORD.
            05  SEGMENT1                   PIC X(500).
            05  DECIMAL1                   PIC S9(6)V99 COMP-3.
   ****** ...
            05  SEGMENT2                   PIC X(500).

Or you can get the entire record. This can be done in one of 3 ways:

  1. By generating a field containing the full binary record (without RDW itself)

    .option("generate_record_bytes", "true")

    This will create a 'Record_Bytes' binary field that you can process using, say, a UDF, and that way extract the packed decimals. You will then have both 'SEGMENT' (the converted string) and a binary representation of the same record (see the sketch after this list).

  2. Or you can define the segment as binary (by adding the usage COMP):

              05  SEGMENT1                   PIC X(1064) COMP.
  3. Or you can turn debugging on and generate a binary or hex representation of the SEGMENT field before the conversion. Despite the name, the debugging mode does not add any performance penalty:

    .option("debug", "true")
    .option("debug", "binary") // or "hex"
jaysara commented 7 months ago

Thanks! That seems to be working.

jaysara commented 7 months ago

Hi @yruslan, the program seems to be working fine with the above options. However, I am not able to achieve parallelization of my map functions on the dataframe output. Once I get the Dataset after reading, I need to call a map function on each Row; this map function parses the bytes in the Row and creates a Dataset. This works fine with smaller files. However, for a 1 GB file everything runs in a single-threaded fashion and I did not see any parallelism: the number of partitions was always 1. Is there anything I should pass as a read option? Any idea how I can achieve parallelization after reading, and why I get only one partition even for a 1 GB file? I ran the program both locally and on a cluster and saw only one partition.
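For reference (not from the thread), a minimal plain-Spark sketch of how one might check the partitioning of the Cobrix read and force a spread before the per-record map; df1 is the dataset from the earlier read and the partition count 32 is arbitrary:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Check how many partitions the read produced.
    System.out.println("Partitions after read: " + df1.rdd().getNumPartitions());

    // If the read yields a single partition, the rows can still be spread across
    // executors before the expensive per-record map, at the cost of a shuffle.
    Dataset<Row> spread = df1.repartition(32);

This only redistributes rows after reading; it does not explain why the read itself produced a single partition.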

yruslan commented 7 months ago

That is strange; parallelism should be available for bigger files no matter the options, given the way parallelism works with variable-length records.

Please list all the reader options again; I will try to figure out what might be causing this.

Also, what are your cluster and spark-submit parameters? Are you running on YARN or some other setup? What is the number of executors?