jaysara opened 10 months ago
Yes, exactly. PIC X(1064) is going to be converted from EBCDIC to ASCII, including packed decimals, which are going to be corrupted by that conversion.
Note: you can use the maximum possible size in the PIC clause. The field will be automatically truncated for each record if it is bigger than the record size.
If packed decimals are at fixed positions, I'd recommend splitting the segment:
01 RECORD.
   05 SEGMENT1 PIC X(500).
   05 DECIMAL1 PIC S9(6)V99 COMP-3.
   * ...
   05 SEGMENT2 PIC X(500).
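For context, a minimal sketch of loading a file with such a split copybook (the paths are hypothetical; record_format = "V" assumes RDW-prefixed variable-length records):

val copybook =
  """       01  RECORD.
    |           05  SEGMENT1   PIC X(500).
    |           05  DECIMAL1   PIC S9(6)V99 COMP-3.
    |           05  SEGMENT2   PIC X(500).
    |""".stripMargin

val df = spark.read
  .format("cobol")
  .option("copybook_contents", copybook)
  .option("record_format", "V")          // variable-length, RDW-prefixed records
  .load("/path/to/ebcdic/file")          // hypothetical path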
Or you can get the entire record. This can be done in one of 3 ways:
By generating a field containing the full binary record (without the RDW itself):
.option("generate_record_bytes", "true")
This will create a 'Record_bytes' binary field that you can process using, say, a UDF, and extract the packed decimals that way. You will then have both 'SEGMENT' (the converted string) and the binary representation of the same record.
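For illustration, a sketch of a UDF that decodes one COMP-3 (packed decimal) value out of the raw record bytes; the field name 'Record_bytes', the offset 500, the length 5, and the scale 2 are assumptions matching the split-copybook example above:

import org.apache.spark.sql.functions.{col, udf}

// Packed decimal (COMP-3): two BCD digits per byte; the last nibble
// is the sign (0xD = negative, 0xC or 0xF = positive).
def decodeComp3(bytes: Array[Byte], offset: Int, len: Int, scale: Int): BigDecimal = {
  val field = bytes.slice(offset, offset + len)
  var unscaled = BigInt(0)
  for (i <- field.indices) {
    unscaled = unscaled * 10 + ((field(i) >> 4) & 0x0F)
    if (i < field.length - 1)             // the low nibble of the last byte is the sign
      unscaled = unscaled * 10 + (field(i) & 0x0F)
  }
  val signed = if ((field.last & 0x0F) == 0x0D) -unscaled else unscaled
  BigDecimal(signed) / BigDecimal(10).pow(scale)
}

// PIC S9(6)V99 COMP-3 occupies 5 bytes; assume it starts at offset 500
val comp3Udf = udf((bytes: Array[Byte]) => decodeComp3(bytes, 500, 5, 2))
val withDecimal = df.withColumn("DECIMAL1", comp3Udf(col("Record_bytes")))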
Or you can define the segment as binary (by adding the usage COMP):
05 SEGMENT1 PIC X(1064) COMP.
Or you can turn debugging on and generate a binary or hex representation of the SEGMENT field before the conversion. Despite the name, the debugging mode does not add any performance penalty:
.option("debug", "true")
.option("debug", "binary") // or "hex"
Thanks! That seems to be working.
Hi @yruslan, the program seems to be working fine with the above options. However, I am not able to achieve parallelization of my map functions on the DataFrame output once I get the Dataset.
That is strange; parallelism should be available for bigger files no matter the options. Parallelism for variable-length record files works by building a sparse index of record offsets in an initial pass, so the file can then be split into partitions that are processed in parallel.
Please list all the reader options again; I will try to figure out what might be causing this.
Also, what are your cluster and spark-submit parameters? Are you running on YARN or some other setup? What is the number of executors?
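For what it's worth, a generic Spark check (not Cobrix-specific) that often helps here is to verify how many partitions the reader produced, and to repartition before the map if there is only one (32 below is an arbitrary example value):

// If this prints 1, downstream map functions cannot run in parallel
println(df.rdd.getNumPartitions)

// Force a higher degree of parallelism before applying the map functions
val repartitioned = df.repartition(32)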
I am trying to parse an EBCDIC file for which I do not have a copybook. I do know whether it has an RDW and/or BDW. It is an old legacy-format file. We have written our own program that knows how to parse an individual record.
Is there a way that I can use the Cobrix library only to parse out an individual record as EBCDIC bytes? Once I get those bytes in an RDD, I can write my 'map' function to parse the individual segments. I have used the Cobrix library to get an individual record, with the following setup.
I have defined my copybook with a simple structure like the one below.
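The setup was presumably along these lines (a reconstructed sketch: the PIC size comes from the first reply, while the paths and reader options are assumptions):

       01  RECORD.
           05  SEGMENT    PIC X(1064).

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/record.cob")   // hypothetical path
  .option("record_format", "V")                // the file has RDW-prefixed records
  .load("/path/to/ebcdic/file")                // hypothetical path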
I am able to parse the records based on the RDW value correctly. I get a Row object with only one element in it (as specified in my copybook, with the name SEGMENT). This SEGMENT comes back as a string. I convert this string to the "ibm500" character set (back to EBCDIC) and parse it with the parsing program that we have written. Our program can parse the record based on byte positions.

However, we are not able to parse the packed decimals properly. It seems that the round-trip conversion between string and EBCDIC bytes is losing the positions and format. Is there a way for us to get the original raw bytes, exactly as they appear in the file, as part of the dataset we get out? In short, can the 'SEGMENT' field in my example represent the actual raw bytes of the entire record in the file?
Question
Is the above method an acceptable way to use this library? I like Cobrix's parallel processing for reading large EBCDIC files; all I need from the library is to parse the RDW/BDW values and return the entire record as raw bytes that I can feed to my own parsing logic, since I do not have a proper copybook for the byte segment.