Closed chandrasekaravr closed 5 years ago
Could you paste/send the whole copybook? That snippet is parsed correctly on my machine.
@yruslan I have pasted the whole copybook here. For the branch prepare_release_1_0_0, i had to remove the OF CARR-CLM-REC. from the DEPENDING ON
(replaced with a gist in the following comment)
Can you use pastebin or a gist? The formatting (including line breaks) is relevant for this issue.
Regarding the OF CARR-CLM-REC
part, I've never seen the OF
clause. Do you have some documentation to what it's supposed to do?
@tr11 here is the link for the gist, https://gist.github.com/chandrasekaravr/2cce4899b846e8bf82e02a5a077a54b0
The OF part refers to the group above that contains the count field. For now, it can be ignored. I have updated the gist with a fixed copybook.
I see, right now we just look for all the matching names so we'll need to add logic to the parser to handle the OF
clause. That said, if there are no name clashes, that clause becomes redundant so probably not a priority for now.
Take a look at the PR #173 -- the TO13
was a bug in the code. That PR solves this issue.
@tr11 Great. Thanks for taking care of the bug. The OF clause is not a blocker, as you said. Since there is no name clash, the parser is correctly identifying the group and creating an array of correct size. However, the group fields are empty. Any suggestions on that?
@tr11 @yruslan i did further investigation into why the line items addressed in the copybook as DEPENDING ON and OCCURS TO items not being parsed by debugging into the parser. Here are my findings, please let me know on how to proceed further. Or, if any fixes are required.
Please refer to lines 134 onwards in the copy book, https://gist.github.com/chandrasekaravr/2cce4899b846e8bf82e02a5a077a54b0
The section indicates the counts and the following are the line group items referred by depending on items, as such,
06 CARR-NCH-EDIT-CD-CNT PIC 99.
06 CARR-NCH-PATCH-CD-I-CNT PIC 99.
06 CARR-MCO-PRD-CNT PIC 9.
06 CARR-CLM-DEMO-ID-CNT PIC 9.
06 CARR-CLM-DGNS-CD-J-CNT PIC 99.
06 CARR-CLM-LINE-CNT PIC 99.
06 H-FILLER-9 PIC X(4).
04 CARR-CLM-VAR-GRP.
06 NCH-EDIT-GRP OCCURS 0 TO 13 TIMES
DEPENDING ON CARR-NCH-EDIT-CD-CNT.
08 NCH-EDIT-TRLR-IND-CD PIC X.
08 NCH-EDIT-CD PIC X(4).
06 NCH-PATCH-GRP OCCURS 0 TO 30 TIMES
DEPENDING ON CARR-NCH-PATCH-CD-I-CNT.
08 NCH-PATCH-TRLR-IND-CD
PIC X.
08 NCH-PATCH-CD PIC XX.
08 NCH-PATCH-APPLY-DT PIC 9(8).
Basically it has 6 following line group items,
06 NCH-EDIT-GRP OCCURS 0 TO 13 TIMES DEPENDING ON CARR-NCH-EDIT-CD-CNT. .... 06 NCH-PATCH-GRP OCCURS 0 TO 30 TIMES DEPENDING ON CARR-NCH-PATCH-CD-I-CNT. ... 06 MCO-PRD-GRP OCCURS 0 TO 2 TIMES DEPENDING ON CARR-MCO-PRD-CNT. ... 06 CLM-DEMO-ID-GRP OCCURS 0 TO 5 TIMES DEPENDING ON CARR-CLM-DEMO-ID-CNT. ... 06 CARR-CLM-DGNS-GRP OCCURS 0 TO 12 TIMES DEPENDING ON CARR-CLM-DGNS-CD-J-CNT. ... 06 CARR-LINE-GRP OCCURS 1 TO 13 TIMES DEPENDING ON CARR-CLM-LINE-CNT. ...
Note: All of these line items are optional as the min count is 0.
Now, the parser built the indexes from the copybook as show below (highlighted items of interest),
6 CARR_NCH_EDIT_CD_CNT D 102 1579 1580 2
6 CARR_NCH_PATCH_CD_I_CNT D 103 1581 1582 2
6 CARR_MCO_PRD_CNT D 104 1583 1583 1
6 CARR_CLM_DEMO_ID_CNT D 105 1584 1584 1
6 CARR_CLM_DGNS_CD_J_CNT D 106 1585 1586 2
6 CARR_CLM_LINE_CNT D 107 1587 1588 2
6 H_FILLER_9 108 1589 1592 4
4 CARR_CLM_VAR_GRP 248 1593 24193 22601 6 NCH_EDIT_GRP [] 112 1593 1657 65 ... ... 6 CARR_CLM_DGNS_GRP [] 131 2152 2259 108 8 NCH_DGNS_TRLR_IND_CD 129 2152 2152 1 8 CLM_DGNS_VRSN_CD 130 2153 2153 1 8 CLM_DGNS_CD 131 2154 2160 7 6 CARR_LINE_GRP [] 247 2260 24190 21931 8 NCH_LINE_TRLR_IND_CD 133 2260 2260 1
When i parsed my 1 record datafile with this copybook, it correctly identified the counts, the top 4 line group items as 0 count and the last 2 items having some item.
"CARR_NCH_EDIT_CD_CNT": 0,
"CARR_NCH_PATCH_CD_I_CNT": 0,
"CARR_MCO_PRD_CNT": 0,
"CARR_CLM_DEMO_ID_CNT": 0,
"CARR_CLM_DGNS_CD_J_CNT": 2,
"CARR_CLM_LINE_CNT": 1,
"H_FILLER_9": "0000"
And it filled the array for those line items as per the count, empty array for 0 count. But, group object with no values for the fields,
"CARR_CLM_VAR_GRP": { "NCH_EDIT_GRP": [], "NCH_PATCH_GRP": [], "MCO_PRD_GRP": [], "CLM_DEMO_ID_GRP": [], "CARR_CLM_DGNS_GRP": [{ "NCH_DGNS_TRLR_IND_CD": "", "CLM_DGNS_VRSN_CD": "", "CLM_DGNS_CD": "" }
Issue or bug: The issue or the bug that i identified with the debugger is that, the data file do not have any bytes for the empty line group items and just has the data bytes for the items that are supposed to be present. So, the indexes identified by and used by the parser to offset and extract value are incorrect. In this case, 6 NCH_EDIT_GRP [] 112 1593 1657 65 ... ... 6 CARR_CLM_DGNS_GRP [] 131 2152 2259 108
it is looking for the 0 count item _NCH_EDITGRP at index 1593 1657 and 2 count item _CARR_CLM_DGNSGRP at index 2152 2259.
However, the data file has no bytes for _NCH_EDITGRP and the data for _CARR_CLM_DGNSGRP is starting at index 1593 1657 instead of 2152 2259.
Thanks for the analysis. It is very helpful for our understanding of the issue.
Yes, currently if an array is present in a data file it is expected to always occupy the same space as it would take to store the maximum number of elements of such an array. Even if a particular instance of an array has 0 elements, the record is expected to have bytes for that array filled.
Up until now, we haven't encountered any copybooks that contain variable size arrays that shrink depending on the number of array elements. We can add support for such arrays in one of the future releases.
Just a couple of questions to clarify.
If you could provide a simplified example of a copybook and a data file so we are sure our implementation is aligned to your expectations, it would make it much easier for us.
@yruslan Thanks for the response.
For 1: Thats right. No bytes are present if the array count is 0
For 2: The records contains bytes only for the array count, that is 1. It doesnt contain bytes for the remaining 2 of the max(3)
In the sense, is it treating a VariableLength record same as FixedLength record?
Great. So the above is true for every array that has 'depending on' clause in the copybook, right?
We might implement it as an option for spark-cobol
, e.g. .option(...)
.
Is this ever not true? Every variable copybook/data file I have access to exhibits the exact same behavior for "occurs depending on" clauses.
Interesting. It is always false (the current behavior) for our copybooks.
I guess an option would be great then!
An option would be great to switch between FixedLength and VariableLength.
@tr11 @yruslan Thanks for looking into this. I will follow this enhancement. I have updated this gist with the sample data file, https://gist.github.com/chandrasekaravr/2cce4899b846e8bf82e02a5a077a54b0
It has the copybook and a sample data file that is of variable length and has RDW.
👍 Thanks for the data!
Please, try this snapshot and let me know if it worked for you:
<dependency>
<groupId>za.co.absa.cobrix</groupId>
<artifactId>spark-cobol</artifactId>
<version>1.0.1-SNAPSHOT</version>
</dependency>
You also need to use this option:
.option("variable_size_occurs", "true")
@yruslan Thank you. Will try this out.
@yruslan Thanks for the quick enhancement. I tried the 1.0.1.SNAPSHOT. And it parsed out the variable length data file with the varying array sizes beautifully.
I have a question with respect to multisegment data file processing. I experienced the same with your test5
test data also. This is what i observed,
Is this the expected result?
Happy the variable size arrays worked for you. Going to release 1.0.1 today with these changes.
Yes, this is the expected behavior since Copybooks do not contain information on how to distinguish between segments.
You can use redefine-segment-id-map
set of options to specify a mapping between segment redefines and segment ids. If you do so Cobrix will parse proper segments automatically. The usage is described in README, and also you can take a look at the same Test5 for an example. Look for:
.option("redefine_segment_id_map:0", "STATIC-DETAILS => C,D")
.option("redefine-segment-id-map:1", "CONTACTS => P")
@yruslan Thanks for the tip. It worked perfectly. I was using the _segmentfilter and _segmentfield` initially.
@yruslan Based on your suggestion from https://github.com/AbsaOSS/cobrix/issues/156#issuecomment-522518017, i tried with the prepare_release_1_0_0
This version is failing (with NumberFormatException) to parse the OccursTo section of my copybook,
04 CARR-CLM-VAR-GRP. 06 NCH-EDIT-GRP OCCURS 0 TO 13 TIMES DEPENDING ON CARR-NCH-EDIT-CD-CNT. 08 NCH-EDIT-TRLR-IND-CD PIC X. 08 NCH-EDIT-CD PIC X(4).
It is trying to parse the String TO13 instead of just 13
As an alternate: I initially tried with the current release tag 0.5.6. And as pointed out in the above linked issue, the RDW is correctly decoded and all other sections are getting decoded, except for the OCCURS...DEPENDING ON... sections. It is correctly identifying the count value from CARR-NCH-EDIT-CD-CNT and adds that many number repeating groups in the output. However, all the fields in the groups are empty.