AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0
136 stars 78 forks source link

ParserVisitor fails parsing OccursTo #172

Closed chandrasekaravr closed 5 years ago

chandrasekaravr commented 5 years ago

@yruslan Based on your suggestion from https://github.com/AbsaOSS/cobrix/issues/156#issuecomment-522518017, i tried with the prepare_release_1_0_0

This version is failing (with NumberFormatException) to parse the OccursTo section of my copybook, 04 CARR-CLM-VAR-GRP. 06 NCH-EDIT-GRP OCCURS 0 TO 13 TIMES DEPENDING ON CARR-NCH-EDIT-CD-CNT. 08 NCH-EDIT-TRLR-IND-CD PIC X. 08 NCH-EDIT-CD PIC X(4).

It is trying to parse the String TO13 instead of just 13

As an alternate: I initially tried with the current release tag 0.5.6. And as pointed out in the above linked issue, the RDW is correctly decoded and all other sections are getting decoded, except for the OCCURS...DEPENDING ON... sections. It is correctly identifying the count value from CARR-NCH-EDIT-CD-CNT and adds that many number repeating groups in the output. However, all the fields in the groups are empty.

tr11 commented 5 years ago

Could you paste/send the whole copybook? That snippet is parsed correctly on my machine.

chandrasekaravr commented 5 years ago

@yruslan I have pasted the whole copybook here. For the branch prepare_release_1_0_0, i had to remove the OF CARR-CLM-REC. from the DEPENDING ON

(replaced with a gist in the following comment)

tr11 commented 5 years ago

Can you use pastebin or a gist? The formatting (including line breaks) is relevant for this issue.

tr11 commented 5 years ago

Regarding the OF CARR-CLM-REC part, I've never seen the OF clause. Do you have some documentation to what it's supposed to do?

chandrasekaravr commented 5 years ago

@tr11 here is the link for the gist, https://gist.github.com/chandrasekaravr/2cce4899b846e8bf82e02a5a077a54b0

The OF part refers to the group above that contains the count field. For now, it can be ignored. I have updated the gist with a fixed copybook.

tr11 commented 5 years ago

I see, right now we just look for all the matching names so we'll need to add logic to the parser to handle the OF clause. That said, if there are no name clashes, that clause becomes redundant so probably not a priority for now.

Take a look at the PR #173 -- the TO13 was a bug in the code. That PR solves this issue.

chandrasekaravr commented 5 years ago

@tr11 Great. Thanks for taking care of the bug. The OF clause is not a blocker, as you said. Since there is no name clash, the parser is correctly identifying the group and creating an array of correct size. However, the group fields are empty. Any suggestions on that?

chandrasekaravr commented 5 years ago

@tr11 @yruslan i did further investigation into why the line items addressed in the copybook as DEPENDING ON and OCCURS TO items not being parsed by debugging into the parser. Here are my findings, please let me know on how to proceed further. Or, if any fixes are required.

Please refer to lines 134 onwards in the copy book, https://gist.github.com/chandrasekaravr/2cce4899b846e8bf82e02a5a077a54b0

The section indicates the counts and the following are the line group items referred by depending on items, as such,

           06 CARR-NCH-EDIT-CD-CNT     PIC 99.
           06 CARR-NCH-PATCH-CD-I-CNT  PIC 99.
           06 CARR-MCO-PRD-CNT         PIC 9.
           06 CARR-CLM-DEMO-ID-CNT     PIC 9.
           06 CARR-CLM-DGNS-CD-J-CNT   PIC 99.
           06 CARR-CLM-LINE-CNT        PIC 99.
           06 H-FILLER-9               PIC X(4).
         04 CARR-CLM-VAR-GRP.
           06 NCH-EDIT-GRP          OCCURS 0 TO 13 TIMES
            DEPENDING ON CARR-NCH-EDIT-CD-CNT.
             08 NCH-EDIT-TRLR-IND-CD   PIC X.
             08 NCH-EDIT-CD            PIC X(4).
           06 NCH-PATCH-GRP         OCCURS 0 TO 30 TIMES
            DEPENDING ON CARR-NCH-PATCH-CD-I-CNT.
             08 NCH-PATCH-TRLR-IND-CD
                                       PIC X.
             08 NCH-PATCH-CD           PIC XX.
             08 NCH-PATCH-APPLY-DT     PIC 9(8).

Basically it has 6 following line group items,

06 NCH-EDIT-GRP OCCURS 0 TO 13 TIMES DEPENDING ON CARR-NCH-EDIT-CD-CNT. .... 06 NCH-PATCH-GRP OCCURS 0 TO 30 TIMES DEPENDING ON CARR-NCH-PATCH-CD-I-CNT. ... 06 MCO-PRD-GRP OCCURS 0 TO 2 TIMES DEPENDING ON CARR-MCO-PRD-CNT. ... 06 CLM-DEMO-ID-GRP OCCURS 0 TO 5 TIMES DEPENDING ON CARR-CLM-DEMO-ID-CNT. ... 06 CARR-CLM-DGNS-GRP OCCURS 0 TO 12 TIMES DEPENDING ON CARR-CLM-DGNS-CD-J-CNT. ... 06 CARR-LINE-GRP OCCURS 1 TO 13 TIMES DEPENDING ON CARR-CLM-LINE-CNT. ...

Note: All of these line items are optional as the min count is 0.

Now, the parser built the indexes from the copybook as show below (highlighted items of interest),

6 CARR_NCH_EDIT_CD_CNT             D            102   1579   1580      2
6 CARR_NCH_PATCH_CD_I_CNT          D            103   1581   1582      2
6 CARR_MCO_PRD_CNT                 D            104   1583   1583      1
6 CARR_CLM_DEMO_ID_CNT             D            105   1584   1584      1
6 CARR_CLM_DGNS_CD_J_CNT           D            106   1585   1586      2
6 CARR_CLM_LINE_CNT                D            107   1587   1588      2
6 H_FILLER_9                                    108   1589   1592      4

4 CARR_CLM_VAR_GRP 248 1593 24193 22601 6 NCH_EDIT_GRP [] 112 1593 1657 65 ... ... 6 CARR_CLM_DGNS_GRP [] 131 2152 2259 108 8 NCH_DGNS_TRLR_IND_CD 129 2152 2152 1 8 CLM_DGNS_VRSN_CD 130 2153 2153 1 8 CLM_DGNS_CD 131 2154 2160 7 6 CARR_LINE_GRP [] 247 2260 24190 21931 8 NCH_LINE_TRLR_IND_CD 133 2260 2260 1

When i parsed my 1 record datafile with this copybook, it correctly identified the counts, the top 4 line group items as 0 count and the last 2 items having some item.

"CARR_NCH_EDIT_CD_CNT": 0,
"CARR_NCH_PATCH_CD_I_CNT": 0,
"CARR_MCO_PRD_CNT": 0,
"CARR_CLM_DEMO_ID_CNT": 0,
"CARR_CLM_DGNS_CD_J_CNT": 2,
"CARR_CLM_LINE_CNT": 1,
"H_FILLER_9": "0000"

And it filled the array for those line items as per the count, empty array for 0 count. But, group object with no values for the fields,

"CARR_CLM_VAR_GRP": { "NCH_EDIT_GRP": [], "NCH_PATCH_GRP": [], "MCO_PRD_GRP": [], "CLM_DEMO_ID_GRP": [], "CARR_CLM_DGNS_GRP": [{ "NCH_DGNS_TRLR_IND_CD": "", "CLM_DGNS_VRSN_CD": "", "CLM_DGNS_CD": "" }

Issue or bug: The issue or the bug that i identified with the debugger is that, the data file do not have any bytes for the empty line group items and just has the data bytes for the items that are supposed to be present. So, the indexes identified by and used by the parser to offset and extract value are incorrect. In this case, 6 NCH_EDIT_GRP [] 112 1593 1657 65 ... ... 6 CARR_CLM_DGNS_GRP [] 131 2152 2259 108

it is looking for the 0 count item _NCH_EDITGRP at index 1593 1657 and 2 count item _CARR_CLM_DGNSGRP at index 2152 2259.

However, the data file has no bytes for _NCH_EDITGRP and the data for _CARR_CLM_DGNSGRP is starting at index 1593 1657 instead of 2152 2259.

yruslan commented 5 years ago

Thanks for the analysis. It is very helpful for our understanding of the issue.

Yes, currently if an array is present in a data file it is expected to always occupy the same space as it would take to store the maximum number of elements of such an array. Even if a particular instance of an array has 0 elements, the record is expected to have bytes for that array filled.

Up until now, we haven't encountered any copybooks that contain variable size arrays that shrink depending on the number of array elements. We can add support for such arrays in one of the future releases.

Just a couple of questions to clarify.

  1. Let's say a copybook has an array of 0-3 with a 'depending on' clause. If the value of the dependent field is 0, then no bytes of that array is present in the data file, right?
  2. If (1) is true, what happens if the value of the dependent field is 1? Will the record contain only bytes for 1 element, or it will contain bytes for 3 (maximum) elements? Please, let me know if you need me to elaborate on the questions.

If you could provide a simplified example of a copybook and a data file so we are sure our implementation is aligned to your expectations, it would make it much easier for us.

chandrasekaravr commented 5 years ago

@yruslan Thanks for the response.

For 1: Thats right. No bytes are present if the array count is 0

For 2: The records contains bytes only for the array count, that is 1. It doesnt contain bytes for the remaining 2 of the max(3)

In the sense, is it treating a VariableLength record same as FixedLength record?

yruslan commented 5 years ago

Great. So the above is true for every array that has 'depending on' clause in the copybook, right?

We might implement it as an option for spark-cobol, e.g. .option(...).

tr11 commented 5 years ago

Is this ever not true? Every variable copybook/data file I have access to exhibits the exact same behavior for "occurs depending on" clauses.

yruslan commented 5 years ago

Interesting. It is always false (the current behavior) for our copybooks.

tr11 commented 5 years ago

I guess an option would be great then!

chandrasekaravr commented 5 years ago

An option would be great to switch between FixedLength and VariableLength.

chandrasekaravr commented 5 years ago

@tr11 @yruslan Thanks for looking into this. I will follow this enhancement. I have updated this gist with the sample data file, https://gist.github.com/chandrasekaravr/2cce4899b846e8bf82e02a5a077a54b0

It has the copybook and a sample data file that is of variable length and has RDW.

yruslan commented 5 years ago

👍 Thanks for the data!

yruslan commented 5 years ago

Please, try this snapshot and let me know if it worked for you:

<dependency>
      <groupId>za.co.absa.cobrix</groupId>
      <artifactId>spark-cobol</artifactId>
      <version>1.0.1-SNAPSHOT</version>
</dependency>

You also need to use this option:

.option("variable_size_occurs", "true")
chandrasekaravr commented 5 years ago

@yruslan Thank you. Will try this out.

chandrasekaravr commented 5 years ago

@yruslan Thanks for the quick enhancement. I tried the 1.0.1.SNAPSHOT. And it parsed out the variable length data file with the varying array sizes beautifully. I have a question with respect to multisegment data file processing. I experienced the same with your test5 test data also. This is what i observed,

  1. The data file is multisegmented, ie, multiple record types in the same file ( C for Static-Details and P for Contacts) [_ref: ../data/test5data]
  2. The copy book has the different record types varying groups defined as REDEFINES [_ref: ../data/test5copybook.cob]
  3. I expected either Static-Details OR Contacts to be present, depending on the segment ID, but the content is getting extracted as both groups (ofcourse, one being incorrect). The expected test results file also has such output for each record. [_ref: ../data/test5expected/test5.txt]

Is this the expected result?

yruslan commented 5 years ago

Happy the variable size arrays worked for you. Going to release 1.0.1 today with these changes.

Yes, this is the expected behavior since Copybooks do not contain information on how to distinguish between segments.

You can use redefine-segment-id-map set of options to specify a mapping between segment redefines and segment ids. If you do so Cobrix will parse proper segments automatically. The usage is described in README, and also you can take a look at the same Test5 for an example. Look for:

      .option("redefine_segment_id_map:0", "STATIC-DETAILS => C,D")
      .option("redefine-segment-id-map:1", "CONTACTS => P")
chandrasekaravr commented 5 years ago

@yruslan Thanks for the tip. It worked perfectly. I was using the _segmentfilter and _segmentfield` initially.