AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0
137 stars 78 forks source link

Rewrite the COBOL parser #39

Closed yruslan closed 5 years ago

yruslan commented 5 years ago

The current parser does it's job, but it is written without conforming to the usual practice of parsing.

We need to write a grammar for the subset of COBOL that we are parsing and generate the parser using a tool like ANTLR, JavaCC, etc.

tr11 commented 5 years ago

I can probably help with this. I just rewrote a parser I build some time ago with boost.spirit to use ANTLR. This is loosely based on the grammar at https://github.com/uwol/proleap-cobol-parser. Here's a repo with the code -- it successfully parses all the copybooks I have access to.

yruslan commented 5 years ago

Great, Thanks! We will take a look at that.

tr11 commented 5 years ago

Taking a stab at this issue on this branch. If you have a chance, could you take a quick look for some early feedback?

yruslan commented 5 years ago

Wow, there are a lot of changes already. Will take a look. Thank you very much!

tr11 commented 5 years ago

It currently passes all the tests for the cobol-parser except for the PIC 9(n)9 one (is this a valid pic?). Still looking into a few failed tests on the spark-cobol side.

yruslan commented 5 years ago

Great, Thanks! Not sure if PIC 9(n)9 is valid, it was added for completeness, probably. I plan to work extensively on Cobrix starting from about the end of the next week. Will look at the rewritten parser. This is so cool that Cobrix can finally have a parser with the proper grammer definition!

tr11 commented 5 years ago

The parser on my branch doesn't recognize that kind of PIC, but should be easy enough to add it. I've never seen it before, but my COBOL knowledge comes from the copybooks and data files I have access to.

yruslan commented 5 years ago

We haven't encountered such a PIC either. So let's declare it invalid for now to make things a little bit easier.

tr11 commented 5 years ago

All the tests pass on my branch except for Test6TypeVarietySpec. It seems that my logic for PICs with signs is different than what we have right now.

I was expecting EX-NUM-INT02 to be Integral(9(8)+,8,Some(Right),true,None,None,Some(EBCDIC()))) but getting from the current schema Integral(9(8)+,8,None,true,None,None,Some(EBCDIC()))

Is it the case that the + in 9(8)+ or +9(8) is just for display and not for storage?

yruslan commented 5 years ago

The location of the '+' symbol should be for storage. I think this didn't came out as a bug because Cobrix removes all '+' symbols when parsing DISPLAY numbers. So if a '+' is located to the left or to the right side of the number does not matter.

But for the clarity of the parser I think the parser should return what you are expecting, e.g. Integral(9(8)+,8,Some(Right),true,None,None,Some(EBCDIC())))