AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0

Copybook.parseSimple not dropping the fillers in output ast #324

Closed codealways closed 2 years ago

codealways commented 4 years ago

String copybookContents = "01 RECORD. 05 FILLER PIC X(1). 05 COMPANY_PREFIX PIC X(3). 05 FILLER PIC X(1). 05 FILLER PIC X(1). 05 COMPANY_NAME PIC X(9).";

Group grp = CopybookParser.parseSimple(copybookContents, true, true, CommentPolicy.apply(false, 6, 72)).ast();

In the above scenario, since drop_value_fillers is true, the output AST should not contain the FILLERs. But the output AST still provides details for every column, including the fillers.

yruslan commented 4 years ago

Thanks for the bug report. Dropping of fillers is a feature of 'spark-cobol'. Fillers should be dropped from the Spark schema while still being present in the AST. But your expectation makes sense. We can drop fillers from the AST as well.

yruslan commented 2 years ago

Hi, I was revisiting this issue. Each AST element has an isFiller flag, so you can always ignore all fields for which the flag is true.

Is it necessary for your use case to actually remove fillers from the AST if you have this flag?
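The approach suggested here, pruning every node whose isFiller flag is set, can be sketched as follows. Note that Node, Field, and Group below are simplified stand-ins for Cobrix's actual Statement/Group AST classes, used only to illustrate the traversal:

```scala
// Simplified stand-in for Cobrix's AST; the real Statement type has many more fields.
sealed trait Node { def name: String; def isFiller: Boolean }
case class Field(name: String, isFiller: Boolean) extends Node
case class Group(name: String, isFiller: Boolean, children: List[Node]) extends Node

// Recursively drop every node (field or group) marked as a filler.
def dropFillers(node: Node): Option[Node] = node match {
  case f: Field if f.isFiller => None
  case f: Field               => Some(f)
  case g: Group if g.isFiller => None
  case g: Group               => Some(g.copy(children = g.children.flatMap(dropFillers)))
}

// The record from the original bug report, with fillers flagged.
val record = Group("RECORD", isFiller = false, List(
  Field("FILLER", isFiller = true),
  Field("COMPANY_PREFIX", isFiller = false),
  Field("FILLER", isFiller = true),
  Field("FILLER", isFiller = true),
  Field("COMPANY_NAME", isFiller = false)
))

val pruned = dropFillers(record).get.asInstanceOf[Group]
```

After pruning, only the named columns remain in the tree.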

codealways commented 2 years ago

That's how we are handling it in our code while parsing the AST. Still, it would be a good feature to be able to drive this via parseSimple.

yruslan commented 2 years ago

Could you please elaborate on what you are trying to achieve?

Every use case that I can think of can be done by just ignoring fields (AST statements) for which isFiller == true.

codealways commented 2 years ago

We are currently parsing the copybook via the AST and converting it to JSON, using parseSimple to get the AST. If we could drop or retain the fillers in the AST based on user options, we would not need to handle them ourselves when converting to JSON.

yruslan commented 2 years ago

Sure, but node.isFiller can be used to achieve the same, can't it? https://github.com/AbsaOSS/cobrix/blob/3014923981a2f466343cb9b464ec8244b0b0ad8c/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Statement.scala#L81

What's the code snippet that you are using?

yruslan commented 2 years ago

I see why you are asking for the feature. There are parameters called 'dropGroupFillers' and 'dropValueFillers', but nothing is actually dropped; fillers are just marked. It makes sense. We can implement the dropping as well.

codealways commented 2 years ago

You are correct. Currently we are parsing and dropping the fillers ourselves by utilizing isFiller; the parameter we are passing to parseSimple for dropping fillers is not functional.

yruslan commented 2 years ago

This is implemented.

Please check the latest master. Note that the signature of the method (parseSimple) has changed. It now reflects what is actually being done.

Spark schemas do not support two columns having the same name, so FILLERs need to be renamed in order to be retained. The method therefore allows controlling whether fillers are renamed, plus a separate flag that controls whether non-renamed fillers are dropped.
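The renaming behavior described here can be illustrated with a small sketch. The FILLER_1, FILLER_2 naming scheme below is an assumption for illustration only; the actual naming policy Cobrix applies may differ:

```scala
// Rename repeated FILLER fields so every column name is unique,
// since a Spark schema cannot contain two columns with the same name.
// The "FILLER_<n>" suffix scheme is hypothetical, not Cobrix's actual policy.
def renameFillers(names: List[String]): List[String] = {
  var counter = 0
  names.map {
    case "FILLER" =>
      counter += 1
      s"FILLER_$counter"
    case other => other
  }
}

val renamed = renameFillers(
  List("FILLER", "COMPANY_PREFIX", "FILLER", "FILLER", "COMPANY_NAME"))
```

With the renaming applied, all five fields can be retained in a Spark schema without name clashes.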

codealways commented 2 years ago

@yruslan If the signature is changed, it will break our existing code when we update the version; we need it to be backward compatible. Can we have it overloaded?

yruslan commented 2 years ago

Okay, fair point. Will make it compatible.

codealways commented 2 years ago

@yruslan Thanks a lot. You can mark the existing parseSimple signature as @Deprecated.

yruslan commented 2 years ago

The changes are in master, you can check. To retain compatibility, the behavior is the same by default. Use 'dropFillersFromAst = true' to actually drop fillers from the AST.

codealways commented 2 years ago

Since parseSimple now has the new parameter dropFillersFromAst in its signature, once we upgrade we have to provide a value for it in our calling code. In Scala it won't be required since a default value of false is given, but in Java we have to explicitly pass false when calling.

yruslan commented 2 years ago

Yeah, but this is the only way to preserve backward-compatible behavior. Since the method has default parameter values, it cannot be overloaded.
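The compatibility point can be seen in a minimal sketch. The parse function below is a hypothetical stand-in for parseSimple, not the real API: a new parameter with a default value keeps existing Scala call sites compiling unchanged, while Java callers must pass the argument explicitly because Java does not see Scala default arguments:

```scala
// Hypothetical stand-in for parseSimple: the new flag defaults to false,
// so existing behavior is preserved when the argument is omitted.
def parse(copybook: String, dropFillersFromAst: Boolean = false): String =
  if (dropFillersFromAst) s"$copybook (fillers dropped)" else copybook

val oldStyle = parse("01 RECORD.")                            // existing Scala callers unchanged
val newStyle = parse("01 RECORD.", dropFillersFromAst = true) // opt in to the new behavior
```

A Java caller, by contrast, would always have to write `parse("01 RECORD.", false)`, which is the friction described above.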

yruslan commented 2 years ago

2.4.8 is released and it has the fix.

Btw, Cobrix has a converters project that currently just provides examples of how to convert mainframe files to JSON without Spark. Consider contributing your converter to the project :)

codealways commented 2 years ago

Sure, I would be happy to contribute. Can I know the overall requirements and the detailed plan for it? Is there a high-level design document, or is it still in progress? I can contribute there as well. Any plans to have the code base in Java, or will it be completely in Scala? :)

yruslan commented 2 years ago

We don't have an exact plan at the moment, just ideas. The one we are most likely to implement is a command-line tool + library that allows converting mainframe files to JSON.

Yes, the project will continue to be in Scala.

codealways commented 2 years ago

Sure, then I can think of 2 requirements:

  1. Converting the copybook AST to the corresponding JSON.
  2. Converting the copybook to a map with the parent hierarchy as the key and the column name as the value.
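The two requirements above can be sketched against a simplified AST model. As before, Node, Field, and Group are hypothetical stand-ins for Cobrix's actual Statement classes, and the hand-rolled JSON rendering is only illustrative (a real converter would use a JSON library):

```scala
// Minimal stand-in AST for illustration.
sealed trait Node { def name: String }
case class Field(name: String) extends Node
case class Group(name: String, children: List[Node]) extends Node

// Requirement 1: render the AST as JSON.
def toJson(node: Node): String = node match {
  case Field(n)       => s""""$n""""
  case Group(n, kids) => s"""{"$n":[${kids.map(toJson).mkString(",")}]}"""
}

// Requirement 2: pair each leaf column with its dot-separated parent hierarchy,
// e.g. "RECORD.DETAIL" -> "COMPANY_NAME".
def columnPaths(node: Node, path: List[String] = Nil): List[(String, String)] =
  node match {
    case Field(n)       => List(path.mkString(".") -> n)
    case Group(n, kids) => kids.flatMap(c => columnPaths(c, path :+ n))
  }

val ast = Group("RECORD", List(
  Field("COMPANY_PREFIX"),
  Group("DETAIL", List(Field("COMPANY_NAME")))
))
```

Grouping the pairs by their hierarchy key (e.g. with `columnPaths(ast).groupMap(_._1)(_._2)`) handles the case where several columns share the same parent.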