AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0

Exception in thread "main" za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 16: Unable to parse the value of LEVEL. Numeric value expected, but 'CUSTOM-CHANGE-FLAGS-CNT' encountered #96

Closed geethab123 closed 5 years ago

geethab123 commented 5 years ago

I am getting the error Exception in thread "main" za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 16: Unable to parse the value of LEVEL. Numeric value expected, but 'CUSTOM-CHANGE-FLAGS-CNT' encountered

Also, will Cobrix work with nested OCCURS subgroups? How many levels of nesting are handled? Do I need to make any changes to the copybooks coming from the mainframe before parsing them? If so, please let me know, because I am having a lot of issues with copybooks: I have tried to parse many of them, but they keep failing for various reasons.

If there are any tips for parsing copybooks more easily, please let me know.

Appreciate your cooperation

yruslan commented 5 years ago

The first 5 characters of each line are considered comments and are ignored. At line 16 you have 3 <tab> characters instead of 5 spaces. That's the reason for the error.
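
If there are many lines like that, one option is to normalize the copybook before parsing by expanding tabs to spaces. A minimal sketch (the input and output file names are placeholders):

import java.io.PrintWriter
import scala.io.Source

// Sketch: replace every tab with four spaces so the leading columns of each
// copybook line contain real spaces. File names are placeholders.
val source = Source.fromFile("copybook.cpy")
val cleaned = source.getLines().map(_.replace("\t", "    ")).mkString("\n")
source.close()

val writer = new PrintWriter("copybook_clean.cpy")
writer.write(cleaned)
writer.close()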

geethab123 commented 5 years ago

Thank you. Can you please answer my other questions as well? I am working on this and I am stuck; I only just started working with these files. Any tips on how the copybook files need to be cleaned would help me a lot.

Thanks a lot, appreciate your cooperation.

geethab123 commented 5 years ago

When I parsed another copybook, the schema was parsed, but while parsing the data file I got the following error. Please let me know how I can resolve this issue with the data file.

Exception in thread "main" java.lang.IllegalArgumentException: There are some files in /user/abc_binary that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (3835 bytes per record). Check the logs for the names of the files.
at za.co.absa.cobrix.spark.cobol.source.scanners.CobolScanners$.buildScanForFixedLength(CobolScanners.scala:87)
at za.co.absa.cobrix.spark.cobol.source.CobolRelation.buildScan(CobolRelation.scala:85)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:348)

yruslan commented 5 years ago

This error happens when you try to load a fixed record length file whose size is not a whole multiple of the record length. In your case, each file's size should be evenly divisible by 3835.
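
To find out which files fail that check, you can compare each file's size against the record length yourself. A sketch, where the directory path and the record length (3835) are placeholders taken from the error message:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: list each file in the input directory and report whether its size is a
// multiple of the record length computed from the copybook.
val recordLength = 3835L
val fs = FileSystem.get(new Configuration())
fs.listStatus(new Path("/user/abc_binary")).foreach { status =>
  val remainder = status.getLen % recordLength
  println(s"${status.getPath}: size = ${status.getLen}, remainder = $remainder")
}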

Such a mismatch can happen, for example, when the file is actually a multisegment variable record length file. In that case you need to add .option("is_record_sequence", "true"), and the parser will then expect a 4-byte RDW header before each record. The fields for that header should not be present in the copybook itself.
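
For reference, a minimal read sketch with that option (the paths are placeholders, and an existing SparkSession named spark is assumed):

// Sketch: read a variable record length file where each record is prefixed
// with a 4-byte RDW header. Paths are placeholders.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("is_record_sequence", "true")
  .load("/user/abc_binary")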

geethab123 commented 5 years ago

Hi,

Thanks a lot for your suggestion. Please let me know how I can check the log files. I have added .option("is_record_sequence", "true"), and with that jar I tried to process a different file. I got the error below:

ERROR FileUtils$: File hdfs://xyz/abc IS NOT divisible by 17163.
Exception in thread "main" java.lang.IllegalArgumentException: There are some files in /user/vabc/binaryfile that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (17163 bytes per record). Check the logs for the names of the files.
at za.co.absa.cobrix.spark.cobol.source.scanners.CobolScanners$.buildScanForFixedLength(CobolScanners.scala:87)
at za.co.absa.cobrix.spark.cobol.source.CobolRelation.buildScan(CobolRelation.scala:85)

yruslan commented 5 years ago

Could you please send the snippet of code you use for reading the file, e.g. the line that starts with spark.read(...)?

geethab123 commented 5 years ago

Hi, below is the .scala class I am using for parsing the mainframe copybook and data file. Please suggest what changes I need to make in the code, the copybook, or the binary file to parse this correctly.

Thanks a lot for checking my issues and helping me to parse the mainframe file.

package com.cobrix

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import za.co.absa.cobrix.spark.cobol.source
import za.co.absa.cobrix.spark.cobol._
import za.co.absa.cobrix.cobol.parser.CopybookParser
import za.co.absa.cobrix.spark.cobol.schema.{CobolSchema, SchemaRetentionPolicy}
import za.co.absa.cobrix.spark.cobol.utils.SparkUtils

object cobrixtest extends Serializable {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("cobrixtest")
    val v_copybook = args(1)
    val v_data = args(0)
    println(v_copybook)

    val spark: SparkSession = SparkSession.builder.config(sparkConf).enableHiveSupport().getOrCreate()
    import spark.implicits._

    val cobolDataframe = spark
      .read
      .format("cobol")
      .option("generate_record_id", false)                // adds the file id and record id columns
      .option("is_record_sequence", "true")               // reader uses 4-byte record headers to extract records from a mainframe file
      .option("schema_retention_policy", "collapse_root") // removes the root record group
      .option("copybook", v_copybook)
      .load(v_data)

    cobolDataframe.printSchema()
    cobolDataframe.show(300, false)
  }
}

yruslan commented 5 years ago

Interesting. This error should not happen on variable record length files. Which version of Cobrix are you using?

geethab123 commented 5 years ago

I am using the Cobrix 0.4.2 libraries, Scala 2.11.8, and Spark 2.1.1; that is what I have available. I can use a lower version of Scala if needed. Please let me know which versions of Cobrix, Scala, and Spark should be used.

yruslan commented 5 years ago

From my perspective everything looks good: the program and the versions of Cobrix, Spark and Scala. The strange thing is that the error message you are getting occurs only when reading fixed record length files, but .option("is_record_sequence", "true") should turn on the variable record length reader, which doesn't throw that particular error.

Is it possible to get an example data file and a copybook that cause this, so we can reproduce the error on our side?

geethab123 commented 5 years ago

Hi,

Thank you so much for your reply. If everything is correct, I will keep trying.

Thanks Geetha

geethab123 commented 5 years ago

I have placed the jar file on the worker node, and the copybook and binary files are in HDFS. Is this correct? Please confirm. I am getting the errors below. To work around them I added the line 000550 05 FILLER PIC X(04). AMTR010 to the copybook, but the same error keeps coming.

java.lang.IllegalStateException: RDW headers should never be zero (0,100,0,0). Found zero size record at 0.
at za.co.absa.cobrix.cobol.parser.decoders.BinaryUtils$.extractRdwRecordSize(BinaryUtils.scala:305)
at za.co.absa.cobrix.spark.cobol.reader.index.IndexGenerator$.getNextRecordSize(IndexGenerator.scala:136)
at za.co.absa.cobrix.spark.cobol.reader.index.IndexGenerator$.sparseIndexGenerator(IndexGenerator.scala:58)

geethab123 commented 5 years ago

Can you please let me know how to fix the above issue? I am stuck here.

yruslan commented 5 years ago

There should not be an entry in the copybook for the 4-byte RDW header, so please remove the FILLER.

Also, judging by the values of the RDW header (0, 100, 0, 0), it is possible that your RDW headers are big endian. To load files that have a big-endian RDW, use this option: .option("is_rdw_big_endian", "true")
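
For reference, the combined options would look something like this (paths are placeholders; an existing SparkSession named spark is assumed):

// Sketch: variable record length file with big-endian RDW headers. Paths are placeholders.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("is_record_sequence", "true")  // records are prefixed with 4-byte RDW headers
  .option("is_rdw_big_endian", "true")   // interpret the RDW length bytes as big endian
  .load("/path/to/data")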

geethab123 commented 5 years ago

Thank you so much. I tried what you recommended: I removed the FILLER from the copybook, added .option("is_rdw_big_endian", "true"), and ran it again. The same error appears. Are there any options left for me to try for parsing my data file? Being able to parse these files would help me a lot.

yruslan commented 5 years ago

I would like to help, but unfortunately mainframe data files are very varied. In order to parse a mainframe file we need to understand how records are laid out in the data, which headers the data file has, make sure the copybook properly matches the data file, etc.

We have tried different combinations of options and I'm out of suggestions that can simply be tried and checked. If you have a small example of a similar file and the corresponding copybook, we could look at it and try to figure out what is needed to parse it properly.

geethab123 commented 5 years ago

Thank you so much for your time. Due to my company's policies I cannot share the data, and I am not a mainframe person, so I cannot generate sample data myself.

geethab123 commented 5 years ago

How can I run the unit tests in za.co.absa.cobrix.spark.cobol that use the data files in the data folder? Can they run locally, or do we need to move all the files to HDFS first? If HDFS is needed, how do we run the tests there? Please also help me with how to check the log files and how to run the unit tests.

yruslan commented 5 years ago

All unit tests can be run using mvn test or mvn clean test at the project's root directory. It will run everything in local mode; there is no need to copy files to HDFS.

geethab123 commented 5 years ago

Hi, I am getting the error below when packaging in the Maven lifecycle. Please let me know how to resolve this.

01:26:47.163 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
01:26:47.523 ERROR org.apache.hadoop.util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:393)
at org.apache.hadoop.util.Shell.(Shell.java:386)
at org.apache.hadoop.util.StringUtils.(StringUtils.java:79)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
at org.apache.hadoop.security.Groups.(Groups.java:93)
at org.apache.hadoop.security.Groups.(Groups.java:73)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2427)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2427)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
at org.apache.spark.SparkContext.(SparkContext.scala:295)

yruslan commented 5 years ago

This is a known issue when running Spark on Windows: https://stackoverflow.com/a/39525952/1038282
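
The usual workaround from that answer is to place winutils.exe under a local directory's bin folder and point Hadoop at it before the SparkContext is created. A sketch, where the local path is a placeholder:

// Sketch: winutils.exe is expected at C:\hadoop\bin\winutils.exe (placeholder location).
// Setting the HADOOP_HOME environment variable to the same directory also works.
System.setProperty("hadoop.home.dir", "C:\\hadoop")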

geethab123 commented 5 years ago

Thank you for your reply.

geethab123 commented 5 years ago

I sent the files to your email. Could you do me the favor of checking them and letting me know what the issue is and how to fix it, or what I should do to make the files parse?

yruslan commented 5 years ago

Received the files, will take a look. It will take some time, and I will likely get back to you with more questions. So far the copybook seems quite complex and the record structure of the file is not obvious, so no guarantees I'll be able to figure it out.

What would also be very helpful is if you could get the first couple of records of this file in an already-parsed format, like CSV. It would make it easier for me to figure out where one record ends and the next one begins.

geethab123 commented 5 years ago

Hi, I have a quick question. Does Cobrix support nested OCCURS? If yes, how many levels does it support?

yruslan commented 5 years ago

Yes, this should be supported with an arbitrary number of nesting levels. We haven't specifically tested this scenario, but the code is generic enough to cover it.
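
For illustration, a copybook fragment with two levels of nested OCCURS (field names are made up) could be read like this. The data path is a placeholder, an existing SparkSession named spark is assumed, and this assumes your Cobrix version supports the copybook_contents option; otherwise save the text to a file and pass it via the copybook option:

// Sketch: nested OCCURS groups become nested arrays of structs in the Spark schema.
val copybook =
  """       01  RECORD.
    |           05  ACCOUNT-GROUP OCCURS 3 TIMES.
    |              10  ACCOUNT-ID        PIC X(10).
    |              10  TRANSACTION       OCCURS 5 TIMES.
    |                 15  TRAN-AMOUNT    PIC S9(7)V99 COMP-3.
    |                 15  TRAN-DATE      PIC X(8).
    |""".stripMargin

val df = spark.read
  .format("cobol")
  .option("copybook_contents", copybook)
  .load("/path/to/data")   // placeholder path

// ACCOUNT-GROUP is an array of structs, each containing an array of TRANSACTION structs.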

geethab123 commented 5 years ago

When I try to parse a fixed-width file, I am getting the error below. Can you please help me figure out how to fix it?

/* 495328 */
/* 495329 */       mutableRow.update(0, value);
/* 495330 */     }
/* 495331 */
/* 495332 */     return mutableRow;
/* 495333 */   }
/* 495334 */ }

org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection has grown past JVM limit of 0xFFFF
at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
at org.codehaus.janino.util.ClassFile.addConstantFieldrefInfo(ClassFile.java:342)
at org.codehaus.janino.UnitCompiler.writeConstantFieldrefInfo(UnitCompiler.java:11109)

yruslan commented 5 years ago

It looks like you are hitting a JVM limit at Spark's code generation stage. Try creating a Spark session with whole-stage codegen turned off. Something like this:

  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("Example")
    .config("spark.sql.codegen.wholeStage", value = false)
    .getOrCreate()
geethab123 commented 5 years ago

Thanks a lot, your suggestion worked and fixed-width files are parsing now. But I have another quick question. For a fixed-width file, if the copybook has a FILLER at the end, the data file cannot be parsed; the data file has blanks at the end. Is there any fix for this? Please suggest how to overcome it.

yruslan commented 5 years ago

It is great to hear that the files are parsing now, at least partially. But I'm sorry, I'm not completely following what the issue is. Could you please describe it using a simplified example?

geethab123 commented 5 years ago

Below is an example of a fixed-width file with FILLERs at the end. Our copybook ends like this:

15 CUSTOM-STATUS PIC X(01).
15 FILLER PIC X(15).
15 FILLER PIC X(773).

and the data file has blanks at the end. I am unable to parse this file as a fixed-width file; the blanks at the end are not handled and I get the "NOT DIVISIBLE ..." error.

If the copybook does not have the FILLER at the end, I am able to parse the file.

yruslan commented 5 years ago

Do I understand correctly that the file has 773 bytes at the end that should be ignored?

There is a feature planned to be introduced: file headers and footers. Using this new feature you will be able to specify how many bytes to ignore at the beginning and at the end of a file. I will let you know when the feature is available.

geethab123 commented 5 years ago

Hi Ruslan,

I am trying to parse a copybook that has 4000 columns, together with its binary file. I am again getting thousands of lines of generated code on the screen and the JVM error, even after adding .config("spark.sql.codegen.wholeStage", value = false). My Spark version is 2.1.1. Is working with RDDs my only option, or is there another solution? I read online that Spark 2.3 has a fix for this. Please let me know what my options are. I also have another question: does Cobrix work with Spark 2.3 and Spark 2.4?

yruslan commented 5 years ago

Yes, newer versions of Spark handle wide dataframes (with thousands of columns) much better. And yes, you can use Spark 2.3 and Spark 2.4, as long as you use the build for Scala 2.11 (not Scala 2.12).

geethab123 commented 5 years ago

Thank you. But is there any way to fix this issue while staying on Spark 2.1?

yruslan commented 5 years ago

Not sure. It depends on your exact use case. Handling wide dataframes definitely got better in 2.3, but the codegen error you got seems odd, so it is really hard to tell.

yruslan commented 5 years ago

The issue with 773 spaces is related to #87

geethab123 commented 5 years ago

We renamed the FILLER to something else and it worked. Thanks for your time.

geethab123 commented 5 years ago

Also, the huge amount of generated code scrolling through the output for copybooks with a large number of columns is solved when I use Spark 2.4. But I was looking for something that works in 2.1.1.

geethab123 commented 5 years ago

I have packed decimal fields in the copybook with a picture like S9(X)V9(8). After parsing, the values for these kinds of fields come out as 0E-8 instead of 0. Is there any way we can fix this? Please advise.

yruslan commented 5 years ago

Short answer: 0 and 0E-8 are the same value. They are just displayed differently on screen depending on the tool you use.

A picture like S9(10)V9(8) converts to a Spark decimal(18,8) value, which is a fixed-point decimal type. I presume that for Spark methods like df.show() the scientific format is chosen so it is clear to the viewer that the column has a decimal type.
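
A quick way to see that the two are the same value, using plain java.math.BigDecimal (which is what Spark decimal columns are backed by):

import java.math.{BigDecimal, BigInteger}

// Zero with scale 8, i.e. what a decimal column with scale 8 holds for the value 0.
val zeroScaled = new BigDecimal(BigInteger.ZERO, 8)

println(zeroScaled)                                  // 0E-8 (BigDecimal.toString uses scientific notation here)
println(zeroScaled.toPlainString)                    // 0.00000000
println(zeroScaled.compareTo(BigDecimal.ZERO) == 0)  // true: numerically equal to zero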

What is your output format (Parquet, JSON, CSV, etc.)? Does the scientific notation appear in the files themselves?

geethab123 commented 5 years ago

Thank you. I will check the files.