AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0

Issue with using with_input_file_name_col option #606

Closed saikumare-a closed 1 year ago

saikumare-a commented 1 year ago

Hi Team,

I am getting the error below while trying to get the input file name. The options and code I used are shown below; please correct me if I missed anything.

final_options:

    {
        'copybook': '<copybook>',
        'generate_record_id': 'false',
        'drop_value_fillers': 'false',
        'drop_group_fillers': 'false',
        'pedantic': 'true',
        'debug': 'string',
        'filler_naming_policy': 'previous_field_name',
        'with_input_file_name_col': 'file_name',
        'encoding': 'ascii',
        'record_format': 'D',
        'ascii_charset': 'ISO-8859-1',
        'variable_size_occurs': 'true'
    }

Code:

    import pyspark.sql.functions as F

    print(f"final_options:{final_options}")
    df = spark.read.format("cobol").options(**final_options).load('<file>')
    df = df.withColumn("input_file_name", F.input_file_name())
    df.display()

Error IllegalArgumentException: Option 'with_input_file_name_col' is supported only when one of this holds: 'record_format' = V or 'record_format' = VB or 'record_format' = D or 'is_record_sequence' = true or one of these options is set: 'record_length_field', 'file_start_offset', 'file_end_offset' or a custom record extractor is specified

Thanks, Saikumar

yruslan commented 1 year ago

Hi, it seems the documentation is outdated. I will update it soon. For record_format = "D", most of the time you can use

df.withColumn("file_name", input_file_name())

instead of .option("with_input_file_name_col", "file_name").

This is quite a recent change.
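For PySpark users, the suggestion above could be sketched roughly as follows. This is only an illustration, not the Cobrix API: `split_file_name_option` and `read_with_file_name` are hypothetical helper names, and the default column name `file_name` is an assumption taken from the options posted in this issue. The idea is to drop `with_input_file_name_col` from the options and add the column with Spark's own `input_file_name()` after loading.

```python
def split_file_name_option(options):
    """Return (options without 'with_input_file_name_col', requested column name).

    Hypothetical helper: it only rearranges the options dictionary.
    """
    cleaned = dict(options)  # copy, so the caller's dictionary is left untouched
    # Assumption: fall back to 'file_name' when the option was never set.
    column = cleaned.pop("with_input_file_name_col", "file_name")
    return cleaned, column


def read_with_file_name(spark, path, options):
    """Read with the 'cobol' format, then add the file name as a regular column."""
    # Imported lazily so split_file_name_option() works without Spark installed.
    from pyspark.sql import functions as F

    cleaned, column = split_file_name_option(options)
    df = spark.read.format("cobol").options(**cleaned).load(path)
    # input_file_name() is a built-in Spark SQL function, not a Cobrix option.
    return df.withColumn(column, F.input_file_name())
```

With the options from the original post, this would be called as `read_with_file_name(spark, '<file>', final_options)`.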

saikumare-a commented 1 year ago

I tested on 2.6.2 and am getting a blank value instead of the actual file name.

yruslan commented 1 year ago

Try 2.6.5.

It works for me:

import org.apache.spark.sql.functions.input_file_name

 spark
  .read
  .format("cobol")
  .option("copybook_contents", copybook)
  .option("pedantic", "true")
  .option("record_format", "D")
  .option("schema_retention_policy", "collapse_root")
  .option("ascii_charset", "ISO-8859-1")
  .option("generate_record_id", false)
  .option("variable_size_occurs", true)
  .option("drop_value_fillers", false)
  .option("drop_group_fillers", false)
  .option("debug", "string")
  .option("filler_naming_policy", "previous_field_name")
  .load(tmpFileName)
  .withColumn("f", input_file_name()).show

+-----+-------+--------------------+
|    A|A_debug|                   f|
+-----+-------+--------------------+
|12.34|   1234|file:/var/folders...|
| null|       |file:/var/folders...|
+-----+-------+--------------------+

saikumare-a commented 1 year ago

We are using 2.6.2 across the platform. Testing or upgrading to 2.6.5 is a big effort for us.

Is it possible for you to check on 2.6.2?

yruslan commented 1 year ago

Yes, 2.6.2 produced blanks for me as well.

I think upgrading to 2.6.5 is the only option in this case since the issue has been fixed there.
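For anyone pinning the version from Python, a minimal sketch of a guard based only on what this thread reports (2.6.2 returns blanks, 2.6.5 works) might look like the following. The helper names are hypothetical, the Scala binary version `2.12` is an assumption about the cluster, and versions between 2.6.2 and 2.6.5 are treated conservatively as not known to work, because the thread does not say where the fix landed.

```python
# Version reported in this thread to return real file names for record_format = "D".
KNOWN_GOOD_VERSION = (2, 6, 5)


def parse_version(version):
    """Turn a plain 'major.minor.patch' string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def known_to_work(version):
    """True if the version is at or above the one confirmed in this thread."""
    return parse_version(version) >= KNOWN_GOOD_VERSION


def cobrix_package(version, scala_binary="2.12"):
    """Build a Maven coordinate for spark-cobol (scala_binary is an assumption)."""
    return f"za.co.absa.cobrix:spark-cobol_{scala_binary}:{version}"
```

The resulting coordinate could then be passed to Spark's `spark.jars.packages` setting when the session is created, for example `cobrix_package("2.6.5")`.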

saikumare-a commented 1 year ago

Sure, thanks for checking on 2.6.2 and confirming. We can close the question.

Thanks for your support as always.