AbsaOSS / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
Apache License 2.0

How to read a pipe separated file with Cobrix #677

Open pinakigit opened 1 month ago

pinakigit commented 1 month ago

I have a file on the Mainframe that is pipe-delimited and has a header row with column names, which are also pipe-separated. The records are fixed length, and everything after the last field is padded with spaces.

I can FTP this file from the Mainframe as either ASCII or binary. Is there a way to read this file with Cobrix? It doesn't have a copybook, and the fields have no fixed lengths.

yruslan commented 1 month ago

Hi, could you send an example of such a file with a copybook? It seems odd that the file is both pipe-separated and has a fixed record length.

pinakigit commented 1 month ago

Sample below. Every record has a fixed length of 228.

Index|Customer Id|First Name|Last Name|Company|City|Country|Phone 1|Phone 2|Email|Subscription Date|Website
1|DD37Cf93aecA6Dc|Sheryl|Baxter|Rasmussen Group|East Leonard|Chile|229.077.5154|397.884.0519x718|zunigavanessa@smith.info|2020-08-24|http://www.stephenson.com/
2|1Ef7b82A4CAAD10|Preston|Lozano|Vega-Gentry|East Jimmychester|Djibouti|5153435776|686-620-1820x944|vmata@colon.com|2021-04-23|http://www.hobbs.com/
3|6F94879bDAfE5a6|Roy|Berry|Murillo-Perry|Isabelborough|Antigua and Barbuda|+1-539-402-0259|(496)978-3969x58947|beckycarr@hogan.com|2020-03-25|http://www.lawrence.com/

yruslan commented 1 month ago

The file format looks like a pipe-delimited CSV. You can use Spark's built-in CSV reader to convert it into a DataFrame: https://spark.apache.org/docs/latest/sql-data-sources-csv.html

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("inferSchema", "true")
  .load("/path/to/file/or/folder")

df.show()
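One detail worth noting about the fixed-length padding: since each record is padded with spaces up to 228 bytes, the trailing spaces end up attached to the last column after splitting on the delimiter. A minimal plain-Scala sketch of this (no Spark needed; the payload fragment is illustrative, taken from the sample above, and the record length of 228 comes from this thread):

```scala
// A fixed-length record is the delimited payload plus trailing spaces
// up to the record length (228 in this thread's sample).
val recordLength = 228
val payload = "1|DD37Cf93aecA6Dc|Sheryl|Baxter" // illustrative fragment only
val fixedRecord: String = payload.padTo(recordLength, ' ')

// After splitting on '|', only the last field carries the padding,
// so trimming each field recovers the clean values.
val fields: Array[String] = fixedRecord.split('|').map(_.trim)
```

In the Spark reader above, the same effect can usually be achieved by trimming the last column (e.g. with `trim`) after loading, or by enabling an option such as `ignoreTrailingWhiteSpace` if your Spark version supports it for your use case.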