ZJONSSON / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License
34 stars 61 forks

Not getting all columns while streaming parquet file from S3 or locally downloaded file reading #60

Open rajan596 opened 3 years ago

rajan596 commented 3 years ago

Hi,

I am using the standard code shown below, but I am not getting the desired columns in the cursor. Can anyone tell me where the issue is in the library?

The desired columns in the S3 parquet file are A, B, and C, but column A is missing for most of the records. When I validate the same parquet file by downloading it locally and converting it to CSV, the column A value is present for the entire dataset. Please help; where could this go wrong?

Version: "parquetjs-lite": "0.8.0", Node.js version: v8.0.0

import parquet from 'parquetjs-lite/parquet';

let reader = await parquet.ParquetReader.openS3(s3Client, params);
let cursor = reader.getCursor();
let record = null;
while ((record = await cursor.next())) {
  console.log(record);
}
rajan596 commented 3 years ago

Update: the same file is completely readable using the Python pandas module. The parquet file in S3 originated from Spark/Python.

entitycs commented 3 years ago

Have you found any patterns in which rows have 'column A' data missing? Are the values strings or numbers (e.g. BigInt)?
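One way to look for such a pattern might be a small scan over the cursor that records where the column goes missing and what types the present values have. This is a sketch, not part of parquetjs-lite; `scanMissing` is a hypothetical helper, and it only assumes a cursor with the same async next() API as getCursor() in the snippet above:

```javascript
// Hypothetical helper: count rows where a given column is missing while
// streaming, and note the JS types of the values that are present.
async function scanMissing(cursor, column) {
  const stats = { total: 0, missing: 0, firstMissingRow: null, types: new Set() };
  let record;
  while ((record = await cursor.next())) {
    if (record[column] === undefined || record[column] === null) {
      stats.missing++;
      // Remember the 0-based index of the first row lacking the column.
      if (stats.firstMissingRow === null) stats.firstMissingRow = stats.total;
    } else {
      stats.types.add(typeof record[column]); // e.g. 'number' vs 'bigint'
    }
    stats.total++;
  }
  return stats;
}
```

Run against reader.getCursor() and check whether firstMissingRow clusters around the ~70k mark; a sharp boundary there would point at a row-group or shredding issue rather than at the data itself.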

Regards,

Dustin


rajan596 commented 3 years ago

@entitycs I cannot share the parquet file, but there was one pattern across all the parquet files I tried to read: the field went missing after about 70k lines of data had been read. I'm not sure about the data type, but it was in the form of a number.

entitycs commented 3 years ago

@rajan596 When you get a chance, can you try reverting to this commit and attempting to read past the 70k mark again?

https://github.com/ZJONSSON/parquetjs/commit/5277eb866e3c4a76df78010a9df0859f79c665c0

Without being able to see the data, my hunch is that either the numbers in the field grow and a compression algorithm not present in the lite version takes a different path, or there is an issue in lib/shred.js, which uses different iteration methods between HEAD and the above commit.
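If the cutoff does align with a row-group edge, that would support the shred.js theory. A quick way to check is to compute cumulative row-group boundaries from the Parquet footer metadata. This is only a sketch: it assumes `rowGroups` is an array shaped like the `row_groups` list in the Parquet FileMetaData footer (parquetjs exposes the decoded footer on reader.metadata), and it only uses the num_rows field:

```javascript
// Sketch: cumulative row counts at each row-group boundary, to compare
// against the ~70k point where column A starts going missing.
function rowGroupBoundaries(rowGroups) {
  const boundaries = [];
  let cumulative = 0;
  for (const rg of rowGroups) {
    // num_rows may be a plain number or a 64-bit wrapper object;
    // Number(...) covers the common cases.
    cumulative += Number(rg.num_rows);
    boundaries.push(cumulative);
  }
  return boundaries;
}
```

A boundary at or just before the point where the column disappears would suggest the reader mishandles a later row group, rather than the data itself being absent.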