Open rajan596 opened 3 years ago
Update: the same file is completely readable using the Python pandas module. The parquet file in S3 was originally produced by Spark/Python.
Have you found any patterns in which rows have 'column A' data missing? Are they strings or numbers (e.g. BigInt)?
Regards,
Dustin
@entitycs I cannot share the parquet file, but there was one pattern across all the parquet files I tried to read: the field went missing after roughly 70k rows of data. I am not sure about the exact data type, but the values were numeric.
@rajan596 When you get a chance, can you try reverting to this commit, and attempt to read past the 70K mark again?
https://github.com/ZJONSSON/parquetjs/commit/5277eb866e3c4a76df78010a9df0859f79c665c0
Without being able to see the data, my hunch is that either the numbers in the field grow and a compression algorithm not present in the lite version takes a different path, or there is an issue in lib/shred.js, which uses differing iteration methods between HEAD and the above commit.
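The shredding concern can be illustrated in isolation. This is a hypothetical sketch, not parquetjs's actual lib/shred.js code: if a shredder iterates over the keys of each incoming record rather than over the schema's field list, a record missing column A silently produces no entry for that column, whereas schema-driven iteration keeps it as an explicit null.

```javascript
// Hypothetical sketch of two shredding strategies (not parquetjs's real code).
const schemaFields = ['A', 'B', 'C'];
const record = { B: 2, C: 3 }; // column A absent from this record

// Key-based iteration drops column A entirely:
const byKeys = {};
for (const k in record) byKeys[k] = [record[k]];

// Schema-based iteration keeps column A as an explicit null:
const bySchema = {};
for (const f of schemaFields) {
  bySchema[f] = [record[f] !== undefined ? record[f] : null];
}

console.log(Object.keys(byKeys));   // ['B', 'C']
console.log(Object.keys(bySchema)); // ['A', 'B', 'C']
```

If the iteration style changed between the linked commit and HEAD, records that genuinely omit a field on the write side could explain a column vanishing on the read side.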
Hi,
I am using the standard code shown below, but I am not getting the desired columns in the cursor. Can anyone tell me where the issue in the library is?
The desired columns in the S3 parquet file are A, B, and C, but column A is missing for most of the records. When validating the same parquet file by downloading it locally and converting it to CSV, the column A value is present for the entire dataset. Where could this go wrong?
Version: "parquetjs-lite": "0.8.0", Node.js version: v8.0.0
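The original snippet is not preserved in this thread; a minimal read loop with parquetjs-lite typically looks like the following. This is a sketch assuming the documented ParquetReader API (`openFile`, `getCursor`, `cursor.next()`), not the reporter's exact code.

```javascript
// Sketch of a standard parquetjs-lite read loop (assumed API, not the
// reporter's original snippet).
async function readAllRows(path) {
  const parquet = require('parquetjs-lite'); // loaded lazily in this sketch
  const reader = await parquet.ParquetReader.openFile(path);
  const cursor = reader.getCursor();
  const rows = [];
  let record;
  // cursor.next() resolves to null once all rows are consumed
  while ((record = await cursor.next())) {
    rows.push(record); // each record should expose columns A, B and C
  }
  await reader.close();
  return rows;
}
```

With this shape, a missing column A past ~70k rows would show up as `record.A === undefined` inside the loop, which is a useful place to log the first failing row number.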