ddotta / parquetize

R package that converts databases of different formats to Parquet format
https://ddotta.github.io/parquetize/

fix: bug in bychunk logic #21

Closed: nbc closed this 1 year ago

nbc commented 1 year ago

parquetize::bychunk tries to read past the end of the file.

On some files it works; on others it doesn't.

On the file https://www2.census.gov/programs-surveys/ahs/2021/AHS%202021%20National%20PUF%20v1.0%20Flat%20SAS.zip it fails:

> sas <- haven::read_sas("ahs2021n.sas7bdat")
> nrow(sas)
[1] 64141
> parquetize::table_to_parquet("ahs2021n.sas7bdat", "tmp/2", by_chunk = TRUE, chunk_size = 10000)
✔ The SAS file is available in parquet format under tmp/2/ahs2021n1-10000.parquet
✔ The SAS file is available in parquet format under tmp/2/ahs2021n10001-20000.parquet
✔ The SAS file is available in parquet format under tmp/2/ahs2021n20001-30000.parquet
✔ The SAS file is available in parquet format under tmp/2/ahs2021n30001-40000.parquet
✔ The SAS file is available in parquet format under tmp/2/ahs2021n40001-50000.parquet
✔ The SAS file is available in parquet format under tmp/2/ahs2021n50001-60000.parquet
✔ The SAS file is available in parquet format under tmp/2/ahs2021n60001-70000.parquet
Error: Failed to parse /home/nc/travail/R/Rexploration/rdata/sas/ahs2021n.sas7bdat: Invalid file, or file has unsupported features.
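
The transcript above suggests the loop issues one more read after the last, partial chunk (rows 60001 to 64141, even though the file name says 60001-70000). As a rough illustration only, and not the package's actual code or the fix in this PR, the sketch below stops as soon as a chunk comes back shorter than `chunk_size`. `make_reader` and `read_by_chunk` are hypothetical names; the stand-in reader imitates a reader with `skip` and `n_max` arguments and is assumed to fail when asked to start beyond the last row.

```r
# Hypothetical stand-in for a chunked reader with skip/n_max arguments.
# Assumption: it errors when skip is beyond the last row and returns
# zero rows when skip equals the row count.
make_reader <- function(total_rows) {
  function(skip, n_max) {
    if (skip > total_rows) stop("read past end of file")
    n <- max(0, min(n_max, total_rows - skip))
    data.frame(row = seq_len(n) + skip)
  }
}

# Read chunk by chunk and stop at the first short (or empty) chunk,
# so no read is issued beyond the end of the data.
read_by_chunk <- function(read_chunk, chunk_size) {
  ranges <- list()
  skip <- 0
  repeat {
    chunk <- read_chunk(skip = skip, n_max = chunk_size)
    if (nrow(chunk) == 0) break
    # Record the rows actually read, not the nominal chunk bounds.
    ranges[[length(ranges) + 1]] <- c(skip + 1, skip + nrow(chunk))
    if (nrow(chunk) < chunk_size) break  # short chunk: this was the last one
    skip <- skip + chunk_size
  }
  ranges
}

# 64141 rows in chunks of 10000, as in the report above.
ranges <- read_by_chunk(make_reader(64141), chunk_size = 10000)
```

With these numbers the loop performs seven reads and the last recorded range is 60001 to 64141. If the real reader also fails when `skip` equals the row count exactly, files whose row count is an exact multiple of `chunk_size` would need an additional guard.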

ddotta commented 1 year ago

Hi @nbc! Great job and great PR!
Thanks