apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.48k stars 3.52k forks

table_to_parquet factor handling #44509

Open pomatim opened 3 days ago

pomatim commented 3 days ago

Describe the enhancement requested

Hi, excellent package! I've just noticed that when using parquetize::table_to_parquet, factors are not converted to strings; instead they are turned into numeric vectors (using the underlying factor coding). For example, if you have a labelled factor in Stata and convert the file to Parquet with table_to_parquet, you lose the value labels. One could of course load the Stata file into R and then write it to Parquet (and/or convert all factors to strings in R), but that means loading the original file into R memory first. If table_to_parquet could automatically convert all factors to strings, it would save a huge amount of time. Thank you!
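For reference, here is a minimal sketch of the in-memory workaround mentioned above. It is not the requested feature and does not show how table_to_parquet behaves internally. The data frame and its column names are made up for illustration and stand in for an imported Stata file. Only base R is needed; the final write step uses arrow::write_parquet and is skipped if arrow is not installed.

```r
# Toy data standing in for an imported Stata file (hypothetical columns).
df <- data.frame(
  id = 1:3,
  region = factor(c("north", "south", "north"), levels = c("north", "south"))
)

# The underlying integer codes: all that remains if the labels are dropped.
codes <- as.integer(df$region)

# Convert every factor column to character so the labels survive as strings.
factor_cols <- vapply(df, is.factor, logical(1))
df[factor_cols] <- lapply(df[factor_cols], as.character)

# Optional: write the converted data frame if the arrow package is available.
if (requireNamespace("arrow", quietly = TRUE)) {
  arrow::write_parquet(df, tempfile(fileext = ".parquet"))
}
```

The drawback is the one described above: the whole table has to fit in R memory before it can be written.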

Component(s)

R

eitsupi commented 1 day ago

This is the arrow package's issue tracker. Are you sure the issue is with the arrow package? If so, please post a reproducible example to help solve the problem.

If you are looking for the issue tracker for the parquetize package, I think it is https://github.com/ddotta/parquetize