gesistsa / rio

🐟 A Swiss-Army Knife for Data I/O
http://gesistsa.github.io/rio/
594 stars 77 forks source link

About the compression mechanism #400

Closed chainsawriot closed 2 months ago

chainsawriot commented 2 months ago

As there are many bugs (#395 #396 #399) after really testing this bunch of code (#354)

Other than the issues of compression detection, this

https://github.com/gesistsa/rio/blob/2fb1373a497f586dacce095300452ccc5a42cf37/R/compression.R#L25-L44

is super not robust (e.g. it won't return a file and o is still 0). Instead of using this, the plan is to use the battle-tested R.utils compression. Well, even data.table is using it and that's the reason for introducing R.utils in the first place #362 .

https://github.com/Rdatatable/data.table/blob/2487c61656335764980e478c323f7e6ce4e6d4ca/R/fread.R#L123

But the problem is that R.utils doesn't support xz. (But I think it should be okay to cut xz. No one complains about xz not functioning, although it is not functioning for so many years.)

chainsawriot commented 2 months ago

Despite the code has been added since 2016 e980077 , the (current) doc never says that .{format}.xz, .{format}.bzip, and .{format}.gz (except ".csv.gz") are supported.

https://github.com/gesistsa/rio/blob/144ebca1e337654ac7a9c7492082a9a25b558b12/R/export.R#L11

("compressed directory" is weird, probably means "compressed file.")

I think another solution is to clarify only ".{format}.zip", ".{formart}.tar", ".{format}.tar.gz", ".csv.gz" (via data.table::fwrite()) and ".csv.bz" (via data.table::fwrite()) are supported, because they are tested. Probably this is not a breaking change, because gzip, bzip2 and xz never work.

@schochastics Thoughts?

schochastics commented 2 months ago

If we properly document what is supported and what not we can take the "not-support-xz" road and use R.utils to fix the remaining compression

chainsawriot commented 2 months ago

Cool. Then let's do it.