JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable

Feather file causes segfault in R #117

Open bjornerstedt opened 5 years ago

bjornerstedt commented 5 years ago

Saving a DataFrame with Feather causes R to crash when reading the file. I am using Feather 0.5.1 with Julia 1.1. If I create a simple feather file with

df = DataFrame(A = 1:8)
Feather.write("df.feather", df)

I get the following crash in R 3.5.1 with the package feather 0.3.2:

> library(feather)
Warning message:
package ‘feather’ was built under R version 3.5.2 
> ir = read_feather("df.feather")

 *** caught segfault ***
address 0x10ee5c0b0, cause 'memory not mapped'

Traceback:
 1: openFeather(path)
 2: feather(path)
 3: read_feather("df.feather")

Rudi79 commented 5 years ago

The problem seems to be https://github.com/JuliaData/FlatBuffers.jl/issues/38. One way to fix this is to pin the FlatBuffers package at version 0.4.0.
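For anyone hitting this, a sketch of how pinning works with Julia's package manager (the package name FlatBuffers and version 0.4.0 come from the comment above):

```julia
using Pkg

# Pin FlatBuffers at v0.4.0 so Pkg will not upgrade it past the
# version known to produce files the R feather package can read.
Pkg.pin(PackageSpec(name="FlatBuffers", version="0.4.0"))

# To undo the pin later:
# Pkg.free("FlatBuffers")
```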

danielfm123 commented 4 years ago

Feather files created with Julia are bigger than files of the same dataset created with R. The Feather format is important for using multiple tools in an analytics workflow.

ExpandingMan commented 4 years ago

I'm not too surprised that the files created in Julia are bigger. As I recall, we've seen examples of some of the other writers automatically deciding to write dictionary encoded (i.e. compressed) columns. In this package we only do this if the original column is a CategoricalArray (i.e. already dictionary encoded). In some cases the resulting difference in file size can be quite huge.
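To illustrate the `CategoricalArray` case described above, a minimal sketch (the column contents and file name are made up for the example):

```julia
using DataFrames, CategoricalArrays, Feather

# A low-cardinality string column: as a CategoricalArray it is stored
# dictionary-encoded (a small pool of unique values plus integer codes),
# so Feather.write emits a dictionary-encoded column instead of
# repeating each string.
df = DataFrame(city = categorical(rand(["Oslo", "Lund", "Turku"], 10_000)))

Feather.write("cities.feather", df)
```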

To do the same, we'd need some sort of heuristic for deciding when to automatically use dictionary-encoded columns.

nalimilan commented 4 years ago

You could also support PooledArrays, and expect people to use that when they want to save memory.
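A sketch of what that could look like from the user side, assuming (hypothetically) that `Feather.write` learned to detect `PooledArray` columns and write them dictionary-encoded — the package does not do this today:

```julia
using DataFrames, PooledArrays, Feather

# PooledArray stores one copy of each unique value plus integer refs,
# so a repetitive column costs far less memory than a plain Vector.
df = DataFrame(status = PooledArray(rand(["active", "inactive"], 10_000)))

# Hypothetical behavior: Feather.write would notice the pooled column
# and emit a dictionary-encoded Feather column, as it already does for
# CategoricalArray columns.
Feather.write("status.feather", df)
```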