JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable

Should weakrefstrings be false by default? #68

Closed bicycle1885 closed 6 years ago

bicycle1885 commented 6 years ago

I often see corrupted strings when I use Feather.jl with DataFrames.jl. I'm not sure why, but the corruption goes away when I set weakrefstrings=false while reading feather files. If weak-ref strings are not safe in general, I think they should be turned off by default.

nalimilan commented 6 years ago

Could you file an issue about the corruption you are seeing? That's not supposed to happen. Was it with the latest version?

bicycle1885 commented 6 years ago

Unfortunately, I cannot upload data to reproduce the problem. Here is a screenshot of it (I know this join doesn't make sense): [screenshot: feather]

frog1 is a data frame read from a feather file. gene is a column that stores gene names (strings). This is 100% reproducible on my machine, and it goes away if I set weakrefstrings to false when reading the data. I'm using the latest release of Feather.jl (v0.3.0, but it seems to be identical to v0.3.1) and DataFrames.jl (v0.11.1) on Julia 0.6.1.
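For readers who want to try the workaround described above, a minimal sketch follows. It assumes Feather.jl v0.3.x on Julia 0.6, as in this thread. The path "data.feather" is a placeholder and not a file from this issue.

```julia
using Feather

# "data.feather" is a hypothetical path used only for illustration.
# Passing weakrefstrings=false disables weak-ref strings for this read,
# which is the setting reported above to make the corruption go away.
df = Feather.read("data.feather", weakrefstrings=false)
```

Only this one read call changes. Any later join on the resulting data frame is written the same way as before.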

bicycle1885 commented 6 years ago

I've made a reproducible data file (gzip compressed to upload): test.feather.gz

Or, you can generate the same file with the following script.

using Feather
using DataFrames

M = 30
N = 50
srand(1234)
data = hcat(DataFrame(gene=["g.$(x)" for x in 1:M]), DataFrame(Dict(Symbol("cell", i) => rand(0:100, M) for i in 1:N)))
Feather.write("test.feather", data)

The MD5 hash value is 54d00ea81161fd54aafb409aaba9db7c.

Then you can reproduce data corruption as follows:

~/.j/v/Feather ((v0.3.0)|…) $ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.1 (2017-10-24 22:15 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-apple-darwin14.5.0

julia> using Feather

julia> test = Feather.read("test.feather");

julia> join(test, test, on=:gene)
30×101 DataFrames.DataFrame. Omitted printing of 90 columns
│ Row │ gene          │ cell1 │ cell10 │ cell11 │ cell12 │ cell13 │ cell14 │ cell15 │ cell16 │ cell17 │ cell18 │
├─────┼───────────────┼───────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┤
│ 1   │ °©l           │ 98    │ 3      │ 24     │ 98     │ 25     │ 67     │ 62     │ 19     │ 3      │ 90     │
│ 2   │ \0.2          │ 7     │ 53     │ 49     │ 31     │ 13     │ 89     │ 69     │ 95     │ 49     │ 13     │
│ 3   │ °¬l           │ 38    │ 46     │ 94     │ 17     │ 65     │ 2      │ 55     │ 86     │ 78     │ 85     │
│ 4   │ Ð@ï           │ 46    │ 86     │ 47     │ 27     │ 51     │ 86     │ 92     │ 4      │ 18     │ 80     │
│ 5   │ \x10qC        │ 72    │ 33     │ 43     │ 67     │ 46     │ 87     │ 97     │ 35     │ 23     │ 89     │
│ 6   │ \0.6          │ 76    │ 99     │ 84     │ 95     │ 69     │ 88     │ 91     │ 90     │ 60     │ 23     │
│ 7   │ \0.7          │ 47    │ 91     │ 86     │ 91     │ 47     │ 79     │ 23     │ 0      │ 7      │ 32     │
│ 8   │ \0.8          │ 25    │ 81     │ 63     │ 53     │ 3      │ 73     │ 84     │ 30     │ 27     │ 82     │
│ 9   │ P¯l           │ 58    │ 53     │ 6      │ 2      │ 15     │ 98     │ 46     │ 65     │ 54     │ 11     │
│ 10  │ °¯l\x1d       │ 59    │ 76     │ 42     │ 93     │ 73     │ 94     │ 24     │ 79     │ 28     │ 22     │
│ 11  │ \u80G\u85\x1e │ 19    │ 61     │ 81     │ 16     │ 15     │ 53     │ 80     │ 6      │ 11     │ 74     │
│ 12  │ ðG\u85\x1e    │ 24    │ 51     │ 19     │ 11     │ 3      │ 28     │ 51     │ 1      │ 67     │ 14     │
⋮
│ 18  │ \0.18         │ 62    │ 50     │ 65     │ 17     │ 67     │ 22     │ 61     │ 42     │ 86     │ 83     │
│ 19  │ \0.19         │ 28    │ 91     │ 22     │ 93     │ 58     │ 84     │ 67     │ 47     │ 26     │ 87     │
│ 20  │ \0.20         │ 62    │ 61     │ 77     │ 22     │ 78     │ 17     │ 28     │ 36     │ 98     │ 25     │
│ 21  │ \0.21         │ 59    │ 12     │ 69     │ 44     │ 100    │ 29     │ 74     │ 64     │ 69     │ 69     │
│ 22  │ \0.22         │ 34    │ 1      │ 48     │ 7      │ 31     │ 2      │ 27     │ 71     │ 21     │ 62     │
│ 23  │ \0.23         │ 60    │ 19     │ 26     │ 55     │ 51     │ 86     │ 43     │ 73     │ 20     │ 74     │
│ 24  │ \0.24         │ 27    │ 64     │ 6      │ 37     │ 48     │ 83     │ 8      │ 95     │ 80     │ 7      │
│ 25  │ \x03\0\0\0    │ 1     │ 33     │ 45     │ 20     │ 0      │ 80     │ 44     │ 42     │ 59     │ 90     │
│ 26  │ \x01\0\0\0    │ 87    │ 42     │ 8      │ 46     │ 84     │ 75     │ 42     │ 16     │ 43     │ 42     │
│ 27  │ \x02\0\0\0    │ 20    │ 63     │ 8      │ 95     │ 50     │ 44     │ 76     │ 38     │ 59     │ 99     │
│ 28  │ \x01\0\0\0    │ 18    │ 14     │ 75     │ 67     │ 14     │ 36     │ 92     │ 44     │ 93     │ 13     │
│ 29  │ \x02\0\0\0    │ 10    │ 86     │ 92     │ 68     │ 60     │ 29     │ 37     │ 17     │ 12     │ 72     │
│ 30  │ Ы}\x1e       │ 14    │ 34     │ 46     │ 14     │ 71     │ 83     │ 36     │ 50     │ 62     │ 17     │

bicycle1885 commented 6 years ago

Can you reproduce the problem? I think this is a serious bug if it is reproducible.

nalimilan commented 6 years ago

Thanks for the reproducer, that's very useful. I can reproduce the problem locally, though it doesn't happen all the time.

We're all quite busy currently, but I'm sure @quinnj will eventually look at this.

ExpandingMan commented 6 years ago

Closing this issue as it is no longer applicable.

bicycle1885 commented 6 years ago

Could you clarify a little why this is no longer applicable?

ExpandingMan commented 6 years ago

Sorry, I've gotten a bit ahead of the documentation here. Stay tuned for updated docs!

Basically, the package now loads data frames whose columns are objects you can think of as wrapped pointers, which let you lazily load whatever data you want with little effort. Columns of strings are therefore somewhat analogous to WeakRefStringArray. One key difference is that if v is one of these columns, then v[a:b] returns a proper Vector{String}, so for the most part things will behave much as they did before with weakrefstrings=false. You can still copy an entire Feather file into memory using Feather.materialize. As I said, this will all become much clearer when I update the docs, which I will do some time before next week.
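To make the "wrapped pointer" idea above concrete, here is a small self-contained sketch written for Julia 1.x. It is not Feather.jl's implementation: the type LazyStringColumn and its fields are hypothetical, and an ordinary byte vector stands in for whatever buffer the package actually uses. It only illustrates a column that reads strings on demand and returns an owned Vector{String} when indexed with a range.

```julia
# Hypothetical illustration only; this is NOT Feather.jl's actual column type.
struct LazyStringColumn <: AbstractVector{String}
    bytes::Vector{UInt8}   # raw UTF-8 data for all strings, stored back to back
    offsets::Vector{Int}   # 0-based start of each string; has length n + 1
end

Base.size(c::LazyStringColumn) = (length(c.offsets) - 1,)
Base.IndexStyle(::Type{LazyStringColumn}) = IndexLinear()

# Scalar indexing copies one element's bytes into an owned String.
function Base.getindex(c::LazyStringColumn, i::Int)
    @boundscheck checkbounds(c, i)
    lo = c.offsets[i] + 1
    hi = c.offsets[i + 1]
    return String(c.bytes[lo:hi])
end

# Range indexing materializes an ordinary Vector{String}.
Base.getindex(c::LazyStringColumn, r::UnitRange{Int}) = String[c[i] for i in r]

bytes = Vector{UInt8}("g.1g.2g.10")
col = LazyStringColumn(bytes, [0, 3, 6, 10])

slice = col[1:2]          # owned copies of the first two strings
bytes[1] = UInt8('X')     # overwrite the underlying buffer afterwards
```

After the last line, slice still holds the strings it copied, while col[1] reflects the modified buffer. That is the practical difference between a materialized Vector{String} and a lazy view over shared bytes.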

You should no longer see corruption on master; it has been thoroughly tested. If you do, please feel free to open another issue!

bicycle1885 commented 6 years ago

Thank you for your explanation! I’ll try it tomorrow.