Closed bicycle1885 closed 6 years ago
Could you file an issue about the corruption you are seeing? That's not supposed to happen. Was it with the latest version?
Unfortunately, I cannot upload data to reproduce the problem. Here is a screenshot of it (I know this `join` doesn't make sense):
`frog1` is a data frame read from a feather file. `gene` is a column that stores gene names (strings). This is 100% reproducible on my machine, and it goes away if I set `weakrefstrings` to `false` when reading data. I'm using the latest release of Feather.jl (v0.3.0, but it seems to be identical to v0.3.1) and DataFrames.jl (v0.11.1) on Julia 0.6.1.
I've made a reproducible data file (gzip compressed to upload): test.feather.gz
Or, you can generate the same file with the following script.
using Feather
using DataFrames
M = 30
N = 50
srand(1234)
data = hcat(DataFrame(gene=["g.$(x)" for x in 1:M]), DataFrame(Dict(Symbol("cell", i) => rand(0:100, M) for i in 1:N)))
Feather.write("test.feather", data)
The MD5 hash value is 54d00ea81161fd54aafb409aaba9db7c.
Then you can reproduce data corruption as follows:
~/.j/v/Feather ((v0.3.0)|…) $ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.1 (2017-10-24 22:15 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-apple-darwin14.5.0
julia> using Feather
julia> test = Feather.read("test.feather");
julia> join(test, test, on=:gene)
30×101 DataFrames.DataFrame. Omitted printing of 90 columns
│ Row │ gene │ cell1 │ cell10 │ cell11 │ cell12 │ cell13 │ cell14 │ cell15 │ cell16 │ cell17 │ cell18 │
├─────┼───────────────┼───────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┼────────┤
│ 1 │ °©l │ 98 │ 3 │ 24 │ 98 │ 25 │ 67 │ 62 │ 19 │ 3 │ 90 │
│ 2 │ \0.2 │ 7 │ 53 │ 49 │ 31 │ 13 │ 89 │ 69 │ 95 │ 49 │ 13 │
│ 3 │ °¬l │ 38 │ 46 │ 94 │ 17 │ 65 │ 2 │ 55 │ 86 │ 78 │ 85 │
│ 4 │ Ð@ï │ 46 │ 86 │ 47 │ 27 │ 51 │ 86 │ 92 │ 4 │ 18 │ 80 │
│ 5 │ \x10qC │ 72 │ 33 │ 43 │ 67 │ 46 │ 87 │ 97 │ 35 │ 23 │ 89 │
│ 6 │ \0.6 │ 76 │ 99 │ 84 │ 95 │ 69 │ 88 │ 91 │ 90 │ 60 │ 23 │
│ 7 │ \0.7 │ 47 │ 91 │ 86 │ 91 │ 47 │ 79 │ 23 │ 0 │ 7 │ 32 │
│ 8 │ \0.8 │ 25 │ 81 │ 63 │ 53 │ 3 │ 73 │ 84 │ 30 │ 27 │ 82 │
│ 9 │ P¯l │ 58 │ 53 │ 6 │ 2 │ 15 │ 98 │ 46 │ 65 │ 54 │ 11 │
│ 10 │ °¯l\x1d │ 59 │ 76 │ 42 │ 93 │ 73 │ 94 │ 24 │ 79 │ 28 │ 22 │
│ 11 │ \u80G\u85\x1e │ 19 │ 61 │ 81 │ 16 │ 15 │ 53 │ 80 │ 6 │ 11 │ 74 │
│ 12 │ ðG\u85\x1e │ 24 │ 51 │ 19 │ 11 │ 3 │ 28 │ 51 │ 1 │ 67 │ 14 │
⋮
│ 18 │ \0.18 │ 62 │ 50 │ 65 │ 17 │ 67 │ 22 │ 61 │ 42 │ 86 │ 83 │
│ 19 │ \0.19 │ 28 │ 91 │ 22 │ 93 │ 58 │ 84 │ 67 │ 47 │ 26 │ 87 │
│ 20 │ \0.20 │ 62 │ 61 │ 77 │ 22 │ 78 │ 17 │ 28 │ 36 │ 98 │ 25 │
│ 21 │ \0.21 │ 59 │ 12 │ 69 │ 44 │ 100 │ 29 │ 74 │ 64 │ 69 │ 69 │
│ 22 │ \0.22 │ 34 │ 1 │ 48 │ 7 │ 31 │ 2 │ 27 │ 71 │ 21 │ 62 │
│ 23 │ \0.23 │ 60 │ 19 │ 26 │ 55 │ 51 │ 86 │ 43 │ 73 │ 20 │ 74 │
│ 24 │ \0.24 │ 27 │ 64 │ 6 │ 37 │ 48 │ 83 │ 8 │ 95 │ 80 │ 7 │
│ 25 │ \x03\0\0\0 │ 1 │ 33 │ 45 │ 20 │ 0 │ 80 │ 44 │ 42 │ 59 │ 90 │
│ 26 │ \x01\0\0\0 │ 87 │ 42 │ 8 │ 46 │ 84 │ 75 │ 42 │ 16 │ 43 │ 42 │
│ 27 │ \x02\0\0\0 │ 20 │ 63 │ 8 │ 95 │ 50 │ 44 │ 76 │ 38 │ 59 │ 99 │
│ 28 │ \x01\0\0\0 │ 18 │ 14 │ 75 │ 67 │ 14 │ 36 │ 92 │ 44 │ 93 │ 13 │
│ 29 │ \x02\0\0\0 │ 10 │ 86 │ 92 │ 68 │ 60 │ 29 │ 37 │ 17 │ 12 │ 72 │
│ 30 │ Ы}\x1e │ 14 │ 34 │ 46 │ 14 │ 71 │ 83 │ 36 │ 50 │ 62 │ 17 │
Can you reproduce the problem? I think this is a serious bug if it is reproducible.
Thanks for the reproducer, that's very useful. I can reproduce the problem locally, though it doesn't happen all the time.
We're all quite busy currently, but I'm sure @quinnj will eventually look at this.
Closing this issue as it is no longer applicable.
Could you clarify a little why this is no longer applicable?
Sorry, I've gotten a bit ahead of the documentation here. Stay tuned for updated docs!
Basically, the package now by default loads data frames whose columns are objects that you can think of as wrapped pointers, which let you lazily load whatever data you want. So columns of strings are now roughly analogous to `WeakRefStringArray`. One key difference is that if `v` is one of these columns, then `v[a:b]` returns a proper `Vector{String}`, so for the most part things will behave much as they did before with `weakrefstrings=false`. You can still copy an entire Feather file into memory using `Feather.materialize`. Like I said, this will all become much clearer when I update the docs, which I will do some time before next week.
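For readers following along, the behaviour described above might look roughly like this on the reworked package. This is only a sketch, not taken from the package docs: it assumes the `test.feather` file from the reproducer script, that `Feather.materialize` accepts a file path, and the `df[:gene]` column-indexing syntax of DataFrames.jl from that era.

```julia
using Feather
using DataFrames

# Default read: columns are lazy wrappers over the file's data
# (per the explanation above), not fully copied vectors.
df = Feather.read("test.feather")

# Slicing a lazy string column; according to the explanation above,
# this should yield an ordinary Vector{String}.
v = df[:gene]          # assumption: DataFrames 0.11-style column access
first_genes = v[1:5]

# Copy the whole file into memory instead of loading lazily.
# Assumption: materialize takes the file path as its argument.
full = Feather.materialize("test.feather")
```

Taking a range slice, or materializing the whole file, gives values that no longer depend on the underlying file buffer, which is the property `weakrefstrings=false` provided before.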
You should not still be seeing corruption on master, it has been thoroughly tested. If you do, please feel free to open another issue!
Thank you for your explanation! I’ll try it tomorrow.
I often see data corruption of strings when I use Feather.jl and DataFrames.jl. I'm not sure why, but it can be suppressed by setting `weakrefstrings=false` when reading feather files. If weak-ref strings are not safe in general, I think they should be turned off by default.
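A minimal sketch of that workaround, assuming Feather.jl v0.3.x on Julia 0.6 and the `test.feather` file produced by the reproducer script in this thread (the keyword name is the one quoted in this report):

```julia
using Feather
using DataFrames

# Disable weak-reference strings so that string columns are copied into
# regular Julia strings rather than referring to the file's buffer.
test = Feather.read("test.feather", weakrefstrings=false)

# The same self-join as in the reproducer; with the option above, the
# reporter no longer saw corrupted values in the gene column.
joined = join(test, test, on=:gene)
```

The trade-off is presumably an extra copy of every string at read time, which is why it was not the default in that release.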