JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

serialize/deserialize require types to be defined #10305

Closed jakebolewski closed 9 years ago

jakebolewski commented 9 years ago
julia> @enum(F,g=0xffffffff, h)

julia> open("test.jls", "w") do io
       serialize(io, h)
       end

julia>
julia/base [jcb/enumerr●] » julia-dev
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+3566 (2015-02-23 18:27 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 4fc5695* (0 days old master)
|__/                   |  x86_64-apple-darwin14.1.0

julia> deserialize(open("test.jls"))
ERROR: UndefVarError: F not defined
 in deserialize at serialize.jl:500
 in handle_deserialize at serialize.jl:352
 in deserialize at serialize.jl:335

quinnj commented 9 years ago

Hmmm, tricky. I'm not super familiar with the serialization process or best practices there, but it seems that we need to treat Enums specially: tag them as Enum when serializing and, when deserializing, define the specific enum type if it's not already defined.

JeffBezanson commented 9 years ago

This has nothing to do with enums. serialize and deserialize don't define types; they assume the same definitions are present. This is a fundamental design decision, aimed at the case of sending objects between processors with the same global state. It wouldn't make sense to send the full definition of each type along with every message.
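A minimal sketch of that behavior, assuming a current Julia release where `serialize` and `deserialize` live in the `Serialization` standard library (in the 0.4-era session above they were in Base). The enum `Fruit` and its members are hypothetical names used only for illustration.

```julia
using Serialization  # assumption: Julia 1.x, where these functions are in a stdlib

# Hypothetical enum; the macro defines the type Fruit in this session.
@enum Fruit apple banana

path = tempname()
open(path, "w") do io
    serialize(io, banana)   # writes a reference to the type Fruit, not its definition
end

# This round trip succeeds only because Fruit is already defined here.
# In a fresh session, the same @enum definition has to be evaluated
# before deserialize is called; otherwise the type lookup fails.
value = open(deserialize, path)
```

The stream records which type the value belongs to, so the reading side has to supply a matching definition itself.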

JeffBezanson commented 9 years ago

I don't think this behavior of (de)serialize is likely to change.

jakebolewski commented 9 years ago

That is fine. It would be nice to have a more robust built-in persistence mechanism in Base.

I raised this more as a usability issue. The implementation of @enum(Foo, bar, baz) is opaque to the ordinary user, so maybe some docs can be added which explain that the enum macro generates types behind the scenes.
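To make the "types behind the scenes" point concrete, here is a small sketch, assuming a current Julia release (details may differ from the 0.4-dev build in the report). The names `Color`, `red`, and `green` are hypothetical.

```julia
# Hypothetical enum used only to inspect what the macro generates.
@enum Color red green

# The macro defines a new concrete type that subtypes the abstract type Enum...
is_enum_subtype = Color <: Enum
is_concrete = isconcretetype(Color)

# ...and each listed name becomes a constant whose type is that new type.
member_type = typeof(green)
all_members = instances(Color)

# Members convert to and from their underlying integer values.
green_as_int = Int(green)
first_member = Color(0)
```

Because `Color` is an ordinary named type, a serialized `green` can only be read back in a session where `Color` has been defined.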

tkelman commented 9 years ago

"It would be nice to have a more robust built-in persistence mechanism in Base."

Why expend a lot of effort trying to re-implement hdf5 when it already exists and works well? Not sure avoiding the binary dependency there is worth trying to re-tread the same ground in pure Julia.

jiahao commented 9 years ago

Because HDF5 is not robust (at least, its standard libhdf5 implementation isn't). It is a great format for storing matrices you don't need to change anymore, but that's pretty much it. It is extremely vulnerable to file corruption from (say) unfinalized calls to libhdf5 (if your network connection to a process writing your data to a .h5 breaks, you play Russian roulette as to whether your file is still readable when you reconnect). There is no built-in versioning. There is no built-in mechanism for specifying custom data types (JLD basically serializes anything that isn't natively representable, making things unusably slow). Heck, strings aren't even properly supported as a native HDF5 data type. Have relational data? Support for tables is very basic; it is very slow to update an existing table if you need to do anything more sophisticated than a straight-up append. Good luck trying to do a join or merge without getting impatient and reading the table into memory, doing everything in core, then dumping it back into a new HDF5 file. Want concurrent writes? SOL.

tl;dr: HDF5 was not built for concurrency, or streaming data, or mutable data.

/rant

jiahao commented 9 years ago

Working with HDF5 has given me new appreciation for what database people do. HDF5 is a file format, not a database.

BTW I have to give props to PyTables for doing a staggeringly good job of ameliorating the most painful of these usability problems, but it is still fundamentally limited by libhdf5.

tkelman commented 9 years ago

Use it for what it's good at... if you want to design something new that's better at the things where hdf5 gives you trouble, go right ahead, but that problem doesn't need to be solved at a core language or standard library level. (Julia's not out to solve every well-trodden database problem in the world either...)

jiahao commented 9 years ago

Well, you did ask "why reinvent HDF5"...

Substantiating the premise of your statement, that a programming language should keep itself at arm's length from the mechanics of persistent storage, is one of the most interesting research questions of the decade IMO. Just about every database person I've spoken to has been pathologically averse to the idea of allowing arbitrary code execution on data stored within their systems. They would much rather handle all the computations by restricting users to express their needs in a DSL, which is whatever query language the DBMS exposes to the user. Data scientists find the very notion of writing PCA in SQL so ridiculous that the first stage of any serious data analysis pipeline is to dump everything out of the DBMS into CSV or some other convenient, non-curated format.

Granted, the current state of affairs does not automatically mean that programming languages should become databases. However, there are still interesting questions about expressing the needs of persistent storage layers within the infrastructure already present in a general-purpose programming language.

are two examples that come to mind of places where domain specific languages remain firmly entrenched.

tkelman commented 9 years ago

Sure, ripe research topic (one that I have to admit I don't find particularly exciting, but very much not my field), fertile ground for library work. Blaze is working on some good ideas in this domain. So is AMPLab.

Simple data storage and unsettled research questions on concurrent, streaming, mutable database and analysis technology are rather different points on this spectrum.

jiahao commented 9 years ago

Users tend to outgrow simple tools. No time like the present to plan ahead.

johnmyleswhite commented 9 years ago

"any serious data analysis pipeline"

I'd be very cautious about making that statement. In my experience, SQL is by far the best language for expressing computations that are tied to production systems.

jiahao commented 9 years ago

"Serious", of course, is subjective.

I can only imagine what monstrous SQL queries people invent when what they really want to do can be better expressed as linear algebra.
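For readers unfamiliar with the PCA example raised earlier in the thread, here is a rough sketch of how compactly it reads as linear algebra, using only Julia's `LinearAlgebra` and `Statistics` standard libraries. The data matrix `X` is made up for illustration, and this is one common SVD-based formulation, not something proposed by anyone in the thread.

```julia
using LinearAlgebra, Statistics

# Hypothetical data: 5 observations (rows) of 3 variables (columns).
X = [2.0 0.0 1.0;
     3.0 1.0 0.0;
     4.0 0.0 2.0;
     5.0 1.0 1.0;
     6.0 0.0 0.0]

Xc = X .- mean(X, dims=1)          # center each column
F = svd(Xc)                        # Xc == F.U * Diagonal(F.S) * F.Vt
components = F.V                   # columns are the principal directions
scores = Xc * components           # observations in principal-component coordinates
explained = F.S .^ 2 ./ (size(X, 1) - 1)   # variance captured by each component
```

The whole computation is a centering step, one factorization, and one matrix product, which is the kind of expression the comments above contrast with a query language.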