blaze / blaze

NumPy and Pandas interface to Big Data
blaze.pydata.org
BSD 3-Clause "New" or "Revised" License
3.18k stars 390 forks source link

Segmentation faults after small change to `Symbol` constructor argument #1523

Open kwmsmith opened 8 years ago

kwmsmith commented 8 years ago

@llllllllll this line here:

https://github.com/blaze/blaze/blob/master/blaze/expr/expressions.py#L307

Leads to problems in another project; essentially, that project used strings for symbol dshapes.

I'm in agreement that normalizing symbol dshapes up front is the right thing to do, but we may need to revert this change for the 0.10.2 release and put in the proper normalization for the 0.11 release.

I tried reverting that line, but it leads to more concerning issues that I'm having a really hard time understanding.

With that line changed to:

@copydoc(Symbol)
def symbol(name, dshape, token=None):
    return Symbol(name, dshape, token or 0)

I'm now seeing repeatable segmentation faults on both OS X and Linux:

https://travis-ci.org/blaze/blaze/jobs/136301015

Such a small change resulting in segfaults makes me think of the recent isidentical refactoring merge.

Can you help me make sense of this?

llllllllll commented 8 years ago

That line should just be normalizing the values to dshapes. Is your project calling Symbol directly. This is unsupported because the normalization needs to happen in symbol. The segfault is caused by a failure in the __getattr__ because the dshape is a string and not a datashape, so that recurses and stack overflows. Somewhere CPython is missing something like a Py_EnterRecursiveCall

kwmsmith commented 8 years ago

Essentially the project is serializing blaze expressions, storing them in a DB, and bringing them back for later use. So the Symbol class and it's arguments are stored in the DB, since that's how Blaze's to_tree() represents things.

With the dshape normalization introduced here--which is more robust--it made the old string-based-dshape expressions incompatible with the new format, and our tests revealed the discrepancy.

The reason I'm looking into reverting this one change temporarily is due to the effort required to migrate the DB to the new Symbol objects with dshapes rather than strings.

I think the larger point is that there is a third-party consumer of Blaze's serialization spec beyond blaze server and blaze client. Whenever there's a backwards incompatible change to that serialization format, we need a way to clearly identify when that happens.

It makes sense to me to:

@llllllllll @necaris Thoughts?

necaris commented 8 years ago

Just to be clear, we are suggesting a serialization protocol version here (separate from the Blaze version)? I like this plan!

kwmsmith commented 8 years ago

Just to be clear, we are suggesting a serialization protocol version here (separate from the Blaze version)? I like this plan!

Yes, the serialization protocol version would be independent from Blaze's version, and the serialization version would only increase when a backwards incompatible change was introduced. Ideally that version number will change slowly.

It's similar in concept to separate API and ABI version numbers for compiled libraries.

llllllllll commented 8 years ago

The expression memoization does not work if the inputs are not normalized. The memoization was more than a 2x speedup for many queries in our production app so I would really prefer not reverting this.

I think that this might be a lot of overhead to add for the server. I am suprised that the blaze.server serialization is being used by a third party consumer, maybe it should be private? imho we should have the flexibility to make changes as long as it is supported in the server and client. If we want a more robust, versioned serialization format that should live somewhere outside of the server.

I wouldn't mind having a more persistant format with built in version checking / migration. I would just like to name the key _version because the __*__ namespace is reserved for the interpreter's implementation details (I really want to rename __inputs__).

kwmsmith commented 8 years ago

@llllllllll I've created a new release-0.10.2 branch. Due to the needs of the downstream project, it has the bugfixes since 0.10.1 but does not include the isidentical / expression memoization work. Those improvements are still on master, and will be part of the 0.11 release.

Having a separate release branch is not ideal, but it was judged the best option given current time constraints. That release branch is only for the 0.10.2 release; once it's cut, no more development will occur on it.

I'm in favor of the memoization improvements, especially with their performance improvements; we just need more time to migrate existing blaze expressions to the new normalized versions in that project.

I think that this might be a lot of overhead to add for the server. I am suprised that the blaze.server serialization is being used by a third party consumer, maybe it should be private?

The view we have is that Blaze has at least two APIs for representing and interacting with expressions, both of which are public-facing: the programmatic API for blaze expressions, and their serialized forms.

We're using the serialized expression form in both a web client front-end, and in a storage backend. So the serialization protocol is definitely a public facing one.

imho we should have the flexibility to make changes as long as it is supported in the server and client. If we want a more robust, versioned serialization format that should live somewhere outside of the server.

How would that work? Can you describe your ideas here in more detail?

llllllllll commented 8 years ago

What I mean is that there are different constraints for serialization for long term storage and ephemeral transport. In the blaze server, we want to minimize the overhead and to do that we should be able to assume the client/server are on the same version and then go from there with a faster format. When storing things in a databas or on disk we will want to be able to migrate versions and recover all of the information with little assumptions. We could leave the blaze.server.serialization as an implementation detail of the server, and then provide a blaze.serialization for storage.

necaris commented 8 years ago

+1 to making this distinction between ephemeral transport and storage, although I would chime in that even for ephemeral transport there is a lot of utility in ensuring client & server are speaking the same protocol (even if that's just putting big flags in the documentation that the client-server stream is subject to change and versions must be checked -- like the Python pickle module).

skrah commented 8 years ago

Is there a small reproducer for the segfault on Linux? Travis CI reports the default stack size (8MB) on Linux, and CPython's default recursion limit is quite conservative for Linux.

If this is a CPython bug, I'd like to fix it (in 3.5/3.6 at least).