JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.82k stars 5.49k forks source link

Better integration of "foreign types"? #36770

Closed fingolfin closed 1 year ago

fingolfin commented 4 years ago

Two years ago, our PR was accepted which among other things added jl_new_foreign_type to the Julia kernel. We've been relying on that work since then in GAP.jl.

However, it has one annoying limitation: these foreign types are not visible to precompilation. We worked around this by adding an abstract supertype GapObj. Instead of the actual foreign type, we then use GapObj throughout our Julia code. That was fine until we started to write higher level Julia types which contain GapObj members: As we discovered, by having x::GapObj instead of x::ActualConcreteType, various optimization are disabled. I am now looking into working around this by parametrizing all our structs (not sure whether that will work out). But I've also wondered if it might be possible to enhance precompilation to deal with our foreign type. Thinking extremely naively, perhaps one could add a syntax extension foreign type MyForeignType end or some other means (say, some special macro/function call/whatever) to inform the precompiler about the foreign type? It could then parse code referencing this, and would trust that the __init__ section of the package in which it occurs will actually provide such an external type. However, I don't even know whether that is enough information about the type for the precompiler, perhaps it needs much more; perhaps some of it could be added, perhaps others make the whole idea impossible. I clearly don't know enough about how precompilation works. Before I try to dig deeper and find out, I thought I could ask the experts whether this sounds even remotely realistic to achieve? Or perhaps I am overlooking a much simpler solution?

UPDATE 17. June 2022: Just to say, things have evolved quite a bit since I opened this issue. In particular, the foreign type is now initialized in GAP_jll, and this seems enough to enable precompilation in GAP.jl, which uses GAP_jll. So all is good from that point of view. It would still be nice to be able to implement more, e.g. having some kind of interface to enable handling (de)serialization of instances of this type. I realize it'll be up to us to provide one.

vtjnash commented 4 years ago

Since this is supposed to be a "fake" type, that just lets the GC read the layout field, its should be fine to teach precompile just to make an actual copy of the object whenever it encounters one of these DataTypes.

fingolfin commented 4 years ago

@vtjnash apologies, but I don't quite follow: "make an actual copy of the object" -> what does "the object" refer to here?

As far I understand things, there are no instances of the foreign type during precompilation -- indeed, I assume the foreign type does not even exist during precompilation, as __init__ has not yet run, and hence jl_new_foreign_type has not yet been called. Am I wrong?

Assuming I am right, it seems I need a way to tell Julia earlier about the to-be-created foreign type, by inserting something into the code which informs it that MyForeignType, if encountered, refers to a foreign type. My guess is that this might then also affect things like the layout of structs? My (again, naive) idea would be that any MyForeignType member would be stored like an Any, i.e. as a pointer to the actual object (as nothing can be said about the size of instances).

To illustrate the concrete issue: Right now, we have code like this, where GapObj is the supertype of the foreign type:

import GAP
import GAP: GapObj
struct Group
  x::GapObj 
end
foo(G::Group)::Bool = GAP.Groups.IsAbelian(G.x)
G = Group(GAP.Globals.CyclicGroup(10))
# very silly example
function bar(G::Group)
  x = true
  for i = 1:10000
    x = x && foo(G)
  end
end

Unfortunately, just to be able to set GapObj as supertype for our foreign type requires some hacking in some scenarios (namely in the one where Julia is embedded into GAP; then we have to create our foreign type before GAP.jl is loaded, and hence I had to resort to a nasty hack to change its supertype later on, once GAP.jl is loaded; see also https://discourse.julialang.org/t/swizzling-the-super-type-of-a-foreign-julia-type-or-how-evil-must-i-be/33733).

I also wonder whether if we were able to directly write MyForeignType instead of GapObj in those structs, some code could be more performant as it would enable better optimizations.... Though perhaps that's wrong; or perhaps even if it is right, using parametrized types can help mitigate this. I am still experimenting and running tons of benchmarks

timholy commented 2 years ago

Looks like we may have to solve this if we want #43990. By "make a copy" I assume you mean an opaque buffer? How do you figure out how big the object is? I don't see that in the arguments to jl_new_foreign_type.

fingolfin commented 2 years ago

@timholy I don't understand what "make a copy" refers to? @vtjnash brought it up, I asked for clarification, but so far got none.

In general, we can of course extend the foreign type API to cover new needs by Julia; if you can tell me what is needed, I can help come up with a way to provide it.

Note that the description of how things work in Oscar resp. GAP.jl is no longer fully accurate: we've worked around many of the issues we had in the past by moving the initialization of the foreign type into GAP_jll. Then GAP.jl refers to it, which has the advantage that the foreign type is already known to Julia, and GAP.jl and other packages using it can now use the foreign type GapObj directly inside of structs, method arguments etc. The only thing that doesn't work is doing something like const myGapObj = func_that_returns_GapObj() at the top level, because there is currently no way to serialize such an object (new kernel APIs would have to be added for that, I think). Anyway, we are not really unhappy about that right now.

Actually, there is also some hackery going on: for aesthetic reasons, we modify the foreign type in GAP.jl to changes its parent module from GAP_jll to GAP; this is mostly to ensure users are less confused; we tried to do with a custom show method instead in the past, but then e.g. Vector{GapObj} still was printed as Vector{GAP_jll.GapObj}, and we eventually gave up. If this hack is a problem, we can remove it, though (and I'd appreciate any pointers on "how to do it properly" if there are any...)

Internally, there are actually three foreign types, but only one (GapObj) is ever directly interacted with; it is basically a glorified Ref which only stores a pointer to the actual underlying foreign objects, and nothing else.

I could explain more, but that's probably besides the point. I had a look at PR #43990 but could not determine from that what the exact problem is there resp. how it relates to Oscar.jl / GAP.jl / GAP_jll / foreign types. But we are certainly happy to discuss ways to move this forward

timholy commented 2 years ago

Here's the error:

julia> using Oscar
fatal: error thrown and no exception handler available.
ErrorException("Cannot serialize instances of foreign datatypes")
ijl_error at /home/tim/src/julia-master/src/rtutils.c:41
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:978
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:605 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:778
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:605 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:829
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:605 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:660
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:605 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:850
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:605 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:660
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:605 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:850
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:605 [inlined]
serialize_htable_keys at /home/tim/src/julia-master/src/dump.c:1033 [inlined]
ijl_save_incremental at /home/tim/src/julia-master/src/dump.c:2578
jl_write_compiler_output at /home/tim/src/julia-master/src/precompile.c:65
ijl_atexit_hook at /home/tim/src/julia-master/src/init.c:207
jl_repl_entrypoint at /home/tim/src/julia-master/src/jlapi.c:707
main at /home/tim/src/julia-master/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/tim/src/julia-master/usr/bin/julia (unknown line)

Debugging statements show this occurs while serializing GET_FUNC_PTR(::GAP.GapObj, ::Int64), which gets compiled when Oscar is built. Current versions of Julia can't serialize it because it's precompiled in Oscar but owned by GAP, and that inability to serialize external methods is what saves you from hitting the fact that Julia can't serialize foreign types. Were you to force compilation of that method in GAP (e.g., add precompile(GET_FUNC_PTR, (GapObj, Int)) to GAP's source code), you'd hit the same error on current versions of Julia. The point of #43990 is to be able to serialize virtually any precompiled code regardless of ownership, which is why we hit the bug.

timholy commented 2 years ago

It seems one could handle this by adding two new fields to jl_fielddescdyn_t, serialize and deserialize that would act similarly to jl_serialize_value and jl_deserialize_value. Thoughts?

fingolfin commented 2 years ago

So why is serializing GAP_FUNC_PTR a problem then? Looking at the code:

function ADDR_OBJ(obj::GapObj)
    mptr = Ptr{Ptr{Csize_t}}(pointer_from_objref(obj))
    return unsafe_load(mptr)
end
...
function GET_FUNC_PTR(obj::GapObj, narg::Int)
    #@assert TNUM_OBJ(obj) == T_FUNCTION
    #@assert 0 <= narg && narg <= 7
    bag_ptr = Ptr{Ptr{Nothing}}(ADDR_OBJ(obj))
    unsafe_load(bag_ptr, narg + 1)
end

So why does serializing this involve an instance of type GapObj? Or what else is the problem?

It would be really good to know which objects it tries to serialize -- because only for some will serialization be possible, I am afraid...

fingolfin commented 2 years ago

I'd like to debug this (in particular, it would be good to know TNUM_OBJ(obj) for whatever GapObj objects it is trying to serialize) but am not sure how to go about that using gdb/lldb. I am not even sure how to find out the name of the function being serialize (GET_FUNC_PTR). I'll try to find out tonight/next week, but if there are any hints or tricks you could quickly point me at...

fingolfin commented 2 years ago

Oh wait, or is just serializing the type what causes problems, as opposed to serializing instances? Sorry for the clearly naive questiojns

timholy commented 2 years ago

Not naive questions at all.

Here's what's happening: Methods contain a roots field that stores items needed in codegen or code-compression. (The encoding of these roots changed recently, see #43793 and #43881.) The error gets triggered when trying to store a new root for (snapshot of debugging output)

New root for (file /home/tim/.julia/packages/GAP/wF2n6/src/wrappers.jl, line 61) String(Any)
    GAP.GapObj()

So it seems calling https://github.com/oscar-system/GAP.jl/blob/68938062694bc18e01e44f6574a6fcecffd08663/src/wrappers.jl#L61is the problem: calling it forces compilation, which adds a new root to the method, and serializing the new root is what triggers the error.

EDIT: the particular specialization is GAP.Wrappers.String(::GAP.FFE).

timholy commented 2 years ago

Yes, so in this case the error is not in serializing an instance but merely the root

MethodInstance for GAP._call_gap_func(::GAP.GapObj, ::GAP.FFE)

which only involves types. EDIT: sorry, off-by-one bug (C starts counting at 0). It's the root

julia> r = m.roots[4]
GAP: <Attribute "String">

julia> typeof(r)
GAP.GapObj

of GAP.Wrappers.String

I'm no longer sure of the connection to GET_FUNC_PTR. Keep in mind that serialization order is not deterministic, so the error varies from run-to-run.

fingolfin commented 2 years ago

Unfortunately that is a GAP function object and as such will be difficult to serialize. That said, I still think it would be good to add a (de)serialize code interface for foreign types, even if initially it only supports a few types -- at least we'd then have something to build on. Even better would be if these (de)serialize functions would be allowed to raise an exception to help debugging.

In this particular code, GAP.Wrapper.String, defined via @wrap String(x::Any)::Any, is a @generated function, which is meant to effect an optimization. I can turn off this optimization, then resulting function would become more or less

IsString(x) = GAP.Globals.IsString(x)

The @generated variant instead looks up the value GAP.Globals.IsString, and then emits code which directly uses that object.

Perhaps I could also do something like this:

const _IsString = Ref{GapObj}()
function _IsString(x)
  isassigned(_IsString) || _IsString[] = GAP.Globals.IsString
  return _IsString[](x)
end

but I'll have to benchmark how that performs (and whether it works at all :joy:)

timholy commented 2 years ago

I see. So even if we modified jl_new_foreign_type it might not actually fix Oscar?

Another option is to add __precompile__(false) to the top of Oscar. Is precompilation that important to you? Given that it's dancing around an error, It seems like a poster child for why you want to be able to turn off precompilation.

fingolfin commented 2 years ago

If we modify jl_new_foreign_type (or rather, the API for foreign types) to allow (de)serialization, that might eventually allow us to "fix" things, but it'll take time, because in the end we'd have to write a complete serialization scheme for what is essentially an entire separate programming language with its own ecosystem (GAP) -- note that it already has such a scheme, but I don't think it'll be possible to adapt that.

So it's not a solution on the short term, though laying foundations for it sooner rather than later still would be good, IMHO.

As to __precompile__(false) -- you ask "Is precompilation that important to you". I am not sure how to take that question? It seems to be pretty important for you, why wouldn't it be for us? I am confused... perhaps there is some confusion here: what we are discussing here concerns the GAP package, and by transitivity, packages using it, which right now means primarily Oscar. But GAP is only one part of Oscar among many.

At the end of the day, I don't think any Julia package which interfaces with external libraries via ccall can serialize types containing raw pointers to foreign C memory constructs, can they (Julia itself has hardcoded support for BigInt, but anything else...?). In that sense, the only things special about GAP is that (a) it tries hard to leverage the Julia GC instead of resorting to malloc or its own GC, and (b) we have used various tricks, including some uses of @generated, to enable various optimizations; but these are not essential (though of course, everyone prefers their code to be faster), and yes, they kinda try to bypass precompilation; but that's just for a small subset of Oscar.

fingolfin commented 2 years ago

So I tried adding precompile(Wrappers.String, (GapObj,)) to src/GAP.jl and it indeed triggered the crash.

Yet this did not:

foo(x::GapObj) = Globals.String(x)
precompile(foo, (GapObj,))

nor this

const _StringRef = Ref{GapObj}()
function bar(x::GapObj)
    if !isassigned(_StringRef)
        _StringRef[] = GAP.Globals.String
    end
    _StringRef[](x)
end
precompile(bar, (GapObj,))

To be honest, I don't quite understand why these are fine, but if the latter really is, then that would be a good replacement for the current @wrap implementation: according to some microbenchmarking, bar() is only marginally slower than GAP.Wrappers.String(), and still a lot faster than the "direct" call in foo().

timholy commented 2 years ago

The fundamental point is that unless you can serialize all your objects, Oscar is broken with respect to precompilation. The only reason you can make it work as-is is because Julia's current precompilation is very incomplete, and you've discovered a way to design the package that happens to work with Julia's current limitations. But when we remove those limitations, your current status won't be viable anymore. If there were a prospect of serializing GAP objects in the short-term I'd be happy to work with you, but given that it sounds super-hard and won't land anytime soon, we have to begin by acknowledging the fundamental reality that you'll have to drop some of the things that have currently been skirting the limits of viability.

[precompilation] seems to be pretty important for you, why wouldn't it be for us?

Because you're hardly caching any compiled code, so the only thing precompilation really does for you is save parsing and lowering steps (which are quite fast). Here's a comparison between one of "my" packages and Oscar (run on Julia 1.6):

(@v1.6) pkg> activate --temp
  Activating new environment at `/tmp/jl_a2ZhO4/Project.toml`

(jl_a2ZhO4) pkg> add MethodAnalysis, ImageCore, Oscar
<output suppressed>

julia> using MethodAnalysis, ImageCore, Oscar
<output suppressed>

julia> function instances_owned_by(mod::Module)
           # get all child modules of `mod`
           mods = Module[]
           visit(mod) do item
               if item isa Module && parentmodule(item) == mod
                   push!(mods, item)
                   return true
               end
               return false  # don't recurse into Methods, MethodTables, MethodInstances, etc.
           end
           # get all MethodInstances owned by one of the modules in `mods`
           # these are the only MethodInstances that can be precompiled in current versions of Julia
           return filter(methodinstances(mod)) do mi
               m = mi.def
               m isa Method && return m.module ∈ mods
               return m ∈ mods
           end
       end
instances_owned_by (generic function with 1 method)

julia> length(instances_owned_by(ImageCore))
2011

julia> length(instances_owned_by(Oscar))
15

This, despite the fact that Oscar owns 24947 separate methods (wow!) vs ImageCore's 1242. So relatively speaking you're doing about 2500x less precompilation.

But, it's nevertheless true that with precompilation Oscar takes ~16s on my machine to load, and ~30s if you add __precompile__(false). So even the relatively wimpy benefit of saving on parsing and lowering helps you.

Presumably the best way forward would be to split out the GAP-specific parts into a separate package that you load on top of Oscar, leave precompilation of Oscar on, and then set __precompile__(false) on all the parts related to GAP? Basically you want to precompile as much as you can (and if you add some precompilation steps to Oscar, especially with #43990 you may substantially decrease the latency for doing "real work"), but as soon as you introduce any non-serializable element all of its dependencies have to be __precompile__(false).

fingolfin commented 2 years ago

First off: thank you for taking the time to explain, this is very helpful to me. I hope you don't mind me asking more ...

The fundamental point is that unless you can serialize all your objects, Oscar is broken with respect to precompilation.

Sorry for being dense, but: Why do all objects need to be serializable? I do see why e.g. objects we are "calling" or otherwise directly referencing from functions that get precompiled need to be serialized. But most GAP object types don't fall into this category, they can only be generated at runtime (note that for Julia, they all appear as GapObj anyway).

Also: doesn't this also affect other packages which e.g. store pointers to external tools, like e.g. CxxWrap.jl does? AFAIK there is no way to provide a custom serializer for these types, either, or is there? I am talking about types like this:

struct CxxPtr{T} <: CxxBaseRef{T}
  cpp_object::Ptr{T}
  CxxPtr{T}(x::Ptr) where {T} = new{T}(x)
  CxxPtr{T}(x::CxxBaseRef) where {T} = new{T}(x.cpp_object)
end

Mind you, I am not trying to nitpick here, I am just trying to understand what the actual requirements are / where they come from.

The only reason you can make it work as-is is because Julia's current precompilation is very incomplete, and you've discovered a way to design the package that happens to work with Julia's current limitations. But when we remove those limitations, your current status won't be viable anymore.

Well, it's not as if we deliberately tried to find a loophole or whatever. We worked with what we had, and many of these things are not documented AFAIK :-/.

If there were a prospect of serializing GAP objects in the short-term I'd be happy to work with you, but given that it sounds super-hard and won't land anytime soon,

If it really has to be done for "all" objects: agreed.

we have to begin by acknowledging the fundamental reality that you'll have to drop some of the things that have currently been skirting the limits of viability.

[precompilation] seems to be pretty important for you, why wouldn't it be for us?

Because you're hardly caching any compiled code, so the only thing precompilation really does for you is save parsing and lowering steps (which are quite fast).

Of course we'd love to cache more. We've been thinking about this for quite some time. BTW, in my tests, GAP.jl is hardly the only problem point; we think CxxWrap is also a major contributor to the precompilation and startup troubles we've been facing (that's not intended as a dig on CxxWrap, mind you, kudos to Bart and others who worked on it).

Here's a comparison between one of "my" packages and Oscar (run on Julia 1.6):

(@v1.6) pkg> activate --temp
  Activating new environment at `/tmp/jl_a2ZhO4/Project.toml`

(jl_a2ZhO4) pkg> add MethodAnalysis, ImageCore, Oscar
<output suppressed>

julia> using MethodAnalysis, ImageCore, Oscar
<output suppressed>

julia> function instances_owned_by(mod::Module)
           # get all child modules of `mod`
           mods = Module[]
           visit(mod) do item
               if item isa Module && parentmodule(item) == mod
                   push!(mods, item)
                   return true
               end
               return false  # don't recurse into Methods, MethodTables, MethodInstances, etc.
           end
           # get all MethodInstances owned by one of the modules in `mods`
           # these are the only MethodInstances that can be precompiled in current versions of Julia
           return filter(methodinstances(mod)) do mi
               m = mi.def
               m isa Method && return m.module ∈ mods
               return m ∈ mods
           end
       end
instances_owned_by (generic function with 1 method)

julia> length(instances_owned_by(ImageCore))
2011

julia> length(instances_owned_by(Oscar))
15

This, despite the fact that Oscar owns 24947 separate methods (wow!) vs ImageCore's 1242. So relatively speaking you're doing about 2500x less precompilation.

Wow, thanks for these data points and that code snippets, that'll be very helpful in getting more things precompiled!

Because as I said: we'd love to do more, to cut down our startup time, which is atrocious as you point out yourself below. As I said, we tried in the past, but usually quickly run into problems, and lacked the know-how to overcome them. Some of them I now understand were already then caused by GAP.jl (though there were no diagnostics then that we knew that would have told us that; I've landed a patch some time ago for Julia to at least tell us when precompilation crashes due to a "foreign type".

But, it's nevertheless true that with precompilation Oscar takes ~16s on my machine to load, and ~30s if you add __precompile__(false). So even the relatively wimpy benefit of saving on parsing and lowering helps you.

Aye. It actually used to be far worse a couple Julia versions before.

Presumably the best way forward would be to split out the GAP-specific parts into a separate package that you load on top of Oscar, leave precompilation of Oscar on, and then set __precompile__(false) on all the parts related to GAP?

That's not really an option, Oscar.jl must remain at the top, though of course in principle we could do something equivalent by moving most "non-GAP" code out of Oscar.jl into a separate package (which could even live in the same repository, in a subdir) and let Oscar.jl depend on that.

I'd still rather avoid this, but it is an option, agreed.

Basically you want to precompile as much as you can (and if you add some precompilation steps to Oscar, especially with #43990 you may substantially decrease the latency for doing "real work"),

That would be super sweet indeed. BTW, thank you for working on all this, and all the tools and packages to help with debugging and improving these things, that's a great service to the community -- I wish I had more time to dig into them more and leverage them to their fullest. I and others working on Oscar will keep trying to do so, but there's only so much time in a day... sigh. Perhaps we can hire someone with expertise on this for a time, we (shameless plug: if you or anyone else here happen to know someone interested in working at a German university on this kind of stuff, we have lots of funding for Phd and Postdoc positions that could work on areas related to this.)

but as soon as you introduce any non-serializable element all of its dependencies have to be __precompile__(false).

So right now, I am still hoping that I can get away with our "loopholes" a bit longer. I have a modified version of GAP.jl which does not trigger the precompilation crash anymore, but I have not yet tested it with your PR. And of course even if it does work, what you say above makes it sound likely that the next variant of the issue may pop up soon... I need to understand it better.

And I would love to have serialization on the long run, too :-)

vchuravy commented 2 years ago

doesn't this also affect other packages which e.g. store pointers to external tools

Yes, and iirc precipitation zeros pointers on serialization. On the top of my head you are not allowed to cache the runtime pointers and have to reconstitute them during init and various packages do so.

fingolfin commented 2 years ago

@vchuravy ah, interesting. I'd like to learn more, is this documented anywhere? I just tried searching https://docs.julialang.org/en/v1/manual/calling-c-and-fortran-code/ and https://docs.julialang.org/en/v1/base/c/ but found nothing (maybe I searched for the wrong things or the wrong pages, though).

vchuravy commented 2 years ago

It is in https://docs.julialang.org/en/v1/manual/modules/#Module-initialization-and-precompilation

Two typical uses of init are calling runtime initialization functions of external C libraries and initializing global constants that involve pointers returned by external libraries. ... Constants involving most Julia objects that are not produced by ccall do not need to be placed in init: their definitions can be precompiled and loaded from the cached module image. This includes complicated heap-allocated objects like arrays. However, any routine that returns a raw pointer value must be called at runtime for precompilation to work (Ptr objects will turn into null pointers unless they are hidden inside an isbits object). This includes the return values of the Julia functions @cfunction and pointer.

Using a @generated function to cache an external pointer lookup is also on the iffy side of things, since you can materialize a pointer out of thin air and then embed in the IR after inlining. (and probably falls under the "non-constant global state may not be observed rule").

fingolfin commented 2 years ago

Thanks for the pointer! I am working on a PR to get rid of the use of @generated in GAP.jl

fingolfin commented 2 years ago

It seems that with GAP.jl from https://github.com/oscar-system/GAP.jl/pull/780 it is possible to precompile Oscar.jl while using PR #43990.

fingolfin commented 2 years ago

We just discussed the serialization of ptrs and no wonder if this is not a serious hidden footgun, a time bomb waiting to explode: code which works fine right now may sometime soon (when more optimization are added) start to crash, namely when suddenly objects containing Ptr instances get serialized which weren't before; the surrounding code then may not be prepared to deal with those pointers becoming NULL (in fact, it may be impossible for it to do so).

So perhaps a serialization interface may become necessary even for regular (mutable) struct types, if they contain pointers? Or at the very least, perhaps a way to mark a struct type as "not serializable", so that attempts to serialize it lead to an error, instead of working and producing weird errors later on. Of course to be effective, ideally it then also would give the user some information about what it was trying to serialize (just like it now at least says that it died because of serializing a "foreign type" -- though even better would be if the full serialization stack could be printed, so that one can find out what triggered it; something I couldn't do in the past, but Tim thankfully helped me with here, thank you again)

timholy commented 2 years ago

Why do all objects need to be serializable? I do see why e.g. objects we are "calling" or otherwise directly referencing from functions that get precompiled need to be serialized.

As you discovered later with your generated function fix, anything that makes it into some aspect of the compiled code needs to be serializable. Unless, as you say, we develop some mechanism to mark certain items as non-serializable.

Well, it's not as if we deliberately tried to find a loophole or whatever. We worked with what we had, and many of these things are not documented AFAIK :-/.

Agreed 100%. Just last week I tried to warm up some code by evaluating into a temporary module (https://github.com/JuliaDebug/JuliaInterpreter.jl/pull/514#issuecomment-1030227282) and would have gone with that had it not thrown an error. You've been required to use the lack of errors as validation of your strategy, and that's not easy (and the errors haven't covered all "dangerous" things).

And wow, very impressive fix to the crash! In turn, I will try to add to #43990 more detail about the source of the error when serializing foreign types. I'm not sure how simple it will be to "unwind the stack" because currently it's not easy to even detect the error until you encounter it and with C there isn't an exception mechanism to rely on. We could leave some breadcrumbs (a stack of "what I'm serializing now") but there might be performance implications. Still, since that would only affect serialization and not deserialization, maybe it would be worth it (it's OK if compilation is a bit slow as long as usage is fast, because most people who aren't the package developer compile rarely and use frequently).

fingolfin commented 2 years ago

Actually may claims may have been premature. I just retested after some sleep and some revising of my quick and dirty PR and now it does crash again :-(. I will check more carefully once I am finished teaching.

Sorry :-(

fingolfin commented 2 years ago

Any hints you might have on how to pinpoint the function/code causing the serializing failure from gdb/lldb would be appreciated .

timholy commented 2 years ago

I'll push a branch that implements a way to unwind the serialization stack. I assume it's only useful if built on #43990, right?

fingolfin commented 2 years ago

Yeah. Thanks!

timholy commented 2 years ago

OK, there's a branch called teh/serialization_errors. If you build it and launch in "regular" mode, here's what it does with Oscar v0.7.1:

julia> using Oscar
[ Info: Precompiling Oscar [f1435218-dba5-11e9-1e4d-f1a5fab5fc13]
Serialization error encountered. Here is a stack of Methods, MethodInstances, CodeInstances, and method roots (as an svec(index, root)):
Array{Any, (6,)}[
  reinterpret(Type{Ptr{Nothing}}, GAP.FFE) from reinterpret(Type{T}, Any) where {T},
  _JULIA_TO_GAP(GAP.FFE) from _JULIA_TO_GAP(GAP.FFE),
  _call_gap_func(GAP.GapObj, GAP.FFE) from _call_gap_func(GAP.GapObj, Any),
  String(GAP.FFE) from String(Any),
  String(Any),
  svec(3, GAP.GapObj())]
Current item is at end.
fatal: error thrown and no exception handler available.
ErrorException("Cannot serialize instances of foreign datatypes")
ijl_error at /home/tim/src/julia-master/src/rtutils.c:41
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:995
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:609 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:785
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:609 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:841
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:609 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:664
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:609 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:864
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:609 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:664
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:609 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:864
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:609 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:664
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:609 [inlined]
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:864
jl_serialize_value_ at /home/tim/src/julia-master/src/dump.c:609 [inlined]
serialize_htable_keys at /home/tim/src/julia-master/src/dump.c:1049 [inlined]
ijl_save_incremental at /home/tim/src/julia-master/src/dump.c:2590
jl_write_compiler_output at /home/tim/src/julia-master/src/precompile.c:65
ijl_atexit_hook at /home/tim/src/julia-master/src/init.c:207
jl_repl_entrypoint at /home/tim/src/julia-master/src/jlapi.c:707
main at /home/tim/src/julia-master/cli/loader_exe.c:59
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/tim/src/julia-master/usr/bin/julia (unknown line)
ERROR: Failed to precompile Oscar [f1435218-dba5-11e9-1e4d-f1a5fab5fc13] to /home/tim/.julia/compiled/v1.8/Oscar/jl_beUYcF.

The last item on the list merits some additional comment: as the header explains, that's the sign of a method root. The interpretation is that it errorred while serializing a method root, which came from a method (String(Any)), which was serialized from a MethodInstance (String(GAP.FFE) from String(Any)), which was serialized from another MethodInstance (as a backedge) (_call_gap_func(GAP.GapObj, GAP.FFE) from _call_gap_func(GAP.GapObj, Any)), and so on.

You can see the roots of a method with m.roots, and you can see what roots are attached to a given function even on that branch if you launch Julia with --compiled-modules=no. So you could load Oscar and look at the whole list if you wanted. The first item is the C index, so that root is the fourth one:

julia> m = which(Oscar.GAP.Wrappers.String, (Any,))
String(x) in GAP.Wrappers at /home/tim/.julia/packages/GAP/wF2n6/src/wrappers.jl:61

julia> m.roots
6-element Vector{Any}:
 GAP
 :_call_gap_func
 MethodInstance for GAP._call_gap_func(::GAP.GapObj, ::GAP.FFE)
 GAP: <Attribute "String">
 typeof(GAP._call_gap_func) (singleton type of function _call_gap_func, subtype of Function)
 Symbol("/home/tim/.julia/packages/GAP/wF2n6/src/wrappers.jl")

Note that only "new" roots are being added to this debug stack. New roots are those added after the module was defined: that root was added after the GAP module was closed, because some downstream package forced compilation of that MethodInstance. Details are probably most thoroughly described in #42016.

fingolfin commented 2 years ago

Thanks @timholy I'll try. BTW Oscar is now at 0.8.0 (not that it'll make a difference)

Actually, it (Oscar master with my GAP.jl PR deved) does precompile fine with that PR. But since I already said that once, and then said it again, I'll test some more. I was switching between various Julia dev versions at that time, perhaps some precompilation cache was borked somewhere? I'll try to force it to precompile and run the test suite and some more

fingolfin commented 2 years ago

I've discussed this a bit with @timholy on Slack, and did rm -Rf ~/.julia/compiled/v1.8/* to make sure I really get everything precompiled from scratch -- and it worked.

I've now released GAP.jl 0.7.6. Once that is in the registry, @timholy could rerun his tests for PR #43990 with latest Oscar.jl & GAP.jl, and hopefully it'll pass now.