JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

ccall hangs on windows #5323

Closed tknopp closed 10 years ago

tknopp commented 10 years ago

I am having a strange issue with ccall on Windows 7 (64-bit, Julia 0.2). I am using a National Instruments data acquisition card and am trying to call a function of its API via ccall (http://zone.ni.com/reference/en-XX/help/370471W-01/daqmxcfunc/daqmxresetdevice/). When I do this from C, or from Python using ctypes, it works fine. When I use ccall, the function hangs and does not return. Even when I call the function indirectly via a self-built DLL the function hangs, while there is no issue when using that DLL from C/Python.

I really have no idea what's going on there, or whether this is a Julia issue at all. Can it be that there are problems when calling non-MinGW DLLs from a Julia that was built with MinGW?

quinnj commented 10 years ago

Is it a calling convention problem? I know with ODBC and other dlls that get tight with windows internals, the calling convention has to be specified stdcall, right after the (:function, lib), tuple. E.g. from ODBC

function SQLFetchScroll(stmt::Ptr{Void},fetch_orientation::Int16,fetch_offset::Int)
    @windows_only ret = ccall( (:SQLFetchScroll, odbc_dm), stdcall, 
        Int16, (Ptr{Void},Int16,Int), 
        stmt,fetch_orientation,fetch_offset) 
    @unix_only ret = ccall( (:SQLFetchScroll, odbc_dm),
            Int16, (Ptr{Void},Int16,Int), 
            stmt,fetch_orientation,fetch_offset) 
    return ret
end

tknopp commented 10 years ago

I don't think so. The calling convention that has to be used is "stdcall" and I use that in the ccall. Furthermore, in the "self-written" DLL, which is actually a released product, the NI DLL is lazily loaded and the stdcall calling convention is used there as well.

Unfortunately this is a hard problem to debug, as the NI DLL is closed source. I have attached to a debug build of the "self-written" DLL using Visual Studio and can step until the call to the NI function, which then hangs deep in some MS API. Again, if I do the same from a C program there is no issue.

ihnorton commented 10 years ago

Did you try compiling your self-built DLL with MinGW? Also, there are a number of postings on the NI list about using the library with MinGW, which may apply to Julia as well (e.g. how to extract libs).

I've also seen mention of two types of error handling for NIDAQMX - "simple" and "general". I didn't find a description of what these actually mean, but this could be causing or at least masking the problem. For example, if the NI code is throwing a C++ exception that is not supported by the runtime. [edit: although one would assume that the C interface does not throw C++ exceptions, it could still be written in C++ underneath and conflict with the runtime]

tknopp commented 10 years ago

Unfortunately, building the self-built DLL with MinGW is not possible, as this is a larger project.

But maybe this really is an issue with different C++ stdlibs. That would indicate a larger problem on Windows, though, as I would assume that DLLs built with MSVC/Intel are the standard on Windows and not the exception. Maybe it's necessary to compile Julia with MSVC then.

ihnorton commented 10 years ago

Well, how about a very simple stub C code that just makes a few calls like the one that hangs from Julia, and try compiling that with MinGW. If you can get that to work (perhaps using the extraction technique I linked) then it may help to isolate the issue.

tknopp commented 10 years ago

Yes, that seems to be the best idea. I will install MinGW and try that. A further check is to compile an MSVC DLL with the stdcall convention and see if this fails. The issue seems to be that the stdcall conventions of MinGW and MSVC are not(!) compatible.

tknopp commented 10 years ago

I am not sure if this is the same issue but the following code, which relies only on kernel32.dll crashes:

buf = zeros(Uint8,1024); ccall( (:GetComputerNameA,"Kernel32"), stdcall, Int32, (Ptr{Uint8}, Int32), buf, 1024)

Looking at http://llvm.org/docs/doxygen/html/namespacellvm_1_1CallingConv.html#a4f861731fc6dbfdccc05af5968d98974 there seems to be a flag X86_64_Win64. Might it be that this needs to be used in ccall on 64-bit Windows?

JeffBezanson commented 10 years ago

cc @vtjnash

vtjnash commented 10 years ago

Win64 only has one calling convention

Type sig for your last example is wrong -- second parameter needs to be Ptr{Int}, not Int32

Will look more tonight
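The point about Win64 can be illustrated from the C side. The following is only a sketch under stated assumptions: `SKETCH_STDCALL`, `reset_fn_t`, `call_reset` and `dummy_reset` are hypothetical names, not part of any NI or Julia header. It shows the common pattern of applying `__stdcall` only on 32-bit Windows, since 64-bit Windows uses a single calling convention and the keyword has no effect there.

```c
#include <stddef.h>

/* __stdcall only changes code generation on 32-bit x86 Windows.
 * On Win64 (and on non-Windows platforms) the macro expands to nothing.
 * SKETCH_STDCALL is a hypothetical name used for illustration. */
#if defined(_WIN32) && !defined(_WIN64)
#  define SKETCH_STDCALL __stdcall
#else
#  define SKETCH_STDCALL
#endif

/* Hypothetical function-pointer type for a "reset device" style entry point. */
typedef int (SKETCH_STDCALL *reset_fn_t)(const char *device);

/* Calls through the pointer, returning -1 if no function was resolved. */
int call_reset(reset_fn_t fn, const char *device)
{
    if (fn == NULL)
        return -1;
    return fn(device);
}

/* Stand-in implementation so the sketch can be exercised without hardware. */
int SKETCH_STDCALL dummy_reset(const char *device)
{
    return device != NULL ? 0 : 1;
}
```

With this pattern, `call_reset(dummy_reset, "Dev3")` compiles unchanged for 32-bit and 64-bit targets; only the 32-bit Windows build actually emits a stdcall call.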

tknopp commented 10 years ago

Sorry for that, using the correct second parameter works. So it seems to be no general issue with non-mingw libs.
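The crash in the GetComputerNameA example comes from passing a length by value where the API expects a pointer to a length: the function reads the buffer size through that pointer and writes the resulting length back. A minimal portable sketch of this in/out pattern follows. `get_name_sketch` and the value "EXAMPLE-PC" are made up for illustration; this is a stand-in modelled on the documented behaviour, not the real Win32 call.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an API with an in/out length parameter,
 * in the style of GetComputerNameA(LPSTR, LPDWORD).
 * On entry *size is the buffer capacity; on exit it holds the number of
 * characters written (success) or the capacity required (failure). */
int get_name_sketch(char *buf, uint32_t *size)
{
    static const char name[] = "EXAMPLE-PC";      /* made-up value */
    uint32_t needed = (uint32_t)sizeof name;      /* includes the NUL */

    if (buf == NULL || size == NULL || *size < needed) {
        if (size != NULL)
            *size = needed;   /* report how much room is required */
        return 0;             /* failure */
    }
    memcpy(buf, name, needed);
    *size = needed - 1;       /* characters written, excluding the NUL */
    return 1;                 /* success */
}
```

A caller has to pass the address of a size variable (`uint32_t n = sizeof buf; get_name_sketch(buf, &n);`). Passing the integer 1024 itself, as in the original ccall signature, would make the callee treat 1024 as an address and dereference it.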

vtjnash commented 10 years ago

@tknopp do you have access to a recent 0.3 build? I'm thinking that https://github.com/JuliaLang/julia/commit/524e305872f4a0946a283bc75308c13462432285 (and followup commits to debuginfo.cpp) may have fixed this. Also, can you post the ccall line you are using?

tknopp commented 10 years ago

Is there some nightly build of Julia? I have not yet set up MinGW.

tknopp commented 10 years ago

I use the ccall

ccall( (:DAQmxResetDevice, "C:\\Windows\\System32\\nicaiu.dll"), stdcall, Int32, (Ptr{Uint8},),"Dev3")

The call hangs no matter which calling convention I use.

When I call DAQmxResetDevice indirectly, I have a library "MyLib.dll" with a function doMeasurement that lazily loads DAQmxResetDevice using GetProcAddress. When I attach to "julia-readline", doMeasurement is called fine, but when it invokes the (valid) function pointer to DAQmxResetDevice the call hangs. When I break, I get the following call stack:

ntdll.dll!00000000779d15fa()    
[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll] 
KernelBase.dll!000007fefdb11203()   
nipalu.dll!0000000006c930b2()   
nipalu.dll!0000000006c92d41()   
nidmxfu.dll!0000000011668850()  
nidmxfu.dll!0000000011668258()  
nidmxfu.dll!0000000011668dfb()  
nidmxfu.dll!00000000117a6327()  
nidmxfu.dll!000000001178a837()  
nidmxfu.dll!00000000117bf17e()  
nidmxfu.dll!000000001178408e()  
nidmxfu.dll!000000001178453c()  
nidmxfu.dll!0000000011755445()  
nidmxfu.dll!0000000011752945()  
nidmxfu.dll!0000000011689e1a()  
nidmxfu.dll!00000000118f65ca()  
nicaiu.dll!00000001802648ac()

So it is at least in the right DLL. I really cannot see why this has anything to do with Julia, but doing it from C or Python (using ctypes) works without issues.

tknopp commented 10 years ago

I have tried @ihnorton's suggestion and compiled a small C++ program with MinGW, and this works fine:

#define WIN32_LEAN_AND_MEAN
#include <windows.h>

typedef int (__stdcall* ResetDevice_t)(const char[]);

int main()
{
  HMODULE dll_module = LoadLibraryA("nicaiu.dll");

  ResetDevice_t _ResetDevice = reinterpret_cast<ResetDevice_t>(GetProcAddress(dll_module, "DAQmxResetDevice"));

  int result = (*_ResetDevice)("Dev3");  
}

vtjnash commented 10 years ago

@staticfloat did you have a bleeding edge version that can be uploaded to julialang?

@JeffBezanson are you using 0.2 for IAP? I can back-port this fix to the release branch if so.

tknopp commented 10 years ago

I am currently trying to build Julia master with MinGW using the README.windows, but make fails early with: make: *** No rule to make target `/home/tknopp/julia/usr/bin', needed by `release'. Stop.

Unfortunately I am not a make expert.

tknopp commented 10 years ago

OK, it now starts to compile. It seems that in step 5 of README.windows one has to replace the make.exe in C:/mingw-builds/msys/bin rather than putting it into mingw64\bin. Further, I had to replace "C:/" with "/C/".

JeffBezanson commented 10 years ago

It's generally good to back-port any pure bug fixes, if it doesn't take too much effort.

tknopp commented 10 years ago

I still have not got Julia compiled under Windows using MinGW, due to a crash when building LLVM (tblgen.exe crashes). So if someone has a recent build of Julia, that would be great.

ihnorton commented 10 years ago

@tknopp I started a cross-compile this morning but had to run out for work. I can send you a binary when I get home tonight (EST) ..unless somebody else has one.

staticfloat commented 10 years ago

@vtjnash My windows builds still die due to #5142

vtjnash commented 10 years ago

@JeffBezanson that patch is not quite a pure bug fix, and it is not entirely easy to backport. However, this may tip the balance in favor of making it happen.

ihnorton commented 10 years ago

I sent @tknopp a build to test; can put it up somewhere if there is interest.

tknopp commented 10 years ago

@ihnorton Thanks! But where did you send this to?

ihnorton commented 10 years ago

@tknopp your googlemail from the mailing list... re-sent just now.

tknopp commented 10 years ago

Thanks Isaiah. My findings:

vtjnash commented 10 years ago

@tknopp does your same C program work when run against the Julia master version from ihnorton?

tknopp commented 10 years ago

@vtjnash: Yes, calling jl_eval_string from a C program compiled with the Intel compiler, using the MinGW libjulia.dll from @ihnorton, works. This is so weird. This basically leaves the REPL, but I cannot think of anything there that would cause this.

vtjnash commented 10 years ago

the repl uses multiple tasks, your example code presumably does not. it may be helpful to load the symbol table for ntdll.dll in your debugger so that you get a valid stack trace

tknopp commented 10 years ago

When I load symbols I get the following for the windows libraries

ntdll.dll!ZwDelayExecution() + 0xa bytes
KernelBase.dll!SleepEx() + 0xb3 bytes

I don't know if I could get debug symbols for the NI libs.

You said that repl uses multiple tasks. But I thought that this is serialized code. Or are there threads involved anywhere?

vtjnash commented 10 years ago

Am I correct in assuming that your Intel compiled code was 64-bit and using the native platform setjmp? If so, that would explain why the REPL doesn't work (or tasks, or exception handling, for that matter). I hadn't really paid attention to the Windows.mk file, but I see it doesn't seem to list the setjmp/longjmp assembly files.
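For readers unfamiliar with why setjmp matters here: the mechanism is a saved execution context that a later longjmp can jump back to, which is the kind of non-local control flow that exception handling and task switching build on. The sketch below uses only the standard C `setjmp`/`longjmp`; the names `recovery_point`, `fail_now` and `run_guarded` are hypothetical, and this is not Julia's implementation (which, per the julia.h excerpts in this thread, wraps these in jl_setjmp/jl_longjmp on Windows).

```c
#include <setjmp.h>

static jmp_buf recovery_point;   /* saved execution context */

/* Hypothetical callee that "throws" by jumping back to the saved context. */
static void fail_now(void)
{
    longjmp(recovery_point, 1);  /* does not return to its caller */
}

/* Returns 0 if the protected region completed, 1 if it was abandoned. */
int run_guarded(int should_fail)
{
    if (setjmp(recovery_point) == 0) {
        /* First return from setjmp: context saved, run the region. */
        if (should_fail)
            fail_now();
        return 0;
    }
    /* Second return from setjmp: we arrived here via longjmp. */
    return 1;
}
```

If the setjmp that saved the context and the longjmp that restores it come from different, incompatible implementations, the restored context may be invalid, which is why a mismatch between the native and the bundled versions could plausibly break exceptions and tasks.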

tknopp commented 10 years ago

@vtjnash This is funny. When trying to get MSVC to work I asked myself why I had no issues with setjmp when compiling with the Intel compiler. The reason is that julia.h has changed and is now wrong:

#if defined(_OS_WINDOWS_)
#if defined(_COMPILER_MINGW_)
int __attribute__ ((__nothrow__,__returns_twice__)) jl_setjmp(jmp_buf _Buf);
__declspec(noreturn) __attribute__ ((__nothrow__)) void jl_longjmp(jmp_buf _Buf,int _Value);
#else
int jl_setjmp(jmp_buf _Buf);
void jl_longjmp(jmp_buf _Buf,int _Value);
#endif
#define jl_setjmp_f jl_setjmp
#define jl_setjmp_name "jl_setjmp"
#define jl_setjmp(a,b) jl_setjmp(a)
#define jl_longjmp(a,b) jl_longjmp(a,b)
#else
// determine actual entry point name

while about 1-2 months ago it was:

#if defined(_OS_WINDOWS_)
#if defined(_COMPILER_MINGW_)
int __attribute__ ((__nothrow__,__returns_twice__)) jl_setjmp(jmp_buf _Buf);
__declspec(noreturn) __attribute__ ((__nothrow__)) void jl_longjmp(jmp_buf _Buf,int _Value);
#define jl_setjmp_f jl_setjmp
#define jl_setjmp(a,b) jl_setjmp(a)
#define jl_longjmp(a,b) jl_longjmp(a,b)
#else
#define jl_setjmp_f setjmp
#define jl_setjmp(a,b) setjmp(a)
#define jl_longjmp(a,b) longjmp(a,b)
#endif
#else
// determine actual entry point name

But this is more a side note. As both the libjulia compiled with MSVC and Intel work in my C program, the issue seems to be something else.

tknopp commented 10 years ago

@vtjnash, @ihnorton: I have found a solution to this problem: Upgrade the DAQmx lib to the recent one. This is of course not completely satisfying but if this is a multithreading issue in that lib I can't see that we can do anything about that.

tknopp commented 10 years ago

Ok, it worked after upgrading the installer but before restarting windows. Now after restart it hangs again.

timholy commented 10 years ago

What happens when you ccall the stub function suggested by @ihnorton? Perhaps something like this

int myDAQmxResetDevice(char *name)
{
    printf("About to reset device for %s\n", name);
    fflush(stdout);
    int status = DAQmxResetDevice(name);
    printf("status = %d\n", status);
    fflush(stdout);

    return status;
}

and ccall it; it might give you a clue about when the hang is happening.
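The same trace-before-and-after idea can be tried without NI hardware by wrapping any function pointer. This is a sketch only: `device_fn`, `traced_call` and `fake_reset` are hypothetical names, and the signature (one C string in, `int` status out) is assumed from the stub above.

```c
#include <stdio.h>

/* Signature shared by the real call and any test double (assumed:
 * one C-string argument, int status, as in the stub above). */
typedef int (*device_fn)(const char *name);

/* Logs to stderr (typically unbuffered) before and after the call, so the
 * last line printed shows whether a hang happens inside fn itself. */
int traced_call(const char *label, device_fn fn, const char *name)
{
    fprintf(stderr, "[%s] about to call with \"%s\"\n", label, name);
    int status = fn(name);
    fprintf(stderr, "[%s] returned status = %d\n", label, status);
    return status;
}

/* Hypothetical test double standing in for DAQmxResetDevice. */
int fake_reset(const char *name)
{
    return (name != NULL && name[0] != '\0') ? 0 : -1;
}
```

If only the "about to call" line appears, the hang is inside the wrapped function; if both lines appear, the problem lies after the call returns.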

vtjnash commented 10 years ago

Can you try again now that the Scheduler has been removed?

tknopp commented 10 years ago

Yes will try tomorrow when I have a device at hand (provided that the nightly is working).

tknopp commented 10 years ago

Actually, I cannot find a recent nightly. I thought that there were download links at http://status.julialang.org/. The prerelease from http://julialang.org/downloads/ is 4 days old and thus does not include the scheduler removal.

tknopp commented 10 years ago

Just an update that this still does not work with the recent 0.3 prereleases. I have tested both 32bit and the 64bit version.

ihnorton commented 10 years ago

I have tested this with a NI USB a/d device and the latest NIDAQ library, and see the same result.

I compiled Tim's code fragment under MSVC, and then ran Julia under the Visual Studio debugger. Obviously this is useless for Julia, but allows setting a breakpoint in my DLL. As soon as I step through the DAQmxResetDevice call, the Julia window is raised again - the call never returns, even when I interrupt Julia (which works fine). I've been able to step into disassembly of the NI libraries for a while, back and forth in to critical sections, etc. but haven't reached where it is giving control back to Julia.

vtjnash commented 10 years ago

As a workaround, you could try switching to DAQmxBase. While the documentation often tries to redirect you into using DAQmx, my experience was that the base version was more cross-platform. I was not sure if they actually even share the same codebase. In my testing, the DAQmx code appeared to use a worker thread, and was best suited for interaction through their GUI, whereas DAQmxbase was better suited for accessing from C (the cross-platform ability was also a requirement for me at the time).

If you run code without the repl, does it work? Perhaps the julia interrupt signal handlers are getting in the way.

tknopp commented 10 years ago

Thanks @ihnorton so much for testing this. This at least gives me evidence that I am not crazy seeing this bug :-) In my code base the DAQmxResetDevice is much deeper located in several function calls and I was also able to step until DAQmxResetDevice where it hangs.

Currently this issue is not super important for me. But still given that Python using ctypes does not face this issue it would be nice to determine whether there is some fundamental problem in Julia. Something within Julia has to trigger it.

ihnorton commented 10 years ago

Regarding Jameson's theory about threads, what I observed was four new threads being spawned after entering DAQmxResetDevice, three of them waiting and ending, and then there was still one left when control jumped back to Julia. I will try building a version without signal handlers, and also using Visual Studio 2012 which seems to have some better multi-threaded debugging support than 2008.

tknopp commented 10 years ago

Yeehaa. I just downloaded the latest Windows binary and the issue has disappeared. I tried some weeks ago and it was still there. @ihnorton: Maybe you could confirm? I am closing this now as I am not seeing it anymore. Thanks to whoever fixed it :-)

vtjnash commented 10 years ago

Perhaps this too was related to the openblas permissions/leaked handle issue?