ericsink / SQLitePCL.raw

A Portable Class Library (PCL) for low-level (raw) access to SQLite
Apache License 2.0
512 stars 106 forks source link

Incorrect calling convention in UnmanagedCallersOnly #506

Closed filipnavara closed 1 year ago

filipnavara commented 1 year ago

We are using the sqlite3_config_log API. Apparently somewhere between 2.0.7 and 2.1.0 you switched to use UnmanagedCallersOnly. Unfortunately, this means the decorated callback now defaults to the platform calling convention which is stdcall on Windows. The native library (e_sqlite3) expects cdecl though. This mismatch causes access violation once the registered callbacks returns back to the native code.

ericsink commented 1 year ago

Er, my bad.

Looks like I can just do something like:

[UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvCdecl) })]
filipnavara commented 1 year ago

I expect that to be the solution but I didn't get to try it yet.

ericsink commented 1 year ago

FWIW, version 2.1.1-pre20220822172036 contains this fix. If there are no problems with it, I will publish it as a non-pre release next week. Reopening this issue until I ship a non-pre-release containing its fix.

ericsink commented 1 year ago

AFAICT, the fix made in #507 was intended to address a problem that is Windows-specific, but it has caused iOS platforms to fail completely during AOT, probably due to a bug in the Mono AOT compiler. That bug has been logged at dotnet/runtime#75700.

What probably needs to happen here is to make the necessary fix to calling convention be Windows-specific.

In the meantime, the fix is a case where the cure is worse than the disease. I plan to revert, so I am reopening this bug.

filipnavara commented 1 year ago

That's quite unfortunate since that will make it unusable for us on Windows again :-/ I will try to see if I can expedite the Mono fix.

ericsink commented 1 year ago

@filipnavara Aren't you on iOS as well? I mean, the change fixes Windows, but it completely breaks iOS. I sort of expected you to be affected on both sides. If the change somehow did not break iOS builds for you, then I'm misunderstanding something.

I'll try today to rework your fix to take place on Windows only. The T4 stuff already generates several variants of things, so it's not too hard to isolate what you need.

filipnavara commented 1 year ago

It's complicated...

We are on legacy Xamarin.iOS which uses different flavor of SQLitePCL.raw that doesn't suffer from the issue.

For a moment we were running some forked builds but I really want to avoid that if possible. First, we could not switch to official builds because we needed the macOS fix for ARM64 varargs calling convention. Once that shipped we could not upgrade because Windows was blocked by this issue (builds but crashes at runtime). Now we are basically on 2.1.1 with our fork abandoned.

We do also have internal builds on net6.0-ios but they are not up to date with the X.iOS branches due to miscellaneous issues so we would not hit the SQLitePCL.raw breakage yet.

ericsink commented 1 year ago

Ah. That makes sense. Yes, quite "complicated".

I'm working now on reworking your fix to avoid the net6.0-ios problem.

filipnavara commented 1 year ago

Thanks a lot. I really appreciate it.

ericsink commented 1 year ago

@filipnavara Are you using the winsqlite3 provider?

(I'm pretty sure you're not.)

ericsink commented 1 year ago

2.1.2-pre20220916165053 is available on nuget and contains the revised version of the fix for this issue.

ericsink commented 1 year ago

@filipnavara One other question about this issue:

There is a test case for sqlite3_config_log:

https://github.com/ericsink/SQLitePCL.raw/blob/9adcbfce7ba69450d3173f2d129a468023279042/src/common/tests_xunit.cs#L2850-L2873

but for some reason, it didn't fail back when I made the switch to use UnmanagedCallersOnly. It should have caught the failure you have been seeing, but it did not.

I have verified that this test is being included in test runs, and that it passes on Windows.

Any idea how to modify that test (or add another one) to make a verification for this issue?

filipnavara commented 1 year ago

Are you using the winsqlite3 provider?

We use the e_sqlite3 one.

Hmm, I'll have to check about the test. I remember it called the callback and trashed the stack after. It would be interesting to see what happens if sqlite3_log is called twice.

ericsink commented 1 year ago

FWIW, I tried modifying the test to call sqlite3_log multiples times, but I still couldn't get it to fail (with the CallConv stuff disabled).

I committed the changes to that test at b166926, but I messed up the commit message and referenced the wrong bug number instead of this one.

filipnavara commented 1 year ago

One more thought, do the tests run only as 64-bit target or does it test win-x86 too? We observed it on 32-bit process which actually has different calling conventions.

ericsink commented 1 year ago

Ah, 32-bit. That's probably the difference. The test suites have not been running with win-x86.

And in fact, they don't pass. In fact, there are multiple problems, and even though your calling convention fix is apparently sufficient for your situation, it does not seem to be enough to get all my tests to pass.

So, I've apparently got some work to do for 32-bit.

When I release 2.1.2, I'll probably close this issue and open new one(s) for 32-bit problems.

filipnavara commented 1 year ago

In retrospective that makes sense 😅 x86 is the weird one with three major calling conventions, x64 and arm64 both use only one (which differs per platform but not per method).

ericsink commented 1 year ago

@filipnavara So... for your Windows builds, you're using a target framework of net6.0-windows (or similar), rather than just net6.0, right?

When I tweaked your calling convention fix to be effective only on Windows, I used some existing stuff, and so that version of the provider only gets used for a net6.0-windows TFM. So, above, when I said I tried running those fixes on win-x86, and found that lots of things were broken, I was building with net6.0, so the calling convention fixes were not actually in play. But, if I run the test suite targeting net6.0-windows, everything passes.

I might need to tweak the nuspec to make that version of the provider assembly match net6.0 when the RID is win-x86.

filipnavara commented 1 year ago

So... for your Windows builds, you're using a target framework of net6.0-windows (or similar), rather than just net6.0, right?

Correct. net6.0-windows TFM and win-x86 RID for the executable project.

ericsink commented 1 year ago

Fixed in 2.1.2, which is now on nuget. Also logged #518 as a leftover.