EnzymeAD / rust

A Rust fork to work towards Enzyme integration
https://www.rust-lang.org

Reproducibility between rustc and opt #130

Closed ZuseZ4 closed 5 months ago

ZuseZ4 commented 5 months ago

We have an issue where we are unable to reproduce, through opt, a bug that shows up in rustc+Enzyme. I have a flag ENZYME_PRINT_MOD_BEFORE which prints the whole module LLVM IR (after fat LTO etc.) just before calling Enzyme, so we were expecting that writing a dummy function that calls __enzyme_autodiff and running the opt plugin on the same LLVM IR would give the same error. For which reasons could this fail? I was wondering about adding a flag that dumps all the type trees (TT) for the outermost function into the IR (like we already do for mem* functions). Those are otherwise passed directly to Enzyme and are thus not part of the IR. Could there be another reason why we fail to reproduce bugs?

wsmoses commented 5 months ago

You can export the type trees directly onto arguments as attributes, so there is no additional data added from the API call that is not otherwise in the module.

ZuseZ4 commented 5 months ago

Agree, that's what I meant with:

I was wondering about adding a flag that dumps all the TT for the outermost function into the IR (like we do it with mem* functions already).

My question was: are there other reasons why the ABI could differ from that of the opt pass?

wsmoses commented 5 months ago

I mean, I wouldn't even bother with a flag; just always pass it as a fn attribute, don't bother with the API version, and then you can be sure that's not it.

wsmoses commented 5 months ago

I think the other big question is ensuring both the same optimization and analysis state is available.

LLVM registers some analyses as available, so that could be one cause if opt doesn't register the same ones before running passes.

ZuseZ4 commented 5 months ago

Didn't you say that there was a huge compile-time penalty for this (IIRC on the Julia side), or was that another case? Yes, if compile times are the same I'd change it unconditionally.

wsmoses commented 5 months ago

There was a huge compile time benefit.

ZuseZ4 commented 5 months ago

Rip. Then I'll try that and come back if we still get different results. Thanks.

wsmoses commented 5 months ago

Specifically, it, alongside the no-TA-recur flag, enables a mostly lossless version of the TA info, except that instead of doing a very expensive interprocedural fixed point across all values in all functions, it is now factored to only need to reach a fixed point within a single function (for the most part).

It's not the be-all and end-all for compile times, but it means big programs should no longer have catastrophic TA runtime purely by virtue of being big. We still need to port this into C++, though.

ZuseZ4 commented 5 months ago

Naive question, but if passing TT in the IR is so much faster, why isn't Enzyme just internally adding all TT that it receives through API calls into fn attributes in the LLVM IR?

wsmoses commented 5 months ago

The perf speedup isn't that attributes are faster than the API (it's actually slightly slower right now, since we parse the string into integers). The benefit is that if you attribute everything ahead of time:

1) You only need to recurse through language types once (the parse-Julia-type-into-Enzyme-TT step was expensive, so that cost is now saved across all invocations).

2) Custom API callbacks in type analysis are much more expensive than the analysis just running without a plugin.

3) And this is the most critical one: by putting the full info at each fn, you can get almost all the info you need without having to run interprocedurally. Imagine each function-level TA takes time n without considering sub-calls. If getting new info means a sub-call would have more info, it would propagate it to its sub-calls, etc., so in the worst case you could have a ping-pong between some very nested call data structures, multiplying that n by a terrible asymptotic factor (n! maybe, I'd need to think).


wsmoses commented 5 months ago

To be clear, there are still possible bad-perf failure modes here, but now they're limited to at most the size of a function instead of the entire program (hence the claim that large applications don't suffer as a result).


wsmoses commented 5 months ago

Also, I will note that effective usage of this requires the language to provide full type trees for all args and returns of all functions (useful even with this off, but with interprocedural analysis on, it would spend time to possibly derive this even if not specified).

We do this for Julia; I just haven't gotten to doing that for C++ yet.


ZuseZ4 commented 4 months ago

@wsmoses Thanks for the extensive list. I should now have the TT only in the IR (which also makes it much easier to test).

; Function Attrs: noinline nonlazybind sanitize_hwaddress uwtable
define dso_local "enzyme_type"="{}" void @ffff(ptr align 8 "enzyme_type"="{[-1]:Pointer, [-1,0]:Integer, [-1,1]:Integer, [-1,2]:Integer, [-1,3]:Integer, [-1,4]:Integer, [-1,5]:Integer, [-1,6]:Integer, [-1,7]:Integer, [-1,8]:Pointer, [-1,8,-1]:Float@double, [-1,16]:Integer, [-1,17]:Integer, [-1,18]:Integer, [-1,19]:Integer, [-1,20]:Integer, [-1,21]:Integer, [-1,22]:Integer, [-1,23]:Integer}" %0, ptr align 8 "enzyme_type"="{[-1]:Pointer, [-1,-1]:Float@double}" %1) unnamed_addr #15 personality ptr @rust_eh_personality !dbg !1312 {

The only issue is that now Enzyme seems to throw an exception:

fatal runtime error: Rust cannot catch foreign exceptions
warning: `ad` (bin "ad") generated 1 warning
error: could not compile `ad` (bin "ad"); 1 warning emitted

Presumably because I pass the wrong things to the API. I tried passing both null pointers here, as well as empty TT:

    let mut args_tree = vec![TypeTree::new().inner; input_tts.len()];
    let ret_tt = TypeTree::new().inner;
    let dummy_type = CFnTypeInfo {
        Arguments: args_tree.as_mut_ptr(),
        Return: ret_tt.clone(),
        KnownValues: known_values.as_mut_ptr(),
    };

What is the correct value here now, @wsmoses?