EnzymeAD / rust

A rust fork to work towards Enzyme integration
https://www.rust-lang.org

reproducibility between rustc and opt #130

Closed: ZuseZ4 closed this 5 days ago

ZuseZ4 commented 5 days ago

We have an issue where we aren't able to reproduce a bug that shows up in rustc+enzyme through opt. I have a flag, ENZYME_PRINT_MOD_BEFORE, which dumps the whole module's LLVM IR (after fat LTO etc.) just before calling Enzyme, so we expected that writing a dummy function that calls __enzyme_autodiff and running the opt plugin on that same IR would give the same error. For what reasons could this fail? I was wondering about adding a flag that dumps all the type trees (TT) for the outermost function into the IR (like we already do for the mem* functions). Those are otherwise passed directly to Enzyme and are thus not part of the IR. Could there be another reason why we fail to reproduce bugs?

wsmoses commented 5 days ago

You can export the type trees directly onto the arguments as attributes, so there is no additional data coming from the API call that isn't otherwise in the module.

ZuseZ4 commented 5 days ago

Agree, that's what I was suggesting with:

> I was wondering about adding a flag that dumps all the TT for the outermost function into the IR (like we already do for the mem* functions).

My question was: are there other reasons due to which the ABI could differ from the opt pass?

wsmoses commented 5 days ago

I mean, I wouldn't even bother with a flag: just always pass it as a fn attribute, don't bother with the API version, and then you can be sure that's not it.

wsmoses commented 5 days ago

I think the other big question is ensuring that the same optimization and analysis state is available in both.

LLVM registers some analyses as available, so that could be one cause if opt doesn't register the same ones before running passes.

ZuseZ4 commented 5 days ago

Didn't you say that there was a huge compile-time penalty for this (IIRC on the Julia side), or was that in another case? Yes, if compile times stay the same I'd change it unconditionally.

wsmoses commented 5 days ago

There was a huge compile time benefit

ZuseZ4 commented 5 days ago

Rip. Yes, then I'll try that and come back if we still get different results. Thx.

wsmoses commented 5 days ago

Specifically, it, alongside the no-TA-recur flag, enables a mostly lossless version of the TA info: instead of running a very expensive interprocedural fixed point across all values in all functions, the analysis is now factored so it only needs to reach a fixed point within a single function (for the most part).

It's not the be-all and end-all of compile times, but it means big programs should no longer have catastrophic TA runtime purely by virtue of being big. We still need to port this into C++, though.

ZuseZ4 commented 5 days ago

Naive question, but if passing TT in the IR is so much faster, why isn't Enzyme just internally adding all the TT it receives through API calls as fn attributes in the LLVM IR?

wsmoses commented 5 days ago

The perf speedup isn't that attributes are faster than the API (it's actually slightly slower right now, since we parse the string into an int). The benefit is that if you attribute everything ahead of time:

1) You only need to recurse through language types once (the parse-a-Julia-type-into-an-Enzyme-TT step was expensive, so that cost is now saved across all derivative versions).

2) Custom API callbacks in type analysis are much more expensive than the analysis just running without a plugin.

3) Most critically: by putting the full info at each fn, you can get almost all the same info without running interprocedurally. Imagine each function-level TA takes time n without considering sub calls. Whenever new info arrives, TA propagates it into sub calls that would learn from it, and those into their sub calls, and so on; in the worst case you get a ping-pong between deeply nested calls over shared data structures that multiplies that n by a stupid asymptotic (n!, maybe; would need to think).


wsmoses commented 5 days ago

To be clear, there are still possible bad-perf failure modes here, but now they're limited to at most the size of a function instead of the entire program (hence the claim that large applications don't suffer just as a result of being large).


wsmoses commented 5 days ago

Also, I will note that effective usage of this requires the language to provide full type trees for all args and returns of all functions (useful even with this off, but with interprocedural analysis on, Enzyme would spend time trying to derive this even when not specified).

We do this for Julia; I just haven't gotten to doing it for C++ yet.
