NVlabs / timeloop

Timeloop performs modeling, mapping and code-generation for tensor algebra workloads on various accelerator architectures.
https://timeloop.csail.mit.edu/
BSD 3-Clause "New" or "Revised" License
303 stars 99 forks source link

Exposing Topology internals for PyTimeloop #237

Closed rengzhengcodes closed 5 months ago

rengzhengcodes commented 5 months ago

Motivation:

angshuman-parashar commented 5 months ago

Some questions:

  1. Can you explain the rationale behind the 2-stage storage accessors, i.e., private GetXXX() and the wrapper public ExposeXXX()? Why not just change the existing methods to public?
  2. Once we converge on these changes, could you back-port them to v3.0.3? I think it should be a clean merge.
rengzhengcodes commented 5 months ago

Addressing 1, the difference is that the exposure explicitly ensures the return type is const, which is something the getters do not do. I was unsure if this was a deliberate design choice with the private getters, being either a concern with mutating objects outside of the class, or if it could be const'd with no effect, so I chose my implementation to make this explicit.

I believe I could backport this; I will do so once I find more information about the above. From skimming the code it seems likely that the getters could just be made public and the const. modifiers added, but I will need to verify this myself later tonight.

angshuman-parashar commented 5 months ago

Yes, if you can just const the existing methods and everything works that's great because (1) the code is cleaner, (2) the diff is cleaner, and (3) they probably should have been const to begin with!

rengzhengcodes commented 5 months ago

Alright, this compiles and works on my end. I think the inputs can also be const'd, but given they're passed by value i see no reason to. Let me know if any other changes need to be made.

angshuman-parashar commented 5 months ago

It's not building on my system (Darwin/clang). There are places in topology.cpp (lines 461, 469, 707) where the return value is being stored in a shared_ptr to a non-const object. If I fix those to const, the compiler is unhappy about line 462 where GetSpecs() is now being called on the const object.

I can try to follow the rabbit hole a little deeper, but I thought I would check with you to make sure you hadn't fixed it already (and perhaps forgot to check it in).

rengzhengcodes commented 5 months ago

Hm, these aren't issues when I'm indirectly compiling timeloop through pytimeloop with the pip install -e . command. I assume this is likely because those are not being picked up. I will investigate further if those lines don't fix themselves by being const'd, otherwise, I think we may need to revert to the previous method. I believe when I looked at all the getters they are const functions themselves but are not declared as such. Perhaps that will fix the issue?

angshuman-parashar commented 5 months ago

Could you try to do a clean independent build of just the timeloop codebase? I'll continue to pull the const threads a little further and see if they end somewhere. Otherwise, I agree we should revert to your original 2-stage solution (Get-vs-Expose).

rengzhengcodes commented 5 months ago

A clean build results in similar errors to the ones you're describing, pulling at the thread further it seems there's a large chain of constable but currently unconst-ed fxns.

angshuman-parashar commented 5 months ago

Yes, I verified this too. We need to use the 2-getter approach (const + non-const). However, instead of calling the const getter Expose, could you try to use the convention described in the top answer here (i.e., same name Get for both methods): https://stackoverflow.com/questions/856542/elegant-solution-to-duplicate-const-and-non-const-getters

And then please verify that a standalone build works with this change.

It's a little concerning that the changes were not picked up when you did a pip install. Is this expected?

rengzhengcodes commented 5 months ago

This approach is having annoying issues with existing functions as well;

image

the compiler seems to have difficulty resolving the two functions despite being distinguished by more than the return type. I will dig further into this but it may be easier to revert to a different named function.

angshuman-parashar commented 5 months ago

Understood. Let's revert back to Plan A.

angshuman-parashar commented 5 months ago

Sorry, I'm seeing the same problem you referenced here: https://github.com/NVlabs/timeloop/pull/237#issuecomment-1904895210

I thought your plan was to go back to the unambiguous Expose() and Get() method names to assist the poor compiler?

rengzhengcodes commented 5 months ago

Oh sorry apologies; I was working on the changes while waiting for the reply to see if going down the rabbit's nest of const statements would in any way be productive. It seems to not be, but

image

situations like the above where a member var is directly passed to an Attribute are relatively terminal to any efforts to neat solutions in the future. I'll push what I currently have, since I managed to const some additional functions which may prove helpful to future contributors, and then revert to the Get/Expose regime.

rengzhengcodes commented 5 months ago

Hm, has something changed with the scons file? Im getting boost compile errors from NV/timeloop master (pasted below) that I've not encountered before. Setup on my end hasn't changed since my last successful timeloop compile, so I'm unsure what's causing the difference:


/usr/bin/ld: cannot find -lboost_thread: No such file or directory
/usr/bin/ld: cannot find -lboost_log_setup: No such file or directory
/usr/bin/ld: cannot find -lboost_filesystem: No such file or directory
/usr/bin/ld: cannot find -lboost_log: No such file or directory
/usr/bin/ld: cannot find -lboost_thread: No such file or directory
/usr/bin/ld: cannot find -lboost_log_setup: No such file or directory
/usr/bin/ld: cannot find -lboost_filesystem: No such file or directory
/usr/bin/ld: cannot find -lboost_log: No such file or directory
/usr/bin/ld: cannot find -lboost_thread: No such file or directory
/usr/bin/ld: cannot find -lboost_log_setup: No such file or directory
/usr/bin/ld: cannot find -lboost_filesystem: No such file or directory
/usr/bin/ld: cannot find -lboost_log: No such file or directory
/usr/bin/ld: cannot find -lboost_thread: No such file or directory
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -lboost_log_setup: No such file or directory
/usr/bin/ld: cannot find -lboost_filesystem: No such file or directory
/usr/bin/ld: cannot find -lboost_log: No such file or directory
/usr/bin/ld: cannot find -lboost_thread: No such file or directory
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
scons: *** [build/timeloop-compound-config-test] Error 1
scons: *** [build/timeloop-simple-mapper] Error 1
scons: *** [build/timeloop-design-space] Error 1
scons: *** [build/timeloop-model] Error 1
scons: *** [build/timeloop-metrics] Error 1
scons: *** [build/timeloop-mapper] Error 1
scons: building terminated because of errors.```
rengzhengcodes commented 5 months ago

Otherwise, it seems to get through all the compiler steps except for BOOST

rengzhengcodes commented 5 months ago

Upon further investigation, the turn at v3.0 doesn't cause the issue. I suspect this may be because I'm currently compiling on apple silicon which has always presented issues with --with_isl, which is mildly annoying, and if I'm not wrong this option was made default some point recently. If this compiles on your machine, it should be good to merge; unfortunately, I think it will take me a while to find a workaround to this issue. May have to resort to compiling through an x86 VM.

angshuman-parashar commented 5 months ago

I'm also on Apple silicon: M1 Pro, Ventura. No issues building either master or v3.0.3. Boost was installed via brew.

Environment has the following additional vars:

export PATH=$PATH:/opt/homebrew/bin
export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/opt/homebrew/include
export C_INCLUDE_PATH=$C_INCLUDE_PATH:/opt/homebrew/include
export LIBRARY_PATH=$LIBRARY_PATH:/opt/homebrew/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/homebrew/lib

source $HOME/dev/timeloop/timeloop/env/setup-env.bash

export TIMELOOP_INCLUDE_PATH=$TIMELOOP_BASE_PATH/include
export TIMELOOP_LIB_PATH=$TIMELOOP_BASE_PATH/lib
export PATH=$PATH:$TIMELOOP_BASE_PATH/bin
rengzhengcodes commented 5 months ago

Ah, I see. Seems a recent bit of cleanup on my end resulted in BOOST's associated env variables being deleted. It now fully compiles, but I am getting the following warnings:

lto-wrapper: warning: using serial compilation of 47 LTRANS jobs
lto-wrapper: warning: using serial compilation of 49 LTRANS jobs
lto-wrapper: warning: using serial compilation of 39 LTRANS jobs

with $scons -j8 --static. However, it does compile with both scons and pip.

Some notes in case this PR needs to be referenced again:

  1. Use of implicit "this" requires any rewrite or addition with strict const requirements to either extensively add const to existing functions or such a method.
  2. I've swapped to using "View" as a clearer alternative to "Expose" for the cases of const output const "Get" functions. Getters now wrap view in a similar method to that described in the SO post, as this should make any future attempts to collapse "View" into "Get" easier.
angshuman-parashar commented 5 months ago

I'm also running into issues with --static (on the master branch): ld: library not found for -lcrt0.o

It seems a fully static build is not doable on Darwin: https://stackoverflow.com/questions/3801011/ld-library-not-found-for-lcrt0-o-on-osx-10-6-with-gcc-clang-static-flag

I'm sure we can tweak the SConscript per the above article's recommendations to make a static build work, but I'm ignoring this for now until someone lights a fire.