General Refactoring Planning

kerbowa commented 4 years ago

Recent discussion in #45 and elsewhere has made it clear that we should make refactoring a priority. We have limited resources, however I believe that a strong argument can be made that these changes will help the project grow and have a positive net effect.

It seems like most active contributors feel similarly, so lets prioritize some of the idea's, and then decide how much time should be invested into this.

Why Refactor

This is an open source project. We would like to increase the number of collaborators. Companies have contacted us wanting to try our these scheduling techniques only to be discouraged by the lack of usability. The open source community respects well maintained code, and if the goal is to invite more collaboration we need to have things like easy-of-use, documentation, CI, ect in place.

The lack of any real testing infrastructure is a big problem. If the end goal is to have this be a tool recognized by the LLVM community, like say Poly for example, then this type of refactoring work is how we should start along that path.

What work needs to be done

Testing infrastructure. We should decide on lit, filecheck, gtest or some combination. In my opinion this is the most critical missing piece right now. It's impossible for new people to navigate the project without unit tests which can show them how the pieces of the scheduler work together.
Refactor scheduling pass manager. Each scheduling pass should be independent. There should be classes which represent a scheduling pass and the code should not rely on flags to change behavior for different passes.
The older code needs to be modernized. That means smart pointers, and variable names that are descriptive. It should be written with C++14 and LLVM programming style in mind. We should use iterators/vectors/maps ect and leverage the LLVM infrastructure whenever possible.
Use tablegen to generate the sched.ini options. The way we handle scheduler settings is horrible. It should be possible for all flags to be added on the CL as well as read from sched.ini. We also need to have sane defaults so that things work without messing with ini files.

I hope we can create some github issues with tasks that will make contributing to this project much easier for new people.

Quincunx271 commented 4 years ago

I've also noticed that we have a lot of protected member variables. It would be much easier to follow the code if these were private variables; that way, I can know what can and can't be changed at a given point of time.

Quincunx271 commented 4 years ago

I would also like to suggest renaming of functions and variables. UpdtRdyLst_ is not much shorter than UpdateRdyList_ or UpdateReadyList_, but those are much easier to read. There's a pervasive removal of vowels: Dynmc, PrirtyLst, RcrsvScsr.

Some shortenings are reasonable (Reg), but there should be very few of these. The pattern I've noticed is that I'm generally fine if it's a single syllable of the larger word, rather than the larger word with vowels removed: Reg vs Rgstr

kerbowa commented 4 years ago

I would also like to suggest renaming of functions and variables.

Agreed. It's frustrating if you don't remember exactly how a variable was spelled. Using English spelling fixes that problem :)

Quincunx271 commented 4 years ago

I don't know if this is a good idea: inline base classes for bases with only one derived type.

For example, GraphNode is only ever a SchedInstruction. It may make the code simpler to get rid of GraphNode entirely in favor of just having SchedInstruction, perhaps leaving GraphNode as a pure interface (no member variables).

Quincunx271 commented 4 years ago

Renaming variables/functions: how do we want to do this? One massive PR seems like a bad idea. That would be impossible to review. But if we split it into tons of PRs, we could easily end up with 10s of PRs. It's not entirely possible to restrict it to just a few files at a time, because we need to fix references to the functions/variables as well.

The options I can think of:

One massive PR (bad option IMO)
Do roughly one file at a time, e.g. alphabetically.
Do one kind of thing at a time (Cnt, Elmnt, Nxt, Frst). This would have a lot of overlapping lines.
As we do other things in the code, rename variables that are nearby what we are working on.

I think that either the 2nd or 4th option are the most realistic. The 4th would take a long time, but I think it may be the most realistic.

kerbowa commented 4 years ago

I think option 3, could be one or two "things" at a time. Maybe the particularly annoying names could be fixed first. For most it's probably just sed and replace.

Quincunx271 commented 4 years ago

Through my quick experiments, some of those already change 1000+ lines of code. That's an unreviewable size. But yeah, multiple things at once is possible. I agree that we should probably focus on the most annoying names first.

vangthao95 commented 4 years ago

Option 3 and 4 sounds good to me. I also agree that annoying names could be fixed first like UpdtRdyLst_. We could also focus on files that are worked on often like bb_spill, sched_region, sched_basic_data, etc.

Quincunx271 commented 4 years ago

I can't find examples inside the LLVM codebase of using TableGen for commandline arguments, which is what I'm understanding the proposal number 4 for replacing sched.ini is talking about. Instead, it seems that https://llvm.org/docs/CommandLine.html is used. Instead of using a sched.ini file, it's possible to load commandline flags from a file (clang argument: --config).

kerbowa commented 4 years ago

I was thinking of building our own tablegen generator. Although having everything be a LLVM CL option is simpler. We would still need some code to pass the options around. I was hoping to have something like SubtargetFeatures https://github.com/llvm/llvm-project/blob/master/llvm/utils/TableGen/SubtargetFeatureInfo.cpp.

Also, clang options are defined via talbegen: https://github.com/llvm/llvm-project/blob/master/clang/include/clang/Driver/Options.td

I think the main issue with the current setup is that retrieving options can be unnecessarily expensive because of the way our config class is implemented.

I'd not heard of reading in CL options from a file before. Apparently that was added in LLVM 6.0. This is only for clang though right? One advantage of the current setup is that it doesn't matter how you are using the scheduler. We are not always doing it via clang. For example we can use llc or AMDGPU and comgr. With --config I think each setup would have a different process for changing options.

Ideally what we implement would have these features:

Config can be changed via CL options or an ini file.
Looking up options is cheap, and can be done using a single class.
We read in options once, instead of each time the MachineScheduler is instantiated.

CSUS-LLVM / OptSched

General Refactoring Planning #47

Why Refactor

What work needs to be done