Array initialization failing for GPU code generation

stonea commented 2 years ago

This is filing an issue based off of the comment here: https://github.com/chapel-lang/chapel/issues/18858#issuecomment-999181985

What I see is that for GPU code generation if I compile the following:

var A : [1..10] real = 1..10;

I get the following error messages:

+ chpl foo.chpl -o foo
warning: The prototype GPU support implies --no-checks. This may impact debuggability. To suppress this warning, compile with --no-checks explicitly
warning: Unknown CUDA version. version.txt: 11.0.228. Assuming the latest supported version 10.1
internal error: Cannot generate unknown type [codegen/cg-type.cpp:52]
warning: Unknown CUDA version. version.txt: 11.0.228. Assuming the latest supported version 10.1
error: opening /tmp/chpl-stonea.deleteme-EJtBtV/chpl__gpu.fatbin: No such file or directory

The warnings are unrelated issues, this is the error I'm interested in for this issue: internal error: Cannot generate unknown type [codegen/cg-type.cpp:52]

If I remove the initialization and compile:

var A : [1..10] real;

Things work fine.

In this comment (https://github.com/chapel-lang/chapel/issues/18858#issuecomment-1005044597) Engin points out that "we switched to creating kernels for all foralls. I think that line turns into a forall that the loop analyzer thinks it's OK to create a kernel for, but it causes some issues"

bradcray commented 2 years ago

Just a curiosity on my part, but does using = 1.23; as an initializer work any better?

stonea commented 2 years ago

Just a curiosity on my part, but does using = 1.23; as an initializer work any better?

Just tried it and that seems to work fine. Not sure why, I assume in either case it lowers to a forall loop.

e-kayrakli commented 2 years ago

We might be generating a range argument to the GPU kernel in the case in the OP, whereas this probably results in just a real argument. So it may have to do with not being able to pass ranges to GPU kernels for some reason.

bradcray commented 2 years ago

Not sure why, I assume in either case it lowers to a forall loop

I think it might be the difference between a zippered forall loop vs. non-, but am not sure.

[edit: E.g., I imagine that = 1..n turns into something like forall (a, i) in zip(A, 1..n) do whereas = 1.23; turns into simply forall a in A do a = 1.23;; however, array initialization is something I'm not as familiar with as array assignment, so I could be off-base in this assumption].

e-kayrakli commented 2 years ago

I got a lead, but I need to context switch.

In the case in the OP, somehow we get a temp with unknown qualifier inside the kernel. This qualifier is created when we call normalize on the outlined function that we create during GPU transformation:

    unknown call_tmp[2021598] "expr temp" "maybe param" "maybe type" "temp"
    (2021601 'move' call_tmp[2021598](2021530 'cast' real(64)[16] chpl_simt_index[2021476]))
    (2021528 '=' init_coerce_tmp[2021526] call_tmp[2021598])
    (2021533 '=' call_tmp[2021519] init_coerce_tmp[2021526])

If you separate var A : [1..10] real = 1..10; into two:

var A : [1..10] real;

forall (a, i) in zip(A, 1..n) do a=i;

It compiles fine. The relevant AST looks like the following in that case:

    const-val coerce_tmp[2016998]:real(64)[16] "dead after last mention" "coerce temp" "insert auto destroy" "temp"
    (2017000 'move' coerce_tmp[2016998](2017002 'cast' real(64)[16] chpl_simt_index[2016948]))
    (2017005 '=' call_tmp[2016991] coerce_tmp[2016998])

It is slightly less normalized with all the qualifiers set right.

While I cannot directly relate the observed error to this, it is clearly an issue, and the only one I can find on a quick look.

e-kayrakli commented 2 years ago

I always felt a bit uneasy calling normalize that late in compilation and not calling anything to resolve the function afterwards. I tried calling tryResolveFunction after normalize, but it didn't help. There might be a bandaid that we can put on this, but maybe we just need to create all our AST in a normalized structure while creating the GPU code. I don't know how burdensome that is, though.

stonea commented 2 years ago

Not to get too off track (let me know if you feel I should move this over to Slack or something) but normalization seems like one of those things that's fundamental enough I better ask now before I'm more senior on the team and would feel sheepish about asking.

I understand normalization (as a general concept), but what does it mean for Chapel? If I want to write IR post-normalization how do I know that what I'm doing is in the normal form?

I guess we have some (perhaps out-of-date) documentation on it here (on page 26): https://github.com/chapel-lang/chapel/blob/main/doc/rst/developer/implementation/compilerOverview/compilerOverview.pdf

Is there a simple explanation for normalization or is normalization a "lot of little things" that can't really be covered in a quick github comment. Is the expectation that we should call normalize or checkNormalize every-time we modify\add IR post normalization or as a Chapel developer should I have a good sense of what normalized vs non-normalized IR looks like?

bradcray commented 2 years ago

I understand normalization (as a general concept), but what does it mean for Chapel?

I was dragging my feet on this, hoping that someone who works more in the compiler than I do these days would answer first (because my impressions are probably not much more up-to-date than that compiler overview doc you linked, but:

The original, main, goal of normalization was to reduce the complexity of the AST/IR that downstream passes would need to see. In its most basic form, I think of it as being the equivalent of generating 3-address code in a conventional compiler, but at a higher-level, so more like 3-expression code such that each statement, regardless of its complexity, gets transformed into something like x=y or x=foo(y,z);. The goal of this was to make subsequent passes like function resolution simpler because they'd only have to deal with a smaller number of patterns, but I was skeptical at the time that it was really necessary (vs., say, simply relying on recursion very heavily). I'm not absolutely confident that I was right, but have regretted not fighting more forcefully for my position over the years, esp. as normalization has become more and more of a thorn in our side. Why is it a thorn? In part because aspects of the user's code are obscured by getting changed from a compact statement to a series of adjacent ones; in part because it introduced new special cases if/when we didn't want the temporaries inserted to represent sub-expressions to be fully realized; in part because it blows up our code size and slows down compilation.

So that's what I think of Chapel normalization as being classically. But then the other part of it is that the "normalize" pass over time has taken on a number of other small sub-transformations / sub-passes that are part of it because (a) it feels like a reasonable time to do the checks/transformations; (b) over time, adding new passes to the compiler has become painful / discouraged. You can get a sense of what kinds of things have accumulated here by looking at some of the routine names that are called in the normalize() routine in normalize.cpp. As a result of this accumulation, when talking about normalize/normalization in Chapel, it can be important to distinguish between whether we're talking about "all the things that happen during normalization" or "the process of turning an arbitrary AST into this three-expression form".

My descriptions are still somewhat vague / high-level, but hopefully this is useful as a starting point.

stonea commented 2 years ago

Thanks for taking the time to respond Brad; this is helpful for me.

In its most basic form, I think of it as being the equivalent of generating 3-address code in a conventional compiler, but at a higher-level, so more like 3-expression code such that each statement, regardless of its complexity, gets transformed into something like x=y or x=foo(y,z);

Ok, got it, I think that explanation would let me easily see\identify normalized\non-normalized code.

Why is it a thorn? In part because aspects of the user's code are obscured by getting changed from a compact statement to a series of adjacent ones

I suppose that's a concern for error handling \ debugability reasons (that is if there was a gdb\lldb style debugger for Chapel code)? Any optimizations\transforms that could leverage the higher level information we'd just do earlier, right?

in part because it introduced new special cases if/when we didn't want the temporaries inserted to represent sub-expressions to be fully realized

So are you saying normalization itself won't apply itself to certain pattern matched code (for optimization reasons). Or do we have some kind of post-normalization optimization comes in to reduce the coat bloat?

So that's what I think of Chapel normalization as being classically. But then the other part of it is that the "normalize" pass over time has taken on a number of other small sub-transformations / sub-passes that are part of it because (a) it feels like a reasonable time to do the checks/transformations; (b) over time, adding new passes to the compiler has become painful / discouraged.

So should I think about these "extras" as being a quintessential part of normalized code (for correctness) or kind a bonus optimization that we just happen to do at the same time.

I guess my concern is, if I'm hand writing IR post-normalization and for whatever reason I don't want to rerun normalization do I need to concern myself with performing these "extras" by hand, or can I just mentally keep myself concerned with the simpler definition of normalization.

As a result of this accumulation, when talking about normalize/normalization in Chapel, it can be important to distinguish between whether we're talking about "all the things that happen during normalization" or "the process of turning an arbitrary AST into this three-expression form".

I suppose my previous question relates to your point here.

To relate this back to Engin's comments:

we get a temp with unknown qualifier inside the kernel. This qualifier is created when we call normalize on the outlined function that we create during GPU transformation

And later:

I always felt a bit uneasy calling normalize that late in compilation and not calling anything to resolve the function afterwards. I tried calling tryResolveFunction after normalize, but it didn't help.

So is the issue (or we think the issue might be) that the normalization pass introduces unresolved function calls? I suppose if I'm adding IR in (say) pass 35 and rerun normalization, which say happens at pass 30, I'd need to concern myself with everything that happens in passes 31-34 as well.

I guess this means if we introduce IR past normalization it isn't enough to just to pass it through normalize() and move on.

e-kayrakli commented 2 years ago

@bradcray -- Thanks for the response. I'll try to answer some more specific questions, but feel free to correct me.

I suppose that's a concern for error handling \ debugability reasons (that is if there was a gdb\lldb style debugger for Chapel code)? Any optimizations\transforms that could leverage the higher level information we'd just do earlier, right?

Right. Post-normalize, it is significantly more difficult to relate the AST to the original user code. As you said, this makes debugging hard, especially for beginners. More importantly, though, it makes complicated analysis a bit difficult. Most of the optimizations I wrote had to workaround this in ways that made them noticeably more complicated. Especially because we resolve after we normalize (well we normalize early to make resolution early, so Catch 22), when you want to make a high-level analysis, you don't have the types yet.

So, currently, we have more than 1 optimization that starts before normalization, adds bunch of new AST and markers because we cannot make decisions at this time in compilation, then at resolution time the optimization is completed. I think the new compiler will fix that to some extent by resolving without normalizing (?)

So are you saying normalization itself won't apply itself to certain pattern matched code (for optimization reasons). Or do we have some kind of post-normalization optimization comes in to reduce the coat bloat?

In this context, I think the former. I remember seeing unexpected not-so-normalized AST especially around (forall?) loops.

We do have a denormalize pass that is an optimization to eliminate temporaries introduced by normalization. It is sort of like an esoteric copy-propagation optimization (we also have a separate copy-prop pass). But denormalize kicks in right before codegen. The purpose of it is just to reduce the code bloat, and it doesn't help with the "AST having too many temporaries and being too different from the user code for optimization purposes" issue.

So should I think about these "extras" as being a quintessential part of normalized code (for correctness) or kind a bonus optimization that we just happen to do at the same time.

I guess my concern is, if I'm hand writing IR post-normalization and for whatever reason I don't want to rerun normalization do I need to concern myself with performing these "extras" by hand, or can I just mentally keep myself concerned with the simpler definition of normalization.

I think more "bonus". That is why I typically choose to call normalize on the new AST I am creating post-normalize, so that I don't have think about all the other things that happen during normalize. That typically works fine. Typically.

Aside for this specific example

I think calling `normalize` on the full `outlinedFunction` can be an overkill. We create that function from a loop body that must have been mostly normalized already. We add new stuff for index calculation, which we may already have been adding in a normalized way (I need to check). Even if we don't, we can first create a `BlockStmt` for that new part of the code. Insert all that stuff in a non-normal way into that block. Normalize the block. And finally `flattenAndRemove` it.

So is the issue (or we think the issue might be) that the normalization pass introduces unresolved function calls? I suppose if I'm adding IR in (say) pass 35 and rerun normalization, which say happens at pass 30, I'd need to concern myself with everything that happens in passes 31-34 as well.

I guess this means if we introduce IR past normalization it isn't enough to just to pass it through normalize() and move on.

Right. Probably a normalize call should be followed by something to resolve the normalized code. Here, I cannot make that happen easily. The reason is that we don't have a proper call to the GPU function and normally you resolve a function for a call to support generics etc.

I think what happens here is that there is some statement in the block that we are normalizing, and normalize doesn't find it to be sufficiently normal. Then, it introduces a new temp for that statement. However, we never go and try to resolve that new temp at all. So, it doesn't get its const qualifier.

chapel-lang / chapel

Array initialization failing for GPU code generation #18914