NVlabs / timeloop

Timeloop performs modeling, mapping and code-generation for tensor algebra workloads on various accelerator architectures.
https://timeloop.csail.mit.edu/
BSD 3-Clause "New" or "Revised" License

Maximizing PE array utilization in convolution runs #267

Open DanP114 opened 3 months ago

DanP114 commented 3 months ago

Hello there,

I am currently working on designing a 200x200 PE convolution accelerator. I started from the base template provided in the exercises and read through some of the documentation, but my mapping strategies return only about 1-2% utilization.

Attached are my input architecture file, the parsed/processed input, the generated mapping, and the statistics showing the utilization.

My inner PE spatial loop bounds only seem to unroll along the Y axis, with nothing mapped along the X axis. I believe the issue comes from my constraint definitions, but I also have the intuition that the problem dimensions (VGG) are not well suited to such a large PE array, which is why I try mapping more batches.
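For reference, the kind of PE-level spatial constraint I mean looks roughly like this (a simplified sketch, not my exact file; the factor values are illustrative and the top-level key can differ across Timeloop versions):

```yaml
mapspace_constraints:
  - target: PE
    type: spatial
    # A factor of 0 leaves that dimension free for the mapper to unroll;
    # a factor of 1 pins the loop so it is not spatially unrolled.
    factors: N=0 M=0 C=0 P=1 Q=1 R=1 S=1
    permutation: NMC
    split: 1   # partitions the permuted loops between the X and Y mesh dimensions
```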

Any input is appreciated.

arch_conv.txt parsed-processed-input-large-pe-array-multi-batch.txt

timeloop-mapper.stats.txt timeloop-mapper.map.txt

angshuman-parashar commented 1 month ago

There's something odd. Your spec appears to be creating a 200x200 array, but the stats.txt reports 16x16 instances at all inner levels of the hierarchy. Are you sure the stat dump is from this arch?

Overall a 200x200 array is hard to fill spatially. Most mappings will be underutilized, so I suspect the mapper search is just giving up too quickly. Try tweaking the hyperparameters to make it try harder. Also, in your innermost buffer constraints you should add a min parallelism constraint (e.g., 0.5). This will early-reject any mappings that don't have at least 50% utilization. You won't prevent the search heuristic from visiting such mappings, but you will elide the expensive evaluation cost for these mappings.