kathjo closed 5 years ago
We've had numerous discussions with @kathjo on how to make this happen and concluded that the costs will likely outweigh the benefits (for example, DRAM accesses at too small a granularity would massively reduce bandwidth efficiency), so we decided not to pursue this route further. The intention here was to enable larger matrices by overcoming the fetch DRAM block size limitation, and there are easier alternatives for achieving this.
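To illustrate the granularity concern, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which every DRAM access is rounded up to a whole number of fixed-size bursts. The 64-byte burst size and the function name `burst_efficiency` are hypothetical and not taken from this project; real controllers also pay for row activation and alignment, which this model ignores.

```python
def burst_efficiency(access_bytes: int, burst_bytes: int = 64) -> float:
    """Fraction of transferred DRAM bytes that are actually useful.

    Simplified model (an assumption, not a measurement): each access is
    rounded up to a whole number of bursts of `burst_bytes` bytes.
    """
    if access_bytes <= 0 or burst_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: number of whole bursts needed to cover the access.
    bursts = -(-access_bytes // burst_bytes)
    return access_bytes / (bursts * burst_bytes)


if __name__ == "__main__":
    # Compare a few access sizes against the assumed 64-byte burst.
    for size in (8, 16, 64, 96, 256):
        print(f"{size:4d} B access: {burst_efficiency(size):.0%} of transferred bytes useful")
```

Under this model, a 16-byte access uses only a quarter of each 64-byte burst, so the smaller the fetch granularity, the more bandwidth goes to bytes that are discarded.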
Here's the branch with the fetch strategy over k-tiles that we discussed on Friday. There seems to be an issue with the synchronisation.