tbetcke opened 3 years ago
Introduction
- Say not just that it is a case study of Python for HPC, but a case study of a performance-critical algorithm with complex data structures.
- Sri: Yes, you're right, this is important; I'll add this.
- Why is the library only single precision? Right now everything runs on the CPU, so restricting it to single precision does not make sense.
- Sri: The library originally supported both double and single precision; however, performance was degraded quite a lot for some of the operators. I haven't yet debugged this or worked out precisely why, which is why I've restricted it for now.
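A minimal sketch of the kind of experiment that could isolate the precision question: the same Numba-compiled direct (P2P) Laplace evaluation timed in single and double precision. `laplace_p2p` and `time_kernel` are illustrative stand-ins, not PyExafmm code.

```python
import time

import numba
import numpy as np


@numba.njit(cache=True, fastmath=True)
def laplace_p2p(sources, targets, charges):
    """Illustrative direct (P2P) evaluation of the 1/r Laplace kernel."""
    nsources = sources.shape[0]
    ntargets = targets.shape[0]
    potentials = np.zeros(ntargets, dtype=charges.dtype)
    for i in range(ntargets):
        acc = potentials[i]  # accumulate in the array's own precision
        for j in range(nsources):
            dx = targets[i, 0] - sources[j, 0]
            dy = targets[i, 1] - sources[j, 1]
            dz = targets[i, 2] - sources[j, 2]
            r = np.sqrt(dx * dx + dy * dy + dz * dz)
            if r > 0:
                acc += charges[j] / r
        potentials[i] = acc
    return potentials


def time_kernel(dtype, n=5000, repeats=5):
    rng = np.random.default_rng(0)
    sources = rng.random((n, 3)).astype(dtype)
    targets = rng.random((n, 3)).astype(dtype)
    charges = rng.random(n).astype(dtype)
    laplace_p2p(sources, targets, charges)  # warm up: triggers compilation
    start = time.perf_counter()
    for _ in range(repeats):
        laplace_p2p(sources, targets, charges)
    return (time.perf_counter() - start) / repeats


for dtype in (np.float32, np.float64):
    print(dtype.__name__, time_kernel(dtype))
```

Comparing the two timings on the target machine would show whether the slowdown is a general vector-width/bandwidth effect or something specific to particular operators.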
Fast Multipole Methods
- I don't think your kernel includes the permittivity parameter in the implementation (or does it?). This is confusing in the text.
- Sri: It doesn't, but the physics problem does, so I wasn't sure whether omitting it would be unclear to someone who knows the equation. I'll remove it.
- Your description of local and multipole expansions is confusing. A multipole expansion is valid in the exterior of the box that contains the charges. A local expansion is valid in the interior of the box.
- In your implementation, since translation operators are based on low-rank approximations, there is a loss of accuracy when translating expansions. But we have control over it.
- Just 'decay characteristic' is not enough. The Helmholtz kernel decays at the same rate as the Laplace kernel but cannot be effectively low-rank approximated, depending on the region and wavenumber. Smoothness and non-oscillatory behaviour are the important properties.
- Sri: I should clarify these points; however, I was purposely trying to avoid being too specific.
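To illustrate the smoothness-versus-decay point, a small sketch that compares the numerical ranks of Laplace and Helmholtz interaction matrices between two well-separated clusters; both kernels decay like 1/r, but the Helmholtz rank grows with the wavenumber. The geometry, cluster sizes, and tolerance are illustrative choices, not taken from the paper.

```python
import numpy as np


def interaction_matrix(kernel, sources, targets):
    """Dense kernel matrix K[i, j] = kernel(|targets[i] - sources[j]|)."""
    diff = targets[:, None, :] - sources[None, :, :]
    r = np.linalg.norm(diff, axis=-1)
    return kernel(r)


def numerical_rank(matrix, tol=1e-6):
    """Number of singular values above tol relative to the largest one."""
    sigma = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(sigma > tol * sigma[0]))


rng = np.random.default_rng(0)
# Two unit boxes whose centres are separated: a pair of well-separated clusters.
sources = rng.random((400, 3))
targets = rng.random((400, 3)) + np.array([3.0, 0.0, 0.0])

laplace = lambda r: 1.0 / r
for k in (1.0, 10.0, 50.0):  # illustrative wavenumbers
    helmholtz = lambda r, k=k: np.exp(1j * k * r) / r
    print(
        "wavenumber", k,
        "Laplace rank:", numerical_rank(interaction_matrix(laplace, sources, targets)),
        "Helmholtz rank:", numerical_rank(interaction_matrix(helmholtz, sources, targets)),
    )
```

Only the oscillatory factor changes between the two kernels, yet it is what controls how compressible the far-field interaction is.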
- You talk about analytical expansions generically as what lets the FMM achieve O(N). Only the classical FMM uses analytical expansions; the KIFMM also achieves O(N) behaviour. The key is the constant cost per box.
- Sri: I thought that I had made this clear with a paragraph on the translations plus the constant cost per box, but I can reword this.
- The KIFMM also works for oscillatory problems. The issue, however, is that the ranks become dependent on the rate of oscillation.
- Sri: I had a section discussing this in earlier drafts, but deleted it due to word constraints.
- Why do you talk about truncated SVD and square linear systems? I think you use a least-squares solve with oversampling. You should state this.
- Sri: No, I do use a truncated SVD to calculate a pseudo-inverse, not a least-squares solve.
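For reference, a sketch of the distinction being discussed: a pseudo-inverse assembled from a truncated SVD versus a direct least-squares solve. The matrix here is a random stand-in, not PyExafmm's actual check-to-equivalent operator; the two approaches coincide when nothing is truncated and differ (the SVD route regularises) when small singular values are discarded.

```python
import numpy as np


def truncated_svd_pinv(matrix, tol=1e-7):
    """Pseudo-inverse built from a truncated SVD: drop singular values
    below tol * sigma_max, invert the rest."""
    u, sigma, vt = np.linalg.svd(matrix, full_matrices=False)
    keep = sigma > tol * sigma[0]
    return (vt[keep].T / sigma[keep]) @ u[:, keep].T


rng = np.random.default_rng(0)
# Stand-in for a rectangular check-to-equivalent surface matrix.
check_to_equiv = rng.random((80, 50))
rhs = rng.random(80)

# Truncated-SVD pseudo-inverse applied to the right-hand side ...
x_svd = truncated_svd_pinv(check_to_equiv) @ rhs
# ... versus a direct least-squares solve of the same overdetermined system.
x_lstsq, *_ = np.linalg.lstsq(check_to_equiv, rhs, rcond=None)

print(np.allclose(x_svd, x_lstsq))
```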
- Give a reference to a paper explaining the U, V, and X lists.
Numba
- Careful with AVX-512. Clang does not enable it automatically, and since Numba is built on the same LLVM backend I think it doesn't either. The issue is that CPUs have to clock down a lot for AVX-512, which costs time.
- Sri: I read about this in the Numba docs, but I've never tested it myself.
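One way to settle this empirically rather than from the docs is to inspect the assembly Numba actually generates and look for AVX-512 (zmm) registers. A hedged sketch with an illustrative `saxpy` loop rather than a PyExafmm kernel:

```python
import numba
import numpy as np


@numba.njit(fastmath=True)
def saxpy(a, x, y):
    # A simple loop that LLVM should auto-vectorise.
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = a * x[i] + y[i]
    return out


x = np.ones(1024, dtype=np.float32)
y = np.ones(1024, dtype=np.float32)
saxpy(np.float32(2.0), x, y)  # trigger compilation for this signature

# zmm registers in the generated assembly indicate AVX-512 was used;
# ymm registers indicate AVX/AVX2.
asm = next(iter(saxpy.inspect_asm().values()))
print("uses zmm (AVX-512):", "zmm" in asm)
print("uses ymm (AVX/AVX2):", "ymm" in asm)
```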
- 'Numba failing silently' is misleading. It depends on whether one uses object mode or not.
- Sri: Yes, this isn't specific; I meant to say that it falls back to object mode but appears to still work.
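A small sketch of the distinction: object mode happily runs unsupported code at interpreted speed (older Numba releases fell back to it automatically with only a warning, the "appears to still work" case, and newer releases have removed that fallback), whereas nopython mode fails loudly. The `Point` class is just an illustrative unsupported type.

```python
import numba
from numba.core.errors import TypingError


class Point:
    """A plain Python class: not supported by Numba's nopython mode."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def norm_squared(p):
    return p.x * p.x + p.y * p.y


p = Point(3.0, 4.0)

# Explicit object mode runs the function, but at interpreted speed.
print(numba.jit(norm_squared, forceobj=True)(p))

# nopython mode (@numba.njit) refuses to compile and raises a TypingError,
# so the failure is explicit rather than silent.
try:
    numba.njit(norm_squared)(p)
except TypingError as exc:
    print("nopython mode rejected the function:", type(exc).__name__)
```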
- Numba does not achieve high performance for array operations; it is NumPy that does this. If I only had operations on large arrays, I would not recommend Numba.
- Sri: I meant to say something more along the lines of Numba loading values from arrays, as well as having a link to NumPy with its optimized methods (multiplication, addition, etc.).
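A sketch of the distinction being made: a whole-array expression is already executed by NumPy's compiled routines, while an explicit loop that loads individual values from arrays is where Numba pays off. Both functions below are illustrative, not PyExafmm code.

```python
import numba
import numpy as np


def numpy_axpy(a, x, y):
    # A whole-array expression: already executed by NumPy's compiled loops,
    # so wrapping it in Numba gains little.
    return a * x + y


@numba.njit
def numba_min_pairwise_distance(points):
    # An explicit double loop that loads scalars from the array: very slow in
    # pure Python, but compiled to native code by Numba.
    n = points.shape[0]
    best = np.inf
    for i in range(n):
        for j in range(i + 1, n):
            dx = points[i, 0] - points[j, 0]
            dy = points[i, 1] - points[j, 1]
            dz = points[i, 2] - points[j, 2]
            d = dx * dx + dy * dy + dz * dz
            if d < best:
                best = d
    return np.sqrt(best)


rng = np.random.default_rng(0)
print(numpy_axpy(2.0, rng.random(10), rng.random(10)))
print(numba_min_pairwise_distance(rng.random((500, 3))))
```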
- Spread eq 17 over two lines.
Performance Comparisons
- The performance section is lacking information. First, what is meant by SOTA?
- Sri: This abbreviation is specified, just early on in the text: 'state of the art'.
- Sri: The problem is the word limit; the content is already at 5200 words plus 1000 words for figures and tables, which is hitting the limit of 6250. I thought that a brief results section would suffice as long as I discussed some aspects of how to use Numba. However, I can remove words from the introductory sections.
- Where does PyExafmm actually lose performance?
- Sri: This is not specific enough, but it loses performance in the M2L step, which dominates runtimes. I should do another experiment for this.
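One possible experiment along these lines: run a full evaluation under `cProfile` and sort by cumulative time to confirm which operator dominates. The callable passed in below is a dummy stand-in; in practice it would be a PyExafmm evaluation.

```python
import cProfile
import pstats

import numpy as np


def profile_run(fmm_callable, top=15):
    """Run an FMM evaluation under cProfile and print the functions with the
    largest cumulative time, to check which operator (e.g. M2L) dominates."""
    profiler = cProfile.Profile()
    profiler.enable()
    fmm_callable()
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(top)


# Dummy stand-in so the snippet runs on its own; replace with a full
# PyExafmm evaluation to attribute time to the individual operators.
profile_run(lambda: np.linalg.svd(np.random.rand(300, 300)))
```

cProfile attributes time at the level of the Python-visible calls (the Numba dispatchers), which is coarse but enough to see where the runtime concentrates.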
- You should give comparisons of the individual operator types.
- Sri: Yes I will do this.
- Benchmark specific translation operators in Exafmm and PyExafmm against each other, and explain the performance differences.
- Sri: Yes, I will do this.
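A generic harness of the kind such a comparison could use: time the same operator applied in two different ways, with a warm-up call so JIT compilation is excluded from the measurement. The "operators" below are dense matrix stand-ins with illustrative sizes, not the actual Exafmm/PyExafmm translation operators.

```python
import time

import numpy as np


def benchmark(fn, *args, repeats=10):
    """Best wall-clock time over several repeats, after one warm-up call."""
    fn(*args)  # warm up (e.g. trigger JIT compilation, fill caches)
    best = np.inf
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical stand-ins for "the same translation operator in two codes":
# two ways of applying a dense M2L-like matrix to every source box's expansion.
def apply_per_box(operator, expansions):
    return np.stack([operator @ e for e in expansions])


def apply_batched(operator, expansions):
    return expansions @ operator.T


rng = np.random.default_rng(0)
operator = rng.random((128, 128))     # illustrative expansion size
expansions = rng.random((2000, 128))  # one expansion vector per source box

t_loop = benchmark(apply_per_box, operator, expansions)
t_batch = benchmark(apply_batched, operator, expansions)
print(f"per-box loop: {t_loop:.4f} s, batched: {t_batch:.4f} s")
```

For jitted PyExafmm operators the warm-up call matters; otherwise compilation time is counted against the first measurement and distorts the comparison.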