Exported to text (mangled equations)

Predictive Limitations of Physics-Informed Neural Networks in Vortex Shedding: Reviewer Comments

This paper addresses an important topic (predictive limitations of PINNs), does a thorough, complete, and transparent analysis as evidence of their claims, and uses good scientific practices and writing. I think this paper is clearly appropriate for the journal, clearly of interest to a broad set of readers, and is close to being accepted for publication. I recommend acceptance with minor revisions. I have two major comments and a handful of minor comments.

Major comments:

[ ] The abstract states we find that data free PINNs are unable to predict vortex shedding.
- To say that a neural network is “unable” to do something requires a much higher standard of evidence than saying that a neural net- work is “able” to do something. When a neural network achieves some task, it is an existence proof that doing so is possible. When a neural network fails to achieve some task, that doesn’t prove the task is impossible. Perhaps with a different optimizer, or a different architecture, data-free PINNs could predict vortex shedding.
- There is a second issue, which is that readers excited about the PINN approach are unlikely to agree that this conclusion follows from the experiments, because only the simplest PINN architecture and training strategies were used. As I understand it, much of the research on PINNs is dedicated to discovering tricks that allow PINNs to successfully solve PDEs that the naive approach cannot solve or doesn’t solve efficiently. To make this conclusion more persuasive, the authors should try some of these tricks and show that they don’t work.
- To some extent, neither of these issues can be completely resolved. Informative and useful null results in machine learning research can and should be accepted for publication, even if they are unable to prove conclusively that positive results are impossible to achieve.
- To resolve these issues as much as possible, I would like to see an ablation study. Ablation studies are best practice for empirical work in machine learning. See, for example, Winner’s curse? On pace, progress, and empirical rigor by Sculley et al. and Sources of Irreproducibility in Machine Learning: A Review, by Gunderson et al. Learning data-driven discretizations for partial differential equations has a well-done ablation study in the appendix.
- This ablation study can be added as an additional subsection or appendix; to reduce the amount of required effort, I suggest only performing an ablation study for the data-free, unsteady PINNs and only reporting the lift and drag coefficients (like in either Table 2 or figure 11).
- While most ablation studies remove components of neural networks to see whether those components are necessary to give a positive result, this ablation study would add components to see if they help give a positive result.
- At a minimum, the ablation study should try the following to see if they improve performance:
  1. using a so-called ‘prior dictionary’ to enforce the boundary condi- tions, see ModalPINN: An extension of physics-informed Neural Networks with enforced truncated Fourier decomposition for pe- riodic flow reconstruction using a limited number of imperfect sensors.
  2. using different activation functions, including σ(x) = sin x, see Ibid.
  3. using a loss function intended to ‘respect causality’, see Respect- ing causality is all you need for training physics-informed neural networks.
  4. using different hyperparameters.
  5. changing the coefficients of the loss function.
  6. any other tricks that people might use to improve the conver- gence or training of PINNs.
- I suggest also including the tuning methodology for the chosen hyperparameters, as this is best practice for empirical work in machine learning (see Winners Curse? by Sculley et al.). Just as it is easy to find irreproducible positive results in machine learning due to badly designed hyperparameter tuning, it is also esay to find irreproducible negative results for the same reason.
[x] Reference [7] (Rohrhofer et al.) says that the vortex shedding optimization problem with PINNs “can be resolved, e.g., by using truncated Fourier decomposition with PINNs (Raynaud et al., 2022).” Is it true that Raynaud solves Vortex shedding with data-free PINNs? If so, you should try to re- produce the result, or state prominently that it is possible to solve vortex shedding with data-free PINNs. If not, you should discuss (Raynaud et al., 2022) in the introduction/related work and state how the results from each paper are related.
- Based on my reading of Raynaud et al., they don’t solve vortex shedding using data-free PINNs. Instead, they use sparse observational data and thus are in a data-driven regime. They also assume that they know the fundamental frequency of vortex shedding, which seems like cheating to me. In any case, you should confirm that my understanding of this paper is correct.

Minor comments:

[x] While a numerical approximation (e.g., finite difference) may be a more robust choice . . . I don’t understand why finite difference would be a more ‘robust’ choice than automatic differentiation? If anything, I would expect the opposite to be true. Finite difference (to compute gradients with respect to the inputs of a PINN neural network) would have both rounding error and truncation error, and the step size may not be cho- sen appropriately. Automatic differentiation, by contrast, would return the exact gradient up to numerical precision. Finite difference is likely a more efficient choice than automatic differentiation for computing the derivatives with respect to the input. Though the runtime in both cases would be linear in the size of the neural network, when I compared the two methods the constant factor was higher with AD than with FD. Regard- less of whether AD or FD is used to compute gradients with respect to the inputs of a PINN network, AD will make the procedure for computing the gradients with respect to the parameters of the loss function much easier, and will be dramatically more efficient than FD. In summary: AD is a good tool for PINNs, but the efficiency can be increased at the cost of decreased robustness by using FD to compute the derivatives with respect to the inputs of the PINN.
[x] Figure 1 should probably say $\sum\limits_{j=1}^{10} c_j L_j^2$ for some coefficients $cj$ rather than $\sum\limits{j=1}^{10} J_j^2$, since this is how PINNs normally formulate the loss function.
[x] There are typos in references 27 and 28.
[x] reaises is a typo in the abstract.
[x] Figures 17 and 18 should label which column is PetIBM and which is the data-driven PINN.
[x] The last paragraph of the conclusion is, I believe, based on an incorrect understanding of the relationship between supervised learning and data-driven PINNs. Data-driven PINNs are meant to be given sparse observations of a physical system known to obey certain physical laws; their goal is to infer the full solution with partial knowledge of the system state and some knowledge of the governing equations. See, for example, Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. This capability is unlike anything that supervised learning can do. Thus, the two classes of methods cannot be compared quantitatively in any meaningful way.

Ablation study reference:

Sculley, D., Snoek, J., Wiltschko, A., & Rahimi, A. (2018). Winner's curse? On pace, progress, and empirical rigor. https://openreview.net/forum?id=rJWF0Fywf
Gundersen, O. E., Coakley, K., Kirkpatrick, C., & Gil, Y. (2022). Sources of irreproducibility in machine learning: A review. arXiv preprint arXiv:2204.07610. https://arxiv.org/abs/2204.07610
Bar-Sinai, Y., Hoyer, S., Hickey, J., & Brenner, M. P. (2019). Learning data-driven discretizations for partial differential equations. Proceedings of the National Academy of Sciences, 116(31), 15344-15349. https://www.pnas.org/doi/abs/10.1073/pnas.1814058116

Also:

Wang, S., Sankaran, S. and Perdikaris, P., 2022. Respecting causality is all you need for training physics-informed neural networks. arXiv preprint arXiv:2203.07404.

The abstract states [we] find that data free PINNs are unable to predict vortex shedding.

Yeah, we really need to rephrase that. We really just can say that the PINN method we tried (limited by what the Modulus framework allows) does not predict vortex shedding.

Changed to "data-free PINNs failed to predict vortex shedding in our settings."

Comit b43ce55

The abstract states we find that data free PINNs are unable to predict vortex shedding.

[x] To say that a neural network is “unable” to do something requires a much higher standard of evidence than saying that a neural net- work is “able” to do something. When a neural network achieves some task, it is an existence proof that doing so is possible. When a neural network fails to achieve some task, that doesn’t prove the task is impossible. Perhaps with a different optimizer, or a different architecture, data-free PINNs could predict vortex shedding.

Reply in previous comment.

[x] There is a second issue, which is that readers excited about the PINN approach are unlikely to agree that this conclusion follows from the experiments, because only the simplest PINN architecture and training strategies were used. As I understand it, much of the research on PINNs is dedicated to discovering tricks that allow PINNs to successfully solve PDEs that the naive approach cannot solve or doesn’t solve efficiently. To make this conclusion more persuasive, the authors should try some of these tricks and show that they don’t work.

Reply

To be honest, readers excited about PINNs are unlikely to be swayed by anything we say, as non-members of the club. We are more interested in readers who are open minded, sharing our experience of what didn't work, so they might be careful with the pull to jump into the field just because it's hot. We used in our paper an open source library by NVIDIA (Modulus) in its default setting, which we think will be the way most people will use it. Any new "tricks" would involve code modifications, with the necessary testing and code verification that this implies. We think it is good to publish negative results, even if it doesn't convince everyone. Moreover, our results are fully transparent and reproducible.

[x] To some extent, neither of these issues can be completely resolved. Informative and useful null results in machine learning research can and should be accepted for publication, even if they are unable to prove conclusively that positive results are impossible to achieve.

We agree!

[ ] To resolve these issues as much as possible, I would like to see an ablation study. Ablation studies are best practice for empirical work in machine learning. See, for example, Winner’s curse? On pace, progress, and empirical rigor by Sculley et al. and Sources of Irreproducibility in Machine Learning: A Review, by Gunderson et al. Learning data-driven discretizations for partial differential equations has a well-done ablation study in the appendix. This ablation study can be added as an additional subsection or appendix; to reduce the amount of required effort, I suggest only performing an ablation study for the data-free, unsteady PINNs and only reporting the lift and drag coefficients (like in either Table 2 or figure 11). While most ablation studies remove components of neural networks to see whether those components are necessary to give a positive result, this ablation study would add components to see if they help give a positive result. At a minimum, the ablation study should try the following to see if they improve performance:

using a so-called ‘prior dictionary’ to enforce the boundary condi- tions, see ModalPINN: An extension of physics-informed Neural Networks with enforced truncated Fourier decomposition for pe- riodic flow reconstruction using a limited number of imperfect sensors.

using different activation functions, including σ(x) = sin x, see Ibid.

using a loss function intended to ‘respect causality’, see Respecting causality is all you need for training physics-informed neural networks.

using different hyperparameters.

changing the coefficients of the loss function.

any other tricks that people might use to improve the conver- gence or training of PINNs.

[ ] I suggest also including the tuning methodology for the chosen hyperparameters, as this is best practice for empirical work in machine learning (see Winners Curse? by Sculley et al.). Just as it is easy to find irreproducible positive results in machine learning due to badly designed hyperparameter tuning, it is also esay to find irreproducible negative results for the same reason.

Reply to the last point:

We added a paragraph to explain why we only show specific combinations of hyperparameters in this work. None of other combinations we've tried would change the conclusions in this paper.

Commits: c2b3ff1625e2394eca4508a843fceb6e83fb3ead and 36a78ad72cdfb9754588faa96a45509dbef012e0

Reply to the ablation study:

While most ablation studies remove components of neural networks to see whether those components are necessary to give a positive result, this ablation study would add components to see if they help give a positive result.

To our best knowledge, the ablation study in both biology and machine learning is a technique of investigating the functionality and importance of each component in a system by removing components one by one and see the effect. This means the prerequisite of an ablation study is to have a system that is working. The system may not need to work perfectly, but at least it needs to give characteristics of our interest. In our case, we need at least one PINN that can give vortex shedding, regardless how quantitatively accurate it is. Only after we have one working PINN then we can conduct an ablation study.

Unfortunately, for data-free PINNs, we haven't had such a working case.

If we extend the meaning of ablation study by adding new things to the system and see if it works, then it is basically a trial-and-error approach, which is itself a full-scale study that requires non-trivial time to design the study, good reasoning to justify such a design (given that there are infinite new things we can try in/add to PINNs), and a full-length paper to present such a study.

On the other hand, the data-driven PINNs can generate vortex shedding in an interpolation manner. We agree that an ablation study of conventional meaning can be carried out on data-driven PINNs. And such an ablation study can indeed potentially hint us on what components in PINNs play critical roles generating vortex shedding. Again, such a study also deserve a full-length paper and can be a future work.

Commit: 9901fb9fdcdf569b13d67cb49ec414d386478664

Reference [7] (Rohrhofer et al.) says that the vortex shedding optimization problem with PINNs “can be resolved, e.g., by using truncated Fourier decomposition with PINNs (Raynaud et al., 2022).” Is it true that Raynaud solves Vortex shedding with data-free PINNs? If so, you should try to re- produce the result, or state prominently that it is possible to solve vortex shedding with data-free PINNs. If not, you should discuss (Raynaud et al., 2022) in the introduction/related work and state how the results from each paper are related.

Based on my reading of Raynaud et al., they don’t solve vortex shedding using data-free PINNs. Instead, they use sparse observational data and thus are in a data-driven regime. They also assume that they know the fundamental frequency of vortex shedding, which seems like cheating to me. In any case, you should confirm that my understanding of this paper is correct.

Reply:

The reviewer is correct: Raynaud et al. 2022 do not, offer a solution to vortex shedding with data-free PINNs. It appears that Reference [7] is incorrectly citing this paper, which uses data in the loss function (see Fig. 2 of the paper). Note this quote from section 4.1 of Raynaud et al.: "Time sampling for equations penalisation is performed over the simulation data range since the classic PINN is not able to extrapolate the periodic phenomena outside its trained time range."

(From a high-level understanding, the ModalPINN probposed by Raynaud et al. seems to be a variant of the spectral methods---which has been used for decades.)

While a numerical approximation (e.g., finite difference) may be a more robust choice . . . I don’t understand why finite difference would be a more ‘robust’ choice than automatic differentiation? If anything, I would expect the opposite to be true. Finite difference (to compute gradients with respect to the inputs of a PINN neural network) would have both rounding error and truncation error, and the step size may not be chosen appropriately. Automatic differentiation, by contrast, would return the exact gradient up to numerical precision. Finite difference is likely a more efficient choice than automatic differentiation for computing the derivatives with respect to the input. Though the runtime in both cases would be linear in the size of the neural network, when I compared the two methods the constant factor was higher with AD than with FD. Regardless of whether AD or FD is used to compute gradients with respect to the inputs of a PINN network, AD will make the procedure for computing the gradients with respect to the parameters of the loss function much easier, and will be dramatically more efficient than FD. In summary: AD is a good tool for PINNs, but the efficiency can be increased at the cost of decreased robustness by using FD to compute the derivatives with respect to the inputs of the PINN.

Reply:

The robust here refers to the ability that a scheme can work regardless of the problem types and use cases. In a nutshell, a robust method doesn't easily break down when being applied to tough or new problems. We were thinking AD and FD from the angle of general numerical methods, not just for the vanilla PINNs in this paper.

We consider FD robust because it usually just works regardless what equations we are solving or what solving procedure we use. It may not work efficiently and may not be accurate, but it works. When we don't have enough computing resource, we can choose to sacrifice the accuracy, and FD still works. Moreover, it is controllable in most of case. Most of time we know how results will change w.r.t. changes in hyper-parameters. For example, we know central difference converges in 2nd order, so if we want a specific level of accuracy, we know how to control it. We can even estimate how much computational resources and time we'll need. With tools such as Richardson extrapolation, we are also able to estimate the most accurate solution without actual running the code in very small step sizes. And using adaptive schemes (e.g., adaptive refinement, adaptive time-marching, etc.) can further reduce the need for tuning hyper-parameters like step sizes. Finally, when it comes to parallel computing, finite difference is easy to scale up using either strong-scaling or weak-scaling.

On the contrary, AD may be faster in some cases and exact w.r.t. the computational graph. But it may not always work. For example, per the authors' experiences, when the problem involves complex numbers and functions, when it involves things like integro-differential equations, when the forward propagation involves MCMC sampling, or when the forward calculation involves nonlinear and high-order derivatives (e.g., convection $uv\frac{\partial u}{\partial y}$ or diffusion $\frac{\partial}{\partial x} g \frac{\partial f}{\partial x}$). AD depends on the computational graph of the calculations, causing the high-order derivatives expensive to obtain for even the vanilla MLP neural networks. For a naive implementation of AD, the growth of the computational graph and load are likely to be exponential w.r.t. the order of derivatives. Even with the most advanced AD implementation, the growth of the computational graph and the load are unlikely to be linear. So for solving complicate equations, AD may not work due to the required computing resources (like memory). It may still be not working even if we are willing to sacrifice accuracy in return of relaxed computing resources. Not to mention that, during training, most optimizes will put one extra order of derivatives to the computational graph. So AD's performance is difficult to control and predict without knowing the details of the computational graphs. And for parallel computing, AD is hard to scale up in strong-scaling sense due to the dependency on the computational graph, though AD works well in weak-scaling.

Figure 1 should probably say $\sum\limits_{j=1}^{10} c_j L_j^2$ for some coefficients $cj$ rather than $\sum\limits{j=1}^{10} J_j^2$, since this is how PINNs normally formulate the loss function.

Reply:

Done in commit(s): f8c46c71e88a818c89dec327531ff6b931736227 and 001f47d3e3b3d4122bc92cc170f0758e82befc6a

There are typos in references 27 and 28.

Reply:

Ref 27: fixed; in title, an -> Can
Ref 28: If the typo referred to the word DIscrete, this is the exact word used in the paper's title. Nevertheless, we found this preprint was recently published in Jan 2024 in a journal, so we updated the reference with the published version.

Done in commit(s): 62c3bf6578410f8f62f4a773298db4ead0f43ea4

reaises is a typo in the abstract.

Reply:

Done in commit(s): 96fea22fd97cdbc45f30eee3dbb2aac19942080c

Figures 17 and 18 should label which column is PetIBM and which is the data-driven PINN.

Reply:

Done in commit(s): 8fff958c0b4f52dd4d76fa52eeedd29436ea2b42

The last paragraph of the conclusion is, I believe, based on an incorrect understanding of the relationship between supervised learning and data-driven PINNs. Data-driven PINNs are meant to be given sparse observations of a physical system known to obey certain physical laws; their goal is to infer the full solution with partial knowledge of the system state and some knowledge of the governing equations. See, for example, Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. This capability is unlike anything that supervised learning can do. Thus, the two classes of methods cannot be compared quantitatively in any meaningful way.

Reply:

What we wanted to convey in the last paragraph of the discussion are:

Data-free PINN can not compete with traditional numerical methods, which leaves data-driven PINN the only hope.
Data-driven PINNs only works well for interpolation not extrapolation, just like classical deep learning.
Data-driven PINNs work with sparse data, which classical deep learning is bad at
Data-driven PINNs are much more computationally expensive, nevertheless. So it is up to users to determine if it is cheaper to get more data, or it is cheaper to train a data-driven PINNs.

Revised in commit(s): 22f3980705afdfd3b8ea6ed590e91be97edcd013 and minor edit in 66eb596

Additional notes related to the request of an ablation study:

Looking at Figure 10: each one of the runs took in the order of 30h to run in 1 GPU. In the whole paper, we already report on work involving a few dozen such runs, over months. (n the paper, when you see a run, we had to run several more to check, confirm, or try things out. In an ablation study (or "reverse ablation") we would need to tray many tweaks and run many cases. This also requires code modifications, writing new code with Modulus, which has to e tests and verified. We estimate this might take several months to complete (especially under the requirements of strict reproducibility that we operate in). It could in fact be a whole new paper. It would also require securing computational resources accordingly—for this study, we used an NVIDIA cluster that we no longer have access to.

Moreover, the goal of the paper is not to "fix" the PINN method, but to show that it can fail, and try to understand why it might fail (we partially arrive at an answer, and we offer some hypothesis from our analysis). In the dissertation by Pi-Yueh Chuang, several experiments tried a variety of things: different NN hyper parameters, number of neurons per layer, number of layers, also tied different weighting of loss terms, and adaptive weighting (annealing), also different learning rate scheduling and stochastic weight averaging. None of this helped in either accuracy or performance.

barbagroup / jcs_paper_pinn

Reviewer 2 comments #4

Exported to text (mangled equations)

Reply

Reply to the last point:

Reply to the ablation study:

Reply:

Reply:

Reply:

Reply:

Reply:

Reply:

Reply: