Should we benchmark languages other than Fortran, why, and how?

milancurcic commented 3 years ago

I see great value in implementing a variety of simple yet real-world algorithms in Fortran and benchmarking them along multiple axes:

Different problem sizes (e.g. array or matrix size)
Different compilers
Different optimization flags
Different hardware

How about different languages? What would be the main purpose of that?

Are we interested in comparing the performance of Fortran and other language implementations, using idiomatic, naive code (i.e. the code that a novice would write), and thus comparing the compilers' capability to optimize?

Or are we interested in writing code in different languages that produces the same (or as similar as possible) assembly, and then compare the source code?

awvwgk commented 3 years ago

I would consider this benchmark repository also as an example repository to learn about possible usage of the language to write performant code. Having different languages in the benchmarks could serve as a kind of Rosetta Stone for users proficient in optimizing code in another language trying to learn about optimization in Fortran.

certik commented 3 years ago

Copying my comment from #9.

We discussed quite a few times, e.g. in #2, to include other languages.

Why: to actually have a comparison across languages. I expect that across Fortran/C/C++ perhaps even Julia one can fiddle with the code to eventually get similar speed. And we should have that final code. But also what would be interesting (for me) would be code that you would reasonably write as a domain expert, say a physicist. And we should have those codes too.

How: that has to be discussed, for now let's have it in some form, such as in #9. I expect we'll have benchmarks for smaller arrays, that have to be run many times. I have a runner for that somewhere, I'll see if I can contribute it. Then I expect to have longer running tests, such as #9, which are not as sensitive to how it is timed.
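A runner for short kernels of the kind described above might look like the following minimal Python sketch. It is not the runner mentioned in the comment. The names `time_kernel` and `axpy` are hypothetical, and the sketch assumes that calling the kernel many times per measurement and reporting the best per-call time is an acceptable way to reduce timer noise.

```python
import time


def time_kernel(kernel, repeats=5, inner_loops=1000):
    """Return the best observed time per call, in seconds.

    The kernel is called `inner_loops` times per measurement so that
    very short runs are not dominated by timer resolution, and the
    measurement is repeated `repeats` times to reduce noise.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(inner_loops):
            kernel()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / inner_loops)
    return best


def axpy(n=100):
    """Hypothetical small-array kernel: y = a*x + y on plain lists."""
    a = 2.0
    x = [1.0] * n
    y = [0.5] * n
    return [a * xi + yi for xi, yi in zip(x, y)]


if __name__ == "__main__":
    per_call = time_kernel(axpy)
    print(f"axpy(100): {per_call:.3e} s per call")
```

Longer-running tests would need only `inner_loops=1`, since a single run already dwarfs the timer resolution.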

milancurcic commented 3 years ago

@certik I guess I'm asking for more clarity beyond what's been already discussed. I interpret this as, there will be at least two versions of each problem in each language:

A hand-optimized implementation which is the fastest one to produce the correct result (anything goes, as long as the result is correct). I think the purpose of this would be to compare and be able to say, what's the fastest possible you can get for a given problem, for a given language implementation (e.g. GFortran, LFortran, GCC, Clang, Julia, etc.).
A naive, idiomatic, simplest implementation. I think the purpose of this would be to compare how expressive different languages are (syntax, built-in functions, etc.). The comparison of the timing will also tell us something here, perhaps how good an implementation is at optimizing idiomatic code.

Is the above aligned with your view? Is there some other comparison that is missing, besides these two?

certik commented 3 years ago

I would expect to have many versions of the same benchmark. Besides the two you mentioned, also one that allows assembly intrinsics and one that does not. One that perhaps uses more array operations, one that does not. If you look at the benchmarks in https://benchmarksgame-team.pages.debian.net/benchmarksgame/, they have many versions for the same language. I would expect every time someone contributes an improvement, we can have a new version. Also we each have a different "taste" for what constitutes "nice code", so making sure we each have a version that we like there would drive the point home I think. See also:

https://github.com/fortran-lang/benchmarks/issues/2#issuecomment-656223298

arunningcroc commented 3 years ago

I think having multiple languages is desirable, in as many languages as possible, so that the reader can compare both the verbosity and performance. For example, in #9 one surprising feature was that the C compiler does a relatively good job of optimizing (probably inlining?) function calls, whereas Fortran suffers from a bigger performance penalty in this case (probably it doesn't inline the function calls?) At any rate, such comparisons are interesting, at least to me. Perhaps we could define some set of languages as "minimal" and strive to have each benchmark have implementations in at least those languages. For example, C/C++, Fortran, Python and Julia could be candidates, since they are all popularly used for scientific computing.

certik commented 3 years ago

@arunningcroc I agree 100%.

rouson commented 3 years ago

Modern languages have diverged to such a great extent in philosophy and design that direct comparisons are exceedingly difficult, especially for comparisons based on performance.

Parallelism

Fortran is an inherently parallel language. For a direct language-to-language comparison, you will have to handicap Fortran by comparing only serial code, leaving a ton of performance on the table even on modest platforms (e.g., an 8-core laptop), which defeats the purpose of a performance comparison. Or you can compare parallel codes, but then the comparison is between Fortran and Language X + Parallel Programming Model Y.

Vectorization, Multithreading, and GPU Offloading

The Fortran committee intended for do concurrent to support such technologies as vectorization, multithreading, and GPU offloading. When the NVIDIA compiler offloads do concurrent to a GPU, the Cray compiler vectorizes or multithreads do concurrent, and gfortran generates SIMD instructions from do concurrent, the compilers aren't just generously giving us a bonus. They are fulfilling the language designers' wishes. I don't know that there are constructs in the other languages that were designed with similar intent. With the other languages, you'll have to insert OpenMP, OpenACC, CUDA or other compiler-specific statements to achieve what you can do with standard Fortran without bolting anything on the side. Then the comparison is really between Fortran and Language X + Compiler Technology Y.

Other implicitly parallel Fortran features include array statements, elemental procedures, and pure procedures invoked inside do concurrent. I'm not at all clear that there are comparable constructs in other languages and I would hope that the features I'm naming would be exploited to maximal effect in anything that would be labeled as idiomatic modern Fortran.

Libraries

Idiomatic C++ and Python typically rely upon external libraries even for such basic things as multidimensional array functionality. A C++ programmer wanting performance portability across a range of heterogeneous hardware architectures, for example, is likely to hand off as much performance-critical computation as possible to a library like kokkos. If you write self-contained C++ without exploiting such libraries, that alone might disqualify the code from being idiomatic C++. If instead you incorporate libraries, then you're comparing Fortran to Language X + Library Y. It seems unlikely that the proposed effort will match the performance of highly optimized libraries.

Generic Programming

I suspect that any modern C++ library is by definition generic (using templates), but Fortran's generic programming features are still under development, making an apples-to-apples comparison impossible without a ton of clunky include statements, conditional compilation, etc.

Bottom Line

Let's all write code in the languages that make us feel most productive while providing acceptable performance. Fortunately, there could be an ultimate convergence as the language developers learn from each other. Fortran 202Y will support generic programming. C++23 will support multidimensional arrays. I'm not sure any of the languages named will ever support parallel programming in the way that Fortran does, however, because too small a sliver of the languages' programmers require scalable parallelism for the languages to add something that so fundamentally changes the language's execution model. Julia is an exception to the latter statement because its designers are targeting high-performance computing.

Lastly, performance analysis and tuning is a challenging research topic in its own right. It's probably not a great idea to wade into this area unless one is taking it on as a subject of research and planning to dive deeply into it.

certik commented 3 years ago

@rouson, @milancurcic I see you are both a little bit reserved about the purpose of the benchmarks repository. It seems your arguments show the dangers of how we can fail. Yes, I agree that those dangers are real. But what do you suggest we do? Do you suggest we abandon the benchmark effort? And if not, what scope do you see for it?

I still think we should benchmark against other languages. We should benchmark in parallel and compare against C++ and Kokkos, among other things. As a user that is exactly what I would like to see.

rouson commented 3 years ago

@certik my "Bottom Line" section accurately summarizes where I stand. I worry that raw performance comparisons between the languages distract from other important considerations such as programmer productivity and fundamental differences in the languages' design philosophy. I also worry that the comparisons could devolve into debates about whether what is written is truly idiomatic in the given languages if the focus is performance. Most importantly, I worry about what happens when the goals of writing high-performing code and idiomatic code diverge. I cringe when I see deeply nested do loops doing what an array statement or elemental procedure call or do concurrent could do, but I also recognize that the nested-do form might be required in performance-critical code with specific compilers. There's a great deal of variation in what the compilers can do to exploit some of the features that make Fortran shine in its ability to express mathematical formulas compactly and clearly.

In addition to all the other features that I've mentioned so far, I would hope that idiomatic Fortran would make extensive use of intrinsic functions, which also can replace multiple lines of custom logic, but the performance of intrinsic functions is another area in which I would expect considerable variation across compilers, compiler versions, and compiler flags. Consider, for example, that some compilers can be directed to call some user-chosen optimized BLAS implementation to support matmul. This brings me to another interesting language difference. There's a proposal to bring BLAS-like functions into the C++ standard to what I suspect is an even greater extent than Fortran currently supports via intrinsic procedures. This too will greatly complicate comparisons and require that whoever is writing the C++ code is up for keeping up with the rapid pace of change in C++, which is already a really big language.

arunningcroc commented 3 years ago

@rouson I think objectively you're quite right that comparisons between the performance of languages frequently make little sense. Nevertheless, people are interested in them, as can be seen from the continued popularity of websites like the benchmarks game, and the recent discussions in the discourse forum surrounding e.g. Julia. Julia, for instance, includes benchmarks quite prominently on its page. I know I certainly sometimes look for benchmarks when evaluating languages, with the full understanding that this can't give me a complete picture.

I don't think a benchmark or any other marker of language performance has to be 100% fair and perfect to be useful or just plain fun to look at. Indeed, I think Julia's performance page says it best:

These micro-benchmarks, while not comprehensive, do test compiler performance on a range of common code patterns, such as function calls, string parsing, sorting, numerical loops, random number generation, recursion, and array operations.

It is important to note that the benchmark codes are not written for absolute maximal performance (the fastest code to compute recursion_fibonacci(20) is the constant literal 6765). Instead, the benchmarks are written to test the performance of identical algorithms and code patterns implemented in each language. For example, the Fibonacci benchmarks all use the same (inefficient) doubly-recursive algorithm, and the pi summation benchmarks use the same for-loop. The “algorithm” for matrix multiplication is to call the most obvious built-in/standard random-number and matmul routines (or to directly call BLAS if the language does not provide a high-level matmul), except where a matmul/BLAS call is not possible (such as in JavaScript).
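For readers unfamiliar with the pattern, the doubly-recursive algorithm mentioned in the quoted passage can be sketched in Python as follows. This is an illustration only and is not copied from Julia's benchmark suite; the function name simply mirrors the `recursion_fibonacci` named in the quote.

```python
def recursion_fibonacci(n):
    """Doubly-recursive Fibonacci, deliberately inefficient.

    Each call spawns two further calls, so the benchmark measures
    function-call overhead rather than the cost of the arithmetic.
    """
    if n < 2:
        return n
    return recursion_fibonacci(n - 1) + recursion_fibonacci(n - 2)


if __name__ == "__main__":
    print(recursion_fibonacci(20))  # prints 6765
```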

rouson commented 3 years ago

@certik If performance comparisons are the primary aim, then I recommend contributing to Jeff Hammond's Parallel Research Kernels (PRK). Parallelism is an essential ingredient in any performance discussion in the multicore/manycore/GPU era. PRK contains Fortran, C++11, Julia, Python, Ruby, UPC, and more. I would love to contribute to refactoring some of the Fortran code to be more idiomatic along the lines of what I wrote above. For example, PRK's Fortran kernels contain 11 different matrix-transpose implementations. In my view, idiomatic Fortran simply calls the transpose intrinsic function unless there is a strong motivation for doing otherwise and that motivation is likely to be compiler-, platform-, and problem-specific. I see that at least one of the PRK transpose implementations calls the transpose intrinsic function. I would argue that most other implementations are not idiomatic Fortran and I'd only use another implementation if I had solid, quantitative evidence that (1) the performance is significantly better for the more verbose transpose and (2) the code in question is the critical bottleneck in the application.

rouson commented 3 years ago

@arunningcroc I have to admit that the whole time I've been writing responses, I've been thinking it would be fun to look at the proposed code. :) From that perspective, it's definitely a valuable exercise. I like the analogy someone made to the result being a Rosetta Stone.

rouson commented 3 years ago

@certik @arunningcroc @milancurcic having now looked at the first example, I feel even more strongly that this effort could do more harm than good. First, I urge you to not call the repository "benchmarks." If you do that, you're diving into a field with a long and controversial history. Consider the following language near the top of the README for the PRK repository, a multi-language comparison effort very similar to yours except for the parallelism:

"These programs should not be used as benchmarks. They are operations to explore features of a hardware platform, but they do not define fixed problems that can be used to rank systems. Furthermore they have not been optimized for the features of any particular system."

You could adopt similar language, replacing "hardware platform" with languages.

rouson commented 3 years ago

Second, @certik, I strongly disagree with the idea of launching a new project to write "code that you would reasonably write as a domain expert, say a physicist" unless you're going to have comparison code to demonstrate much more modern practices. The majority of domain experts are writing a narrow subset of a 31-year-old version of Fortran. As mentioned above, I cringe every time I see nested do loops doing what an array statement could do as I suspect is the case twice in just the 55-line Poisson solver. I call such code Cortran: it's the Fortran program that a C programmer would write because they think they have to loop over all the elements of an array just to initialize the array.

Moreover, the program uses fixed-size arrays despite allocatable arrays being one of the features that most makes Fortran shine -- not just because of dynamic memory allocation but also (especially) because of automatic deallocation. Let's hope no domain expert is still using fixed-size arrays!

Even the "optimized.f90" Poisson solver appears at first glance to be standard-conforming Fortran 90. In fact, with some reformatting, much of it would be standard-conforming Fortran 77. If this effort moves forward, I hope that every code will have a "modern.f90" comparison that separates interfaces from implementations, uses array statements and intrinsic procedures wherever possible, and uses pure and elemental procedures wherever possible. Every feature that I just cited was in Fortran 95 so this shouldn't even be considered a big step in 2021, but keep in mind that each of these features has potential performance implications and most of them should be positive performance implications with a sufficiently advanced compiler. There are numerous optimizations that pure facilitates, for example. A great service to the community would be to explore the performance implications of a much more modern approach and contrast it with more archaic approaches even just within Fortran. I would find that a lot more useful than the comparisons to other languages.

Moving into the 21st century, I would hope that any effort to write a substantial amount of new code would decompose the problem into procedures and then separate the procedure interfaces (in modules) from procedure definitions (in submodules), and use Fortran's facilities for parallelism and concurrency. With the exception of submodules, every feature I've named has potential runtime performance implications -- usually positive implications with a sufficiently advanced compiler -- so using these features fits perfectly with the goals of the repository. And even submodules have potentially positive compile-time performance implications, which might matter less for small kernels, but having procedure interfaces sure makes for a nice introduction to the high-level goals of different parts of the program as expressed.

certik commented 3 years ago

@rouson, thanks for the feedback. I think your concerns can be alleviated:

Idiomatic vs most performing: I would like to have both. I expect to have 10 different Fortran versions for a given problem. Some of them will get the top performance, showing that it can be done; but perhaps it would not be code that I would like to actually use. Not idiomatic. Also: I would like compilers to improve performance of the "idiomatic" code. So just the fact that today some version is the fastest does not mean much: as compilers improve, this will change.
Parallel vs serial: I want both, serial performance is important and easier to reason about. But parallel performance is ultimately what matters for a lot of HPC codes (depending on how it is parallelized).
Jeff Hammond's Parallel Research Kernels (PRK): yes, the PRK is a significant subset of what I would like to see, but I see the scope of this repository much bigger than that.

Whether we like it or not, people will keep doing such comparisons and posting online. Such comparisons influence people's choices. I know from personal interaction that people watch the "benchmarks game" site.

We don't have to call the repository "benchmarks". We can call it "Rosetta". Or we can call it "how to solve a given problem in Fortran and other languages", so that people can learn what the options are: how to write idiomatic code, how to write the "simplest" code. How to write high performing code. And what is (currently) the top performing code+compiler+platform combination.

I think the harm comes from drawing (wrong) conclusions. But having codes that solve a problem can't harm. In fact, we have different opinions on what "idiomatic" means in Fortran. For example I do not like object oriented for numerics. I know others do. So we should have both. A third person comes and says "I don't like either of these!", so they can write a third version that they think is the best and we should include it too.

Then we should have automatic tooling that can compile all such codes with different compilers and compiler options and time them. We should present it in a nice way, so that you can find the code version that you personally like the most, and you can see how it stacks up using current compilers. Then we should have a conversation about how to improve compilers (or whether that is even possible for a given version). If it is not possible, then maybe that should not be the "idiomatic" way to write such code. And so on.
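As a rough sketch of what the first step of such tooling could look like, the Python snippet below enumerates one build command per compiler and flag combination. The source file names (`poisson.f90`, `poisson.c`), the output names, and the choice of flag sets are assumptions made for illustration; nothing here reflects tooling that exists in the repository. The sketch only builds the command lines and does not run them.

```python
import itertools
import shlex

# Assumed compiler/source pairs and flag sets; adjust to what is installed.
COMPILERS = {"gfortran": "poisson.f90", "gcc": "poisson.c"}
FLAG_SETS = ["-O2", "-O3", "-Ofast"]


def build_commands(compilers=COMPILERS, flag_sets=FLAG_SETS):
    """Return one (label, argv) pair per compiler/flag combination."""
    commands = []
    for (compiler, source), flags in itertools.product(compilers.items(), flag_sets):
        label = f"{compiler}{flags}"  # e.g. "gfortran-O2"
        argv = [compiler, *shlex.split(flags), source, "-o", f"poisson_{label}"]
        commands.append((label, argv))
    return commands


if __name__ == "__main__":
    for label, argv in build_commands():
        # Only print the commands; a real runner would pass argv to
        # subprocess.run and then time the resulting executable.
        print(label, "->", " ".join(argv))
```

Keeping the label next to each command makes it straightforward to report timings per compiler and flag set later.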

certik commented 3 years ago

Regarding your comment at https://github.com/fortran-lang/benchmarks/issues/10#issuecomment-869787773, what you described should absolutely be one version of a Fortran code that solves a given problem. Depending on the problem, I might agree or I might not, I would have to see the code. If I do not agree, then I can submit another version. I expect we will have 10 versions easily. Then we can see them side by side, see how they perform, see how easy they are to read, to maintain, etc.

A physicist can absolutely learn how to program in modern style. That is what I meant. But it needs to be simple to learn.

rouson commented 3 years ago

@certik I agree with everything you wrote and if I can do it quickly, I'll contribute one modern.f90 companion to the poisson2d subdirectory. It might also be nice to have a poisson3d version to nudge things a bit closer to the kind of problem likely to appear in applications and the kind of problem for which performance matters more.

While we're at it, I wonder if there should be Matlab versions or some open-source Matlab equivalent such as GNU Octave. A surprising amount of real science happens in Matlab and the performance differences can be even more significant relative to Fortran than with the other languages mentioned so far.

certik commented 3 years ago

@rouson yes, I am thinking the initial set of languages could be Fortran, C++, Python/NumPy, Julia and Matlab/Octave.

certik commented 3 years ago

I opened #22 just for the name change. Thanks @rouson for all the feedback. I feel good about this. I also opened tons of new issues (#12 - #21) for some ideas for benchmarks that we can add.

arunningcroc commented 3 years ago

@rouson Just for the record, many versions of the Fortran codes were proposed in the Discourse topic, and I only included the ones that ran fastest on my machine. That includes a vectorized version. As for allocatable arrays, I don't really understand why that would be more modern. I'm after all dealing with a fixed size calculation. Any pointers on that? Anyway, I hope you contribute a modern.f90 version as well, and if you do, I can also run that on my machine with the same settings, then we can get the timings on it.

I also welcome the name change, but I'm dubious that any code we post just to compare similar algorithms in different languages could really do a lot of harm. We could call the repository "the instructions for world domination", and it would not change the content one bit. I think some trust in the intelligence of the reader is warranted here.

certik commented 3 years ago

@arunningcroc I personally like fixed size arrays as well, as they are also automatically deallocated (just like allocatable arrays) and for a simple example like you did I think they are a perfect fit. But as I mentioned in https://github.com/fortran-lang/benchmarks/issues/22#issuecomment-870639621, I don't want to argue now about what the "modern idiomatic" style is. For now I just want to have all approaches in, and we should have that discussion later.

awvwgk commented 3 years ago

I followed the discussion here and related threads a bit and I'm somewhat disappointed about the overall tone. I know that coding style can be a somewhat loaded topic, but let's not judge each other by the way we write small code examples, please.

Let's keep this a place for respectful collaboration such that we all can enjoy working on this project together.

rouson commented 3 years ago

@awvwgk apologies for any judgement. I do worry that a large part of what turns so many people away from Fortran is the older Fortran that they've seen. If we're still writing code that is effectively Fortran 77 plus a tiny subset of Fortran 90, it's going to be very hard to attract new people to the language. Moreover, it's worth noting precisely what some of the older constructs communicate to the reader and to the compiler. A do loop, for example, tells the reader and the compiler that a certain set of steps are to be done in a certain order. An optimizing compiler will nonetheless do its best to determine whether that order really matters and will violate the order in the name of performance. Likewise, a patient reader will step through the logic mentally and often figure out that the order doesn't matter, but that's extra work and extra screen real estate. I would still argue that we can do better by the compiler and by the reader and that there is inherent value in not serializing operations unnecessarily. I hope this way of communicating it is more factual than judgmental.

milancurcic commented 3 years ago

I am not against benchmarking implementations in different languages. My reservation is the same as with any other Fortran-lang project (stdlib, fpm, website etc.): Let's build things mindfully and with intention rather than just throwing things in there and seeing what happens.

Now, I understand and recognize that nobody here argued that we should just throw things in and see what happens. But, I also haven't seen a clear goal on what exactly we want to compare between the languages. Take for example #9: There, I'd like to have seen documented (in the README perhaps):

What is being measured and compared, exactly? My interpretation of that PR is that the goal is to compare the ability of compilers and interpreters to generate performant code from two variants of source code.
If so, are we sure that the sources between different languages are semantically equivalent? Can we say the same about the run-time libraries used? If they're different, we should document that and say that we're comparing the performance of idiomatic code that is not supposed to be semantically equivalent.
How were different programs compiled? Fortran and C versions are instructed to be compiled with -Ofast, but was NumPy compiled with -Ofast for the posted Python results? If NumPy was compiled with -O2, it may be more meaningful to use the same for Fortran and C versions.

My point is, if you ask and answer the question what exactly you're trying to compare, you'll have a better chance to make a meaningful (fair) comparison.

Ultimately, I want to avoid a benchmarks repository where Fortran implementations are inadvertently fine-tuned to demonstrate or even imply language superiority. As this is a common criticism of other benchmarks like those on the Julia website and some recent blog posts, I expect that we wouldn't want to repeat the same, with tables turned. All of us here are responsible for preventing language wars from happening.

certik commented 3 years ago

It's a chicken and egg problem: we can't design and show what we are trying to do without first having a few benchmarks in, but we can't get the few benchmarks in because we do not have criteria to judge them.

I think we all understand the dangers of mindlessly benchmarking. Also nobody wants fine tuned benchmarks here (as the only thing or the main thing). I think we have all explained enough what we do not want. So let's now discuss what we want.

I proposed a vision in #22.

That vision answers your questions:

What is being measured and compared, exactly? My interpretation of that PR is that the goal is to compare the ability of compilers and interpreters to generate performant code from two variants of source code.

Nothing is being directly measured. This is an example ("idiom") of how to solve a 2D Poisson equation with certain boundary conditions using a first-order finite difference scheme and relaxation method. We already have two such examples contributed. I would like to see even more. Yes, we are interested in timings and benchmarks for this too, as one of the many other criteria, such as readability, and how hard it is to write.

If so, are we sure that the sources between different languages are semantically equivalent? Can we say the same about the run-time libraries used? If they're different, we should document that and say that we're comparing the performance of idiomatic code that is not supposed to be semantically equivalent.

As proposed in #11, we need tests to ensure any submitted example / idiom returns exactly the same answer. Regarding libraries to use, there will be 10 codes in C++ let's say, so some can use other libraries, some might only use "pure C++". We can look at timings and other pros and cons and compare, and everybody can form their own opinion on which one is better.
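A test of the kind proposed in #11 could be sketched as below: two differently written variants of the same relaxation update are run and their results compared. Everything specific here is an assumption for illustration, not taken from #9 or #11: the Jacobi update, the zero boundary values, the constant source term `rho`, the sign convention (the discrete form of ∇²φ = -ρ), and the function names. The comparison uses a tolerance rather than bit-for-bit equality, since results across languages and compiler flags may differ in the last digits.

```python
import math


def jacobi_loops(m, sweeps, rho=1.0, h=1.0):
    """Jacobi relaxation on an m x m grid, written with explicit loops."""
    phi = [[0.0] * m for _ in range(m)]
    for _ in range(sweeps):
        new = [row[:] for row in phi]  # boundary rows/columns stay zero
        for i in range(1, m - 1):
            for j in range(1, m - 1):
                new[i][j] = 0.25 * (phi[i + 1][j] + phi[i - 1][j]
                                    + phi[i][j + 1] + phi[i][j - 1]
                                    + h * h * rho)
        phi = new
    return phi


def jacobi_rows(m, sweeps, rho=1.0, h=1.0):
    """The same update, written row by row with comprehensions."""
    phi = [[0.0] * m for _ in range(m)]
    for _ in range(sweeps):
        interior = [
            [0.0]
            + [0.25 * (phi[i + 1][j] + phi[i - 1][j]
                       + phi[i][j + 1] + phi[i][j - 1] + h * h * rho)
               for j in range(1, m - 1)]
            + [0.0]
            for i in range(1, m - 1)
        ]
        phi = [[0.0] * m] + interior + [[0.0] * m]
    return phi


def same_answer(a, b, rel_tol=1e-12, abs_tol=1e-12):
    """True if two grids have the same shape and agree within tolerance."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


if __name__ == "__main__":
    print(same_answer(jacobi_loops(16, 50), jacobi_rows(16, 50)))
```

For implementations in different languages, the same check could be applied to grids written to a file by each program, with the tolerance chosen per problem.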

How were different programs compiled? Fortran and C versions are instructed to be compiled with -Ofast, but was NumPy compiled with -Ofast for the posted Python results? If NumPy was compiled with -O2, it may be more meaningful to use the same for Fortran and C versions.

With as many compilers / options that our infrastructure allows. NumPy can be installed using Conda (probably how a lot of people would install it), so a version number should be enough to identify. This has been discussed previously: https://github.com/fortran-lang/benchmarks/issues/2#issuecomment-656232163

@milancurcic and others let me know if you agree / disagree with the vision I presented. If it is too early to tell, then let's simply at least try, and if it is not going in a direction we want, we can always remove this repository from fortran-lang later. If you have a different vision for what this particular effort could become, then please share it.

milancurcic commented 3 years ago

I proposed a vision in #22.

The vision is great, I like it a lot. In this issue we're discussing the last bullet point specifically.

It's a chicken and egg problem: we can't design and show what we are trying to do without first having a few benchmarks in, but we can't get the few benchmarks in because we do not have criteria to judge them.

Why can't we? I very much think we can. We just need to ask the question.

For the 2-D Poisson problem, for example, the question could be: How fast are the executables produced by Fortran and C compilers given idiomatic (no matter how we define this) and semantically equivalent Fortran and C code?

Is the above not a meaningful, interesting, and simple enough question to ask? I think it is, and it's something we can measure.

Then, if we like the question and agree to start there, contribute the Fortran and C implementations like those in #9. Meanwhile, ensure they're both correct and produce the same results (#11). Then, we can look at the timings. Now we have a minimal framework to expand upon, and add other languages to the mix. (I suggested Fortran and C to start because they have companion compilers, so it's likely easier to make a meaningful benchmark).

NumPy can be installed using Conda (probably how a lot of people would install it), so a version number should be enough to identify.

This is not what I meant. I meant: If NumPy is not compiled with -Ofast, don't compile Fortran and C code with -Ofast. If we established the minimal framework I suggested above, we wouldn't have this discrepancy.

I'm happy that the discourse in this thread is shifting from "we're doing benchmarks" to "we're doing idioms, and maybe we'll do some benchmarks", but it doesn't make for a fair discourse because I originally asked "Should we benchmark other languages", and not "Should we include other languages at all".

milancurcic commented 3 years ago

What I proposed above as the "minimal framework" or "the first question" is just my idea of where to start, I'm not sure that it's the best way to go and I don't have a lot of experience in that area.

Is there interest in discussing what would be the minimal framework to start with? In other words, what to compare and measure? I would like that.

Or would you prefer to just focus on idioms (source code) for now, and not worry about benchmarks until later?

milancurcic commented 3 years ago

I also don't think of allocatable and static arrays as modern or archaic. They're just different. When I don't need an allocatable array, I don't use it.

However, we should plan ahead: to facilitate easier testing and timing, the problem size, like M in #9, should be a CLI argument, and this will require the arrays to be allocatable at run time.
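To make the idea concrete, here is a small Python sketch of a driver that takes the problem size from the command line and sizes its grid at run time. The names `parse_size` and `allocate_grid` and the default of 100 are assumptions for illustration only. In Fortran, the analogous steps would be reading the argument with the `get_command_argument` intrinsic and then using `allocate`.

```python
import argparse


def parse_size(argv=None):
    """Parse the problem size M from a list of command-line arguments."""
    parser = argparse.ArgumentParser(description="Toy benchmark driver")
    parser.add_argument("M", type=int, nargs="?", default=100,
                        help="grid size along each dimension (default: 100)")
    args = parser.parse_args(argv)
    if args.M < 1:
        parser.error("M must be a positive integer")
    return args.M


def allocate_grid(m):
    """Allocate an m-by-m grid of zeros, sized at run time."""
    return [[0.0] * m for _ in range(m)]


# Passing the arguments explicitly keeps the example reproducible.
m = parse_size(["64"])
grid = allocate_grid(m)
print(len(grid), len(grid[0]))
```

With the size as an argument, the same executable can be reused across a sweep of problem sizes without recompiling.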

certik commented 3 years ago

Thanks @milancurcic, good points. Yes, if we can constructively figure out criteria, then by all means let's do that. I like what you started with the question. My comment would be that I think I want to answer more questions than just Fortran and C.

All I was saying is that I don't want this effort to die just because we do not have well fleshed out criteria.

The same with benchmarks. I very much want to focus on benchmarking. However, the objection was that only focusing on benchmarking would be harmful, so I am willing to let benchmarking go for now, and focus on idioms and code, and worry about benchmarking later.

In terms of what I care about long term, I would like to have the various maintained codes (in different languages) that solve the given problem (and where each of us can find that one version that we personally really like), and that could be benchmarked, and the infrastructure that allows this. For example, I want to have a NumPy version (or versions) that work. So that once I get to benchmarking, I don't have to worry about writing or debugging the codes, I already have them, and can concentrate on actually getting meaningful benchmarks out of it (as you said, then things should perhaps be compiled with the same options, although not in all cases, such as if you just want to compare the "default experience").

milancurcic commented 3 years ago

Great! I'd actually like benchmarks to move forward together with idioms, but with all these cautions that we discussed in mind. We have a good start with the 2-D Poisson, and I'd like us to work on it and polish it--I think mainly document it, ensure it's correct, and that it clearly conveys a message that we want to send. I think we'll have a minimal framework then to re-use for other problems.

milancurcic commented 3 years ago

I absolutely do not intend to kill this effort :).

Beliavsky commented 3 years ago

A source of cross-language benchmarks is the Benchmarks section of the "Fortran code on GitHub" list I am maintaining.

rouson commented 3 years ago

On Tue, Jun 29, 2021 at 6:49 PM Milan Curcic wrote:

For the 2-D Poisson problem, for example, the question could be: How fast are the executables produced by Fortran and C compilers given idiomatic (no matter how we define this) and semantically equivalent Fortran and C code?

Based on my extremely limited experience with performance analysis, I suspect this is a much, much deeper, thornier issue than one might immediately think and it's one that's likely to unnecessarily step on toes and lead to unproductive discussions. I'm not just saying this hypothetically. Without going into specifics, I'm watching these problems happen in real-time elsewhere right now. In that case, there exist some reasonably widely known codes that could easily be misinterpreted as answering exactly the question you're posing and yet experts in one of the languages involved can immediately identify fundamental problems in the way that language's code is written that severely disadvantage it relative to the other languages. That's harmful. The harm wouldn't be a big deal if it weren't for the facts that (1) the people involved do not have the freedom to chase after every such occurrence and volunteer time to fix the problem because they have jobs with high-stakes deliverables, (2) the codes in question live for years with no one fixing the issues, and (3) anyone who naively encounters and runs the codes will reach conclusions that incorrectly shed a bad light on other people's life work in developing the languages and compilers and such.

Even setting aside the need for deep expertise in the language(s) in question, there are so many issues related to the interplay between problem size, hardware architecture (cache size, memory bandwidth, etc.), compiler choice, compiler version, compiler options, algorithm choice, etc. Computing enough cases to broadly range across the options along any one of the variables just named will take a considerable amount of time and still leave a high-dimensional parameter space to judiciously explore along all of the other axes. I recommend staying away from drawing any performance conclusions that apply beyond one person's particular choice for each of the aforementioned factors, and I doubt that my list was exhaustive. Moreover, most compilers these days lower any given language to a language-independent intermediate representation, though language semantics (aliasing rules, for example) still shape what reaches that representation, so differences between languages do not simply vanish at that point.
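To give a feel for how quickly that parameter space grows, here is a small Python sketch that enumerates every combination of a few axes. The axis names and values are hypothetical placeholders, not a proposed test matrix.

```python
import itertools

# Hypothetical axis values, chosen only for illustration.
axes = {
    "problem_size": [64, 256, 1024, 4096],
    "compiler": ["gfortran", "ifort", "nvfortran"],
    "opt_flags": ["-O0", "-O2", "-O3", "-Ofast"],
    "hardware": ["laptop", "workstation", "cluster node"],
}


def configurations(axes):
    """Yield one dict per point in the Cartesian product of all axes."""
    names = list(axes)
    for values in itertools.product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


total = sum(1 for _ in configurations(axes))
print(total)  # 4 * 3 * 4 * 3 = 144 configurations
```

Even this small matrix leaves out compiler version and algorithm choice, and each configuration would still need repeated runs to get stable timings.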

And as I have pointed out at length, there are fundamental differences in the design of the languages that prevent such comparisons. Most of the languages we're discussing have no semantic equivalent to coarrays so, at best, you'll have to leave out one of the most important factors in performance (parallelization) to do any language-to-language performance comparison. That defeats the purpose.

I think the best purpose of this repository is to serve as a sort of Rosetta Stone for translating across various languages and across various idioms within a language.

Damian

milancurcic commented 3 years ago

Thanks @rouson, this is exactly the kind of perspective that we need. I didn't have sufficient foresight to recognize these issues when posing my question. I agree with pursuing only idioms and not benchmarks until we convince ourselves otherwise.

certik commented 3 years ago

I agree with pursuing only idioms and not benchmarks until we convince ourselves otherwise.

Thank you. Looks like we are in agreement now. I still want the timings though! Just not presented as benchmarks initially.
