exercism / cpp

Exercism exercises in C++.
https://exercism.org/tracks/cpp
MIT License

Add parallel-letter-frequency #800

Closed by ahans 4 months ago

ahans commented 5 months ago

Here's a first version of this exercise to get the discussion started.

The C++-specific parts of the description are still missing. I wanted to hold off on writing those until we've reached consensus on the approach.

Since C++17 introduced parallel versions of the STL algorithms, I think using those is the most elegant way of solving this exercise. However, to make that work with GCC, TBB needs to be present (without TBB it will just fall back to sequential execution). I added the corresponding find_package line to CMakeLists.txt, but I'm not sure whether that is an acceptable prerequisite here. Using std::thread manually would be the obvious alternative. However, due to the overhead of thread creation, that would only pay off for huge texts, and in practice one would probably rely on a thread pool rather than creating new threads on each call to the function.
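For illustration, here is a minimal sketch of what such a solution could look like; the `frequencies` function name and the `std::vector<std::string_view>` interface are assumptions made for this sketch, not necessarily what the exercise stub will use:

```cpp
#include <cctype>
#include <execution>
#include <map>
#include <numeric>
#include <string_view>
#include <vector>

namespace parallel_letter_frequency {

using frequency_map = std::map<char, int>;

// Count the letters of a single text (letters only, case-insensitive).
frequency_map count_one(std::string_view text) {
    frequency_map counts;
    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            ++counts[static_cast<char>(std::tolower(c))];
        }
    }
    return counts;
}

// Merge two partial counts; associative and commutative, as the parallel
// reduction requires.
frequency_map merge(frequency_map lhs, frequency_map const& rhs) {
    for (auto const& [letter, count] : rhs) {
        lhs[letter] += count;
    }
    return lhs;
}

// Map each text to its own frequency map in parallel, then reduce the results.
frequency_map frequencies(std::vector<std::string_view> const& texts) {
    return std::transform_reduce(std::execution::par, texts.begin(), texts.end(),
                                 frequency_map{}, merge, count_one);
}

}  // namespace parallel_letter_frequency
```

As noted above, with GCC this only actually runs in parallel when TBB is available; otherwise the execution policy falls back to sequential execution.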

I also included a Catch2 benchmark that can be turned on with a define. I have seen Google Benchmark used elsewhere. I could switch the benchmark to use that, but since we already have Catch2 available anyway, I think it makes sense to use its built-in benchmarking tool.
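As a rough sketch of what such an opt-in benchmark could look like with Catch2 v3: the `EXERCISM_RUN_BENCHMARKS` define and the `make_texts` helper are invented for illustration here, and the call goes to the hypothetical `frequencies` function from the sketch above.

```cpp
#if defined(EXERCISM_RUN_BENCHMARKS)  // hypothetical opt-in define

#include <catch2/benchmark/catch_benchmark.hpp>
#include <catch2/catch_test_macros.hpp>

#include <cstddef>
#include <random>
#include <string>
#include <string_view>
#include <vector>

// Generate a few pseudo-random lowercase texts as benchmark input.
static std::vector<std::string> make_texts(std::size_t count, std::size_t length) {
    std::mt19937 gen{42};
    std::uniform_int_distribution<int> dist{'a', 'z'};
    std::vector<std::string> texts(count);
    for (auto& text : texts) {
        text.resize(length);
        for (auto& c : text) {
            c = static_cast<char>(dist(gen));
        }
    }
    return texts;
}

TEST_CASE("letter frequency benchmark", "[benchmark]") {
    auto const texts = make_texts(100, 10 * 1024);
    std::vector<std::string_view> const views(texts.begin(), texts.end());

    // Returning the result keeps the call from being optimized away.
    BENCHMARK("parallel frequencies") {
        return parallel_letter_frequency::frequencies(views);
    };
}

#endif  // EXERCISM_RUN_BENCHMARKS
```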

I also thought about how to automatically check that things actually do run in parallel. However, I think there's no robust (let alone portable) way of doing that. So, just like the other tracks, we'd have to trust that the student implements something that is actually parallel. The unit tests can only check for correct results.

Let me know what you think!

github-actions[bot] commented 5 months ago

Hello. Thanks for opening a PR on Exercism 🙂

We ask that all changes to Exercism are discussed on our Community Forum before being opened on GitHub. To enforce this, we automatically close all PRs that are submitted. That doesn't mean your PR is rejected but that we want the initial discussion about it to happen on our forum where a wide range of key contributors across the Exercism ecosystem can weigh in.

You can use this link to copy this into a new topic on the forum. If we decide the PR is appropriate, we'll reopen it and continue with it, so please don't delete your local branch.

If you're interested in learning more about this auto-responder, please read this blog post.


Note: If this PR has been pre-approved, please link back to this PR on the forum thread and a maintainer or staff member will reopen it.

vaeng commented 5 months ago

Hey there,

thank you for the PR. Lots of good ideas.

You raise a lot of valid points. I think this would be best covered in an approach and an article. We have 20 seconds per execution on the test-runner. This includes spinning up the Docker container, compiling the code and the tests, and converting the results to JSON.

I would rather not run benchmarks in that timeframe. Currently, with test runner v3 there is no interface to display timing information back to the user.

All tests use Catch2 v3 on the test-runner. So if we were to switch to GTest or something else, we would have to change the test-runner, which I would rather not do for a single exercise.

As with many other exercises, we cannot force students to do a "correct" implementation if the incorrect (non-parallel) approach yields the same results for the unit tests.

ahans commented 5 months ago

> You raise a lot of valid points. I think this would be best covered in an approach and an article. We have 20 seconds per execution on the test-runner. This includes spinning up the Docker container, compiling the code and the tests, and converting the results to JSON.

I didn't mean to run the benchmark as part of the normal online evaluation. It's supposed to be something a student can use offline. The same approach is followed on (I think) the Rust track. The question was more about which benchmarking framework we want to use. I would prefer Catch2's, given that Catch2 is already in place anyway, but I have seen Google Benchmark being used as well, so I can imagine that for consistency we'd want to go with that.

The other question is about TBB. Would it be acceptable to include it in the runner image? The approach based on the STL algorithms would also work without it, but then the runner would definitely execute things sequentially rather than in parallel. We could also make the requirement optional and explain in an article/approach what students would have to do to run it locally with parallelism enabled.

ahans commented 4 months ago

Some guidance on how to continue here would be appreciated!

vaeng commented 4 months ago

I made a test build and modified the test-runner to use TBB. It is just 3 MB more than the current runner image, so I think it is okay to "upgrade".

Apart from the test-runner, I would also need to update the GitHub Actions to accept TBB solutions.

We would definitely need a [.docs/introduction.append.md file](https://exercism.org/docs/building/tracks/practice-exercises#h-documentation-files) to explain the routine for the benchmark and also how to install TBB on Ubuntu, macOS and Windows, and maybe how to deactivate that part of the testing locally if that is somehow not possible.

Can you add such a file?

ahans commented 4 months ago

I think I addressed most comments. The introduction.append.md file still needs to be filled; I can certainly add something there.

Regarding the benchmark, I changed a few more things around:

I made TBB optional now. Things work without TBB available; it's just that execution won't be parallel then. I also pulled that into the GCC/Clang block. I believe MSVC uses its own implementation and doesn't require TBB. I have a Windows machine with MSVC available that I can use to verify this.

ahans commented 4 months ago

> I made a test build and modified the test-runner to use TBB. It is just 3 MB more than the current runner image, so I think it is okay to "upgrade".

Will you open a separate PR for those or do you want me to do it? I suppose that would go into the cpp-test-runner repo?

Now that TBB is optional, I think it'd also be fine to just skip that. The test runner doesn't need to actually run things in parallel, IMHO.

> Apart from the test-runner, I would also need to update the GitHub Actions to accept TBB solutions.

Same question as above. With TBB optional, do we need that?

One argument could be that we also want to allow students to use TBB directly. I'm not sure about that. For the C++17 parallel algorithms, TBB's presence is an implementation detail (and probably not even on all platforms), so that alone would not be a good reason to expose it. Would we want to give students more than what standard C++ has to offer?

ahans commented 4 months ago

> We would definitely need a [.docs/introduction.append.md file](https://exercism.org/docs/building/tracks/practice-exercises#h-documentation-files) to explain the routine for the benchmark and also how to install TBB on Ubuntu, macOS and Windows, and maybe how to deactivate that part of the testing locally if that is somehow not possible.
>
> Can you add such a file?

Looking at the linked documentation, I think you meant instructions.append.md, right?

vaeng commented 4 months ago

I like the optional route you have taken. This way we can keep everything as it is, but might still add TBB to the test-runner in the future.

We should make use of the append file to explain the reasons for TBB and how it can be used in this context. This is also the right file to explain the benchmarking part and how to activate it.

ahans commented 4 months ago

> We should make use of the append file to explain the reasons for TBB and how it can be used in this context. This is also the right file to explain the benchmarking part and how to activate it.

Sorry for the long silence.

I added that file now. I also reduced the size of the texts for the benchmark; when testing under Windows it was running for quite a long time, and 10 KiB texts also show a speedup.

Let me know if there's more you'd like changed.

ahans commented 4 months ago

> Every new sentence should be on a new line in the append document.

I updated the document accordingly. But I wonder, what's the rationale for this?

vaeng commented 4 months ago

> > Every new sentence should be on a new line in the append document.
>
> I updated the document accordingly. But I wonder, what's the rationale for this?

https://exercism.org/docs/building/markdown/markdown#h-one-sentence-per-line

ahans commented 4 months ago

> > > Every new sentence should be on a new line in the append document.
> >
> > I updated the document accordingly. But I wonder, what's the rationale for this?
>
> https://exercism.org/docs/building/markdown/markdown#h-one-sentence-per-line

Interesting points, thanks! I actually had line breaks between sentences. I think that doesn't conflict with the points in your link and makes it easier to read in a plain editor that doesn't render Markdown. But it's apparently not done in the other Markdown documents around here, so now mine also has a strict one-to-one relationship between sentences and lines.

ahans commented 4 months ago

> Thanks for the changes.
>
> What can we do about the macOS issue?

I only noticed this just now. I guess #ifdef-ing it out is the only reasonable way here; in the doc I even say that Apple Clang doesn't support it. Looking at that table, I also realize that what I wrote about GCC/Clang may only be partly true: it looks like TBB is used by libstdc++, but not by libc++, where parallel algorithms are still an experimental feature. I will try to address the CI build issue as well as update the doc.
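For reference, a rough sketch of that #ifdef approach, assuming a hypothetical `EXERCISM_HAS_PARALLEL_STL` macro that the build system would define only on toolchains where the parallel algorithms are usable (libstdc++ with TBB, MSVC's STL); `count_one` and `merge` are the helpers from the earlier sketch:

```cpp
#include <map>
#include <numeric>
#include <string_view>
#include <vector>

// EXERCISM_HAS_PARALLEL_STL is a hypothetical macro the build system would set
// (e.g. via target_compile_definitions) where <execution> is usable.
// On Apple Clang / libc++ it would stay undefined.
#if defined(EXERCISM_HAS_PARALLEL_STL)
#include <execution>
#endif

namespace parallel_letter_frequency {

using frequency_map = std::map<char, int>;

// count_one and merge as defined in the earlier sketch.
frequency_map count_one(std::string_view text);
frequency_map merge(frequency_map lhs, frequency_map const& rhs);

frequency_map frequencies(std::vector<std::string_view> const& texts) {
#if defined(EXERCISM_HAS_PARALLEL_STL)
    // Parallel path.
    return std::transform_reduce(std::execution::par, texts.begin(), texts.end(),
                                 frequency_map{}, merge, count_one);
#else
    // Sequential fallback: same algorithm, just without an execution policy.
    return std::transform_reduce(texts.begin(), texts.end(),
                                 frequency_map{}, merge, count_one);
#endif
}

}  // namespace parallel_letter_frequency
```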