JacquesCarette / Drasil

Generate all the things (focusing on research software)
https://jacquescarette.github.io/Drasil
BSD 2-Clause "Simplified" License
141 stars 26 forks source link

Implementing `setUnion` in C++. #3970

Open Xinlu-Y opened 3 weeks ago

Xinlu-Y commented 3 weeks ago

https://github.com/JacquesCarette/Drasil/blob/42e4ca133694e80256d42ee388aa19cfaa17cbfb/code/drasil-gool/lib/Drasil/GOOL/LanguageRenderer/CppRenderer.hs#L1390-L1394 At the moment, setUnion is not implemented in C++ yet. The standard way to compute a set union is via std::set_union, but this function does not modify sets in-place. Instead, it computes the union of two sets and stores the result in a third set.

Is there a recommended way in C++ to perform an in-place set union similar to Julia's union! or Java's addAll? I found that using the insert method (e.g., a.insert(b.begin(), b.end());) can achieve an in-place union by adding elements from one set to another directly. To implement this, I need to iterate over the second set and insert each element into the first set. How can I achieve this iteration and insertion using a loop, such as ForEach, to go from b.begin() to b.end() and add the elements into set a?

JacquesCarette commented 2 weeks ago

I don't think it is a strong requirement to do set union in-place. I think std::set_union is just fine, and can be replaced at a later time with more imperative code if the need truly arises.

balacij commented 2 weeks ago

Aside: out of curiosity, why does the definition of contains refer to something called containsInt? Are we currently only able to handle membership checking for integers?

C++ std::set_union has an interesting definition. Some notes:

  1. it expects that we give it two iterators for each set: one iterator starting from the first element (i.e., begin), and another that starts just past the last element (i.e., end),
  2. the return value is an output iterator that points an imaginary element past the last element in the new set,
  3. it expects an 'output iterator' (i.e., the data writer). I assume we'll need to use the same one for a set, specifically a std::inserter because that one preserves order of the set (note: sets in C++ are ordered). However, in order for us build an inserter, we need references to the set that we're inserting into. So, it looks like we need to have a variable pre-defined for each set union, so that we can insert into it. (Am I getting that right, @Xinlu-Y ?)

Some comments on each:

  1. The syntax is going to be noisy.
  2. We will struggle to generate code for expressions like $e \in (P \cup Q)$ because the hypothetical 'return' of $P \cup Q$ (as translated to set_union) will not be the set itself, but an iterator that points to one past the last element in the set.
  3. Again, the syntax is going to be noisy.

I think that (2) is a fundamental conflict with how we generate code. To generate code for $e \in (P \cup Q)$, we would need to move the entirety of $P \cup Q$ into an earlier expression that creates a temporary result set, performs the set union, and then we can continue to $e \in s$. It would mean that for each set union operation we have in an expression, we need to create a variable above wherever that expression is. That does not quite seem human-readable/traceable anymore. So, maybe we can write a little function that helps us polyfill the function with the right syntax we want? e.g.,

template <typename T>
std::set<T> union_sets(const std::set<T>& s1, const std::set<T>& s2) {
    std::set<T> result;
    std::set_union(s1.begin(), s1.end(), s2.begin(), s2.end(),
                   std::inserter(result, result.begin()));
    return result;
}

This function will allow us to get rid of some of the syntactic noise of the generated code. Furthermore, I noticed that we we use iterators for set containment as well (i.e., not very human-readable for our purposes), I think we can do the same for that as well, e.g.,:

template <typename T>
bool set_contains(const std::set<T>& s, T e) {
    return s.find(e) != s.end();
}

Wrapping everything together into a complete snippet:

#include <iostream>
#include <set>
#include <algorithm>
#include <iterator>

template <typename T>
std::set<T> union_sets(const std::set<T>& s1, const std::set<T>& s2) {
    std::set<T> result;
    std::set_union(s1.begin(), s1.end(), s2.begin(), s2.end(),
                   std::inserter(result, result.begin()));
    return result;
}

template <typename T>
bool set_contains(const std::set<T>& s, T e) {
    return s.find(e) != s.end();
}

int main() {
    std::set<int> s1 = {1, 2, 3, 4, 5};
    std::set<int> s2 = {4, 5, 6, 7, 8};

    int e = 5;

    if (set_contains(union_sets(s1, s2), e)) {
        std::cout << "Yes" << std::endl;
    } else {
        std::cout << "No" << std::endl;
    }

    return 0;
}

I think that this is more representative of the kind of code we want to generate. That being said, I'm taking a very biased stance in what kind of semantics I'm assuming we want out of the set API in GOOL. So, my question is: what do we want out of the set API in GOOL? What should these functions translate to in each of the languages, semantically?

For example, "set addition" is a rather uncommon thing (at least in my experience...) -- I'm not quite sure how we would even go about writing that, we usually would see something like $S \cup \set{e}$ (i.e., set union). Similarly "set removal" is uncommon too -- one would normally see $S \ \set{e}$ (i.e., set difference). For Expr and ModelExpr, I'm not sure if we should even have these operators. "Set addition" does make sense on the side of GOOL, however, because that usually relates to an imperative-statement, likely a 'void' or 'boolean' function/method, unlike in mathematics where the result of the operation would be another set. For example, in Java, set add returns a boolean, C# similarly, while in Python, s.add(x) returns nothing (None, technically).

We have a pipeline of expressions: Expr -> CodeExpr -> GOOL (/GProc now?). Now we need to figure out what 'set' means to each of these stages and what kinds of operations belong to each (and what those operations do).

Xinlu-Y commented 1 week ago

The function containsInt does indeed implement the contains functionality for sets in C++ using the find and end methods. It supports not only the Integer type but also other types that GOOL supports. Therefore, I believe it is just a naming typo. https://github.com/JacquesCarette/Drasil/blob/76ce3095201af24ad47292731f6033546376baeb/code/drasil-gool/lib/Drasil/GOOL/LanguageRenderer/CommonPseudoOO.hs#L148-L149 Currently, our basic set tests do not include functionality to test union operations. However, to ensure consistency across different languages, the union function in C++ should be translated in a way that aligns with its implementation in other languages.

Given this, to maintain semantic consistency across different languages, I think we need to implement a similar imperative-style union function for C++ as well. This function should behave like its counterparts in other languages, allowing direct modification rather than requiring the use of iterators or creating additional intermediate sets.

If we implement custom functions for set operations in C++, we align the behavior with that of other languages like C#, Java, Python, etc. This means that when we generate code for C++, these tests will work correctly just as they do in other languages, providing reliable and readable code.

  setDecDef (var "s" (setType int)) mainFn (litSet int [litInt 4, litInt 7, litInt 5]),
  valStmt (setAdd (valueOf (var "s" (setType int))) (litInt 10)),
  assert (contains (valueOf (var "s" (setType int))) (litInt 10))
    (litString "Set s should contain 10 after setAdd")
  setDecDef (var "t" (setType int)) mainFn (litSet int [litInt 3, litInt 5]),
  setDecDef (var "unionSet" (setType int)) mainFn (setUnion (valueOf (var "s" (setType int))) (valueOf (var "t" (setType int)))),
  assert (contains (valueOf (var "unionSet" (setType int))) (litInt 10))
    (litString "Set unionSet should contain 10 from set s"),
balacij commented 1 week ago

Great!

Thank you! This clarifies a lot for me!

In that case, I think that Java deviates from the rest because addAll is a boolean-returning method, while the others return a reference to a new set (right?).

I think that generating these polyfill functions is the right way to go to keep the code looking the same across generated code bases. @JacquesCarette @smiths Do you have any thoughts on this?