Update the task template

R3BRootGroup / R3BRoot

Framework for Simulations and Data Analysis of R3B Experiment

https://github.com/R3BRootGroup/R3BRoot/wiki

GNU General Public License v3.0

18 stars 104 forks source link

Update the task template #894

Closed YanzhaoW closed 1 year ago

YanzhaoW commented 1 year ago

Update R3BRoot/template/NewTask/base with R3BIOConnector proposed in #893.

Please add the comments below if there are any suggestions. @klenze

Depends on #893

(Let's see how Github manages a PR based on another unmerged PR. ;D)

Checklist:

[x] Rebased against dev branch
[x] My name is in the resp. CONTRIBUTORS/AUTHORS file
[x] Followed the pull request guidlines and the Git workflow
[x] Followed the seven rules of great commit messages

klenze commented 1 year ago

I will look more into it later, for now let me cite the string_view documentation:

Unlike std::basic_string::data() and string literals, std::basic_string_view::data() returns a pointer to a buffer that is not necessarily null-terminated, for example a substring view (e.g. from remove_suffix). Therefore, it is typically a mistake to pass data() to a routine that takes just a const CharT* and expects a null-terminated string.

My take is to just use const std::string, which is a bit more expensive but provides the safe-ish (lifetimes!) method c_str().

YanzhaoW commented 1 year ago

Well, std::string is definitely a good option. But std::string_view provides much more convenience.

If we use std::string, the input has to be std::string:

auto task = Task(std::string{"a_task"});

On the other hand, if we use std::string_view, input can have any string-like type (without copying). So it still works with this:

auto task = Task("a_task");

Therefore, it is typically a mistake to pass data() to a routine that takes just a const CharT* and expects a null-terminated string.

I would say this is a very rare case since we don't usually create char arrays on purpose (I have to look it up how to do this). And if someone deliberately does this to shoot himself in his foot, then let it be.

klenze commented 1 year ago

First, the lack of the explicit keyword for the definition of the constructor (5) of basic_string seems to indicate that you can initialize a std::string using a zero terminated char pointer implicitly, just like you can for string_view.

Secondly, if an interface takes a std::string_view, that clearly suggests that it will work with any valid string_views. Implicitly relying on certain constraints which may be true for some instances (not per standard, but as implementation details) seems like a bad way to define an interface.

If your constructor requires a zero terminated string, then it would be better to call the argument const char* name /*must be a be zero terminated string*/. (Of course, this is also not great, but at least the user here is aware that they deal with C-style strings).

Explicit char array trickery is not required to construct string_views which violate your assumption of zero-terminatedness.

  std::string_view a{"12345"};
  a.remove_suffix(3);
  std::cout << a << "\n";
  printf("%s\n", a.data());

The cout will print "12", while the printf will print "12345", because the string referred to by a is not zero-terminated.

Is it terrible likely that someone will pass such a wrongly terminated std::string_view to that class? Certainly not. Still, especially given that this is code meant to be built on by new programmers, I think "it is fine to build hidden assumptions into your interfaces" is the wrong message to send, and nobody will ever care about the runtime and memory penalty of copying the name to a std::string in any case.

YanzhaoW commented 1 year ago

First, the lack of the explicit keyword for the definition of the constructor (5) of basic_string seems to indicate that you can initialize a std::string using a zero terminated char pointer implicitly, just like you can for string_view.

Yeah, you are quite right about this. Not sure how I got this wrong idea.

Secondly, if an interface takes a std::string_view, that clearly suggests that it will work with any valid string_views. Implicitly relying on certain constraints which may be true for some instances (not per standard, but as implementation details) seems like a bad way to define an interface.

I wouldn't say using std::string_view is relying on certain constraints. Rather it just says this interface accepts a reference to any string-like data and pass the reference to the underlying function. In this case, AnyTask just pass the reference of a task name to FairTask, no matter the string of the name is null terminated or not. If you want your string to be terminated, you should either choose your input string wisely or do some checking in FairTask. AnyTask, as a kind of forwarding interface to FairTask, shouldn't be concerned about this.

The cout will print "12", while the printf will print "12345", because the string referred to by a is not zero-terminated.

For a better comparison, you should also use a.data() for cout:

  std::string_view a{"12345"};
  a.remove_suffix(3);
  std::cout << a.data() << "\n";
  printf("%s\n", a.data());

Then you get same result. As the name suggested, std::string_view is just a view, which can't modify the underlying data it refers to. The default view of the string data excludes the null terminator. using remove_suffix makes the string_view referring to a "substring" of the char array.

char name[] = {'a', 'b', 'c', 'd', '\0'}; // size = 5
auto name_view = std::string_view(name); // size = 4
name_view.remove_suffix(2); // size = 2
name_view.data(); // const char*

I think "it is fine to build hidden assumptions into your interfaces" is the wrong message to send

I agree. But this is the responsibility of FairTask. AnyTask just pass this string through to FairTask (and it's preferred not to do copying).

nobody will ever care about the runtime and memory penalty of copying the name to a std::string in any case.

I would give a second thought about this. Since we are dealing with a template, which, presumably, needs to be viewed as a standard way to implement many task classes. If all of them have this string copying, it may have some impacts on performance.

klenze commented 1 year ago

Regarding interfaces, an analogy.

Party A hires party B to clean their silk shirt. After a few days, Party B brings back a ruined shirt. "Well, you see, I always outsource the washing to Germophobe-Joe's Boiling Laundry, and he insists on washing at 90 degrees centigrade. This has ruined a lot of my customers clothes over the years. Blame him for your shirt. Or blame yourself for giving me laundry which was not boilable even though I always use Joe's." Party A might respond that it does not give a damn about any Joe, but B was negligent in selecting the wrong subcontractor, and thus takes the full blame.

If one uses the data method, then a string_view is functionally equivalent to a character pointer. If one intends to use it that way, I would much prefer to use a const char* instead, because that would make the intent more plain. Or we could use a custom ZeroTerminatedConstString wrapper. If you need a class for a square, don't just take a rectangle class and assume that both sides are the same length.

I would give a second thought about this. Since we are dealing with a template, which, presumably, needs to be viewed as a standard way to implement many task classes. If all of them have this string copying, it may have some impacts on performance.

So far I'm aware, all users of R3BRoot have kept their task names to below a megabyte, and their number of tasks per run in the low thousands. :-)

As a rule of thumb, if a thing happens outside the event loop and is not by itself very costly, it is likely not worth worrying about. TObject::fTitle likely already creates a copy of the const char*, another copy won't hurt too much.

If, a few decades down the road, it becomes fashionable to create new tasks within the event loop, and performance measurements point to the copying of the task names out of all things as the bottleneck, then we can always switch from std::string to const char* and fix performance.

YanzhaoW commented 1 year ago

If one uses the data method, then a string_view is functionally equivalent to a character pointer.

That's not true. The data stored in a string_view is a char pointer PLUS a size, which makes the whole thing a lot better. :D

If one intends to use it that way, I would much prefer to use a const char* instead, because that would make the intent more plain.

const char* is certainly not an option IMO. Nothing can be plain when passing a data array using a pointer. We always have to assume that in our case const char* means an array not a pointer to a single const char. In the meanwhile, the we also have to assume it has a null terminator if we use const char*.

On the other hand, std::string_view doesn't have this issue. Even if the char array doesn't have null terminator, it's still fine. If you run following example:

#include <iostream>
#include <string>

void Print(const char* name) {
    auto str = std::string{name};
    std::cout << str << "\n";
}

void Print2(std::string_view name) {
    auto str = std::string{name};
    std::cout << str << "\n";
}

int main() {
    const char name[] = {'a', 'b', 'c', 'd', 'e'};
    const char name2 = 'a';
    const char name3 = 'b';
    Print(name);
    // Print2(name);
}

the printout may be abcdeba (this could be different as we are dealing with UB). However, if you use Print2, the result is always abcde.

So far I'm aware, all users of R3BRoot have kept their task names to below a megabyte, and their number of tasks per run in the low thousands.

That's true. But default is important. Something like always return const Type& from getters and pass const Type& to setters. Does it have a real performance improvement in every case? Probably not. But following these defaults, we don't need to think the details and put our focus on something more important. And in case of a name or a string, it's const std::string& by default. With C++17, const std::string& is replaced by std::string_view (in most cases).

Party A might respond that it does not give a damn about any Joe, but B was negligent in selecting the wrong subcontractor, and thus takes the full blame.

Suppose party B is a delivery guy. I guess most of us don't blame the delivery man for the bad food or mistakes made by either the costumer or the restaurant. :D

klenze commented 1 year ago

The thing is, the author of this class (B) is not just a delivery man. If B got an argument X of type x_t from your caller (A) and passed it to its base class as an x_t, then B could claim that any issues with X are between A and C.

The analogy here would be B taking a package labeled "not for human consumption", ripping of the label and writing "enjoy your meal!" on it instead.

std::string_view is safe and well-defined on its own, zero terminated char arrays are safe and well-defined only if you don't make any mistakes (which is why they are discouraged). Reinterpreting the former as the latter is not correct in some cases.

I see the following ways forward:

We lobby FairRoot into using state of the art C++. This would involve getting rid of TTask (which is obsolete) and maintaining their own base class.
We use std::string, which is not performance optimal, but can be safely converted to C strings.
We embrace Adorno ("Es gibt kein richtiges Leben im Falschen") and just take a const char. If you want to avoid ambiguities, you could have ``using zero_terminated_character_array=char`` and use that instead, or just write a comment after the pointer argument.

What we should not do is any of the following:

Testing if a std::string_view is safe to convert. By it's nature, string_view does not make claims about the memory ownership of about sv.data()[sv.size()], so even if testing if it is zero can already cause a segmentation fault.
Just relabeling using .data().
Creating a temporary std::string and using it's .c_str(), as in : FairTask(std::string(name).c_str()), because the std::string will be out of scope when FairTask gets called, and the char* invalid.
Write our own custom string or string_view class (not worth it, also, there are already too many incompatible QStrings and TStrings in the world).
Write multiple constructors differing on decltype(name).
Arguing endlessly about this minor point.

I agree that using const Type& for arguments is probably a good idea generally (at least if sizeof(Type)>sizeof(void*), and it is worthwhile if people pick that up.

However, the rules for performant string handling in physics analysis are very simple: Don't do strings processing per event. Outside event loops, what I expect from R3BRoot contributors is that they can use strings correctly. Fancy classes like string_view which don't own the buffer get complex to use when the life-time of the buffer is shorter than the life-time of these classes (e.g. if you had a string_view member variable).

If we were writing a parser, a regex library or a database management system, then we would have to be experts on fast string handling. As it is, even if the task constructor took a std::string by copy instead of by reference, that is unlikely to be within my top ten complaints about class implementation.

YanzhaoW commented 1 year ago

We embrace Adorno ("Es gibt kein richtiges Leben im Falschen") and just take a const char. If you want to avoid ambiguities, you could have using zero_terminated_character_array=char and use that instead, or just write a comment after the pointer argument.

Using the name of a variable or a type to indicate the constraint is a poor design because it doesn't make program crash at the right moment when it should. But let's just compare the following methods:

using zero_terminated_character_array=char*

using zero_terminated_character_array=std::string_view

The second one can do everything the first one can do. If we pass a NUL-terminated string, both work. If we pass a non-NUL-terminated string, both fail. And from the example above we could see when passing a char array (char[]) the second method works but the first fails. So the conclusion is std::string_view can do everything const char* can do and just be better in every way. I don't know why we are still talking about const char*. It's already out of game and right into legacy trash can. The only players in the game are std::string and std::string_view.

Testing if a std::string_view is safe to convert. By it's nature, string_view does not make claims about the memory ownership of about sv.data()[sv.size()], so even if testing if it is zero can already cause a segmentation fault.

Just relabeling using .data().

Creating a temporary std::string and using it's .c_str(), as in : FairTask(std::string(name).c_str()), because the std::string will be out of scope when FairTask gets called, and the char* invalid.

There is always a risk of passing a pointer because we could never know whether the data pointed is still valid or not. And I don't think there is something wrong with : FairTask(std::string(name).c_str()). The correct way to resolve this life-time issue is to always use the pointer (or a reference) on the fly and never store them into a member variables directly. Instead names should be stored in std::string. Same principle goes to std::span, std::mdspan and std::views.

klenze commented 1 year ago

You are indeed correct that : FairTask(std::string(name).c_str()) works. C++ mandates that temporary instances are deleted at the end of a full expression. (What would not work is return std::string(name).c_str().) IIRC I hade some problem related to temporaries when going from boost::format stuff to c strings at some point, but don't recall exactly what they were.

I agree that it is always preferable to use the type system to just using alias names, but I still think that there is a use for transporting intent using type alias names.

For example, I would preferentially call what is to the compiler a 32 bit signed integer int in contexts where I require some integer of a length of at least four bytes (or don't care about the length at all because I am counting to four), while if I require precisely four bytes (e.g. for memory alignment) I would call the same type int32_t.

Many functions have preconditions for their arguments which are not commonly expressed in a type system. For example:

f(unsigned int i) //i should be smaller than four
SetTime(double t) // in CLHEP units
std::vector<T> merge(std::vector<T> s0, std::vector<T> s1, auto comp) // s0 and s1 have to be ordered wrt the operator comp

If we knew that FairTask would get a std::string_view constructor at the end of the year, I would now prefer to use:

ExampleTask::ExampleTask(std::string_view name)
 : FairTask(std::string(name).c_str()){}

Then we would not have to change our interface when FairRoot starts using C++ strings.

However, in the reality we are inhabiting, this is not going to happen. So we are stuck with these options:

Cheap_A: using std::string_view and passing sv.data()
Cheap_B: using const char* and passing it
Safe_A: using const std::string& and passing s.c_str()
Safe_B: using std::string_view and constructing a temporary string, as discussed above.

My preference ordering is Safe_A>Safe_B>Cheap_B>Cheap_A. The TL;DR is the following:

Runtime here is not critical, so safety should be preferred.
Safe_A is not more costly than Safe_B and less costly in the case of the user already passing a std::string object (which would have to be converted to std::string_view and back for Safe_B), it is also more readable than Safe_B.
For the unsafe, cheap cases, we have a type whose default interpretation is zero terminated character array. It is called char*. (This is certainly the default interpretation ostream with operator<< uses when it encounters that type.) While I agree that std::string_view can not be mistaken for a pointer to a single character, the fact remains that string_views are generally not required to be zero terminated. If Cheap_A is called with a string constant, the following will happen:
- the string_view constructor will store a char* and also use the equivalent of strlen to store the length.
- both of these values will be stored on the stack
- As you just use sv.data(), the length which string_view carefully calculated for you is never used. Instead, the char* is reinterpreted as a zero-terminated string contrary to explicit intent of the authors of that class.

So in the end Cheap_A is more confusing (because you violate the default interpretation of a type) and more costly (because you take and record the length which you never use).

Last analogy (I promise). Suppose that someone heard that raw pointers are evil and one should use C++ pointer types instead. So they do the following:

struct Bar
{
  Foo a;
  std::unique_ptr<Foo> get()
  {
    // Use like this:
    //  auto p=MyBar.get();
    // Please set
    //  *(void**)&p=0;
    // before the variable storing the returned value goes out of scope!
    return std::unique_ptr<Foo>(&a);
  }
};

To that I would reply: "You are not using unique_ptr as intended, but abusing that class to transport a pointer with different ownership semantics. It is better to use 'evil' C constructs which have well known ambiguities in interpretation (like Foo) than violating the supposedly unambiguous interpretation of well known C++ constructs, which is about 100x as evil." (I am aware that in that specific case, there is an even better solution than keeping Foo, namely weak_ptr, but that is not my point.)

This is also exactly how I feel about using string_view to smuggle what will be interpreted as zero-terminated char arrays.

I think that taking a const std::string& and using name.c_str() is a good compromise between:

being safe
being unambiguous and non-misleading
promoting modern C++ (given our constraints)
being readable even for people who have little experience with C++ character string types
being reasonably fast

So rather than continuing the discussion if char* (which some people will mistake as a pointer to a single character) is more or less evil than your suggested use of string_view (which most people will mistake as having an explicit length rather than being reinterpreted as a zero terminated array), can we perhaps agree that std::string& is good enough and move on?

YanzhaoW commented 1 year ago

I made a wrong statement, when you pass a char* to std::string_view, it will not deduce the size of the char array. Rather the best way to pass char* to string_view is:

const char name[] = {'a', 'b', 'c'};
const auto name_view= std::string_view{name, std::size(name)};

Otherwise, it would still be UB.

The thing with const std::string& is that it still copies the string if you use const char*. Again, std::string_view doesn't have this problem.

If we knew that FairTask would get a std::string_view constructor at the end of the year, I would now prefer to use:

Let's ust be simple:

ExampleTask::ExampleTask(std::string_view name)
 : FairTask(name){}

Safe_B: using std::string_view and constructing a temporary string, as discussed above.

I don't understand here. Why don't you just use std::string and pass std::string::c_str() to FairTask?

For the unsafe, cheap cases, we have a type whose default interpretation is zero terminated character array. It is called char*. (This is certainly the default interpretation ostream with operator<< uses when it encounters that type.) While I agree that std::string_view can not be mistaken for a pointer to a single character, the fact remains that string_views are generally not required to be zero terminated.

The only difference in data structure between string_views and char (apart from the type name) is string_views has one additional data to describe the size of the array. You could interpret string_views as `char + size_t. Whatever is required or interpreted forcharapplies tostring_view. string_view is generally not required to be zero terminated.charis also not required to be zero terminated. I'm not sure your meaning of the "default interpretation". For me, the "default interpretation" of every string type is zero terminated, no matter it's represented bychar*,std::stringorstd::string_view`.

So when you break this interpretation and use a not-NUL-terminated char array, everything is broken no matter you use char*, string or string_view:

void Read_string(std::string name);
void Read_c_char(const char* name);
void Read_string_view(std::string_view name);

const char name[] = {'a', 'b', 'c'};

Read_string(name); // UB
Read_c_char(name); // UB
Read_string_view(name); // UB

Here is an example of using std::string.

So the best solution is just not using non-NUL-terminated string.

But if we still have to deal with the case of a non-NUL-terminated string without data copying, passing std::string_view is the only option. (unless you are a C programmer and do something like void Read(const char* name, int size))

YanzhaoW commented 1 year ago

"You are not using unique_ptr as intended, but abusing that class to transport a pointer with different ownership semantics. It is better to use 'evil' C constructs which have well known ambiguities in interpretation (like Foo) than violating the supposedly unambiguous interpretation of well known C++ constructs, which is about 100x as evil." (I am aware that in that specific case, there is an even better solution than keeping Foo, namely weak_ptr, but that is not my point.)

I won't say this example is abusing a new feature but rather simply a wrong usage. std::unique_ptr<Foo>(&a) is plainly wrong because a is a value and for a value, the concept of the "ownership" doesn't even exist. Therefore, unique_ptr here is not a choice. What we are discussing here is whether the usage of string_view here is appropriate or not. It's actually very simple to decide: Do you need string or const string&? if you need const string&, do you use C++17? If yes, you should use std::string_view.

So rather than continuing the discussion if char* (which some people will mistake as a pointer to a single character) is more or less evil than your suggested use of string_view (which most people will mistake as having an explicit length rather than being reinterpreted as a zero terminated array), can we perhaps agree that std::string& is good enough and move on?

Let's move on by just taking std::string since if it's used properly, it's as efficient as std::string_view.

klenze commented 1 year ago

Let's ust be simple: What I said was:

If we knew that FairTask would get a std::string_view constructor at the end of the year, I would now prefer to use

Of course, once they had that constructor we would switch to the simple version.

So when you break this interpretation and use a not-NUL-terminated char array, everything is broken no matter you use char*, string or string_view

The problem in these examples arises from the fact that you are implicitly constructing std::string, std::string_view using the const char* constructor without passing size.

However, if you construct a string/string_view by some other mean, there is no reason to expect that it's internal representation will be zero-terminated. (Only since C++11 zero termination is mandated for std::string; before that, implementations were free to use realloc when someone called c_str and they had to append a zero byte.)

Whatever is required or interpreted for char applies to string_view. string_view is generally not required to be zero terminated. char is also not required to be zero terminated.

char* is assumed in many contexts (e.g. without a separate length argument) to be zero terminated (because otherwise you could not know the length). By contrast, almost nobody assumes that the internal representation of string_view is zero terminated (and std::string_view::data explcitly warns you not to assume that) because the length of a string_view is a member of that class.

Do you need string or const string&? if you need const string&, do you use C++17? If yes, you should use std::string_view.

That is true if you do not require access to the internal representation. If you need to cheaply convert it to a zero terminated character array (which we do), then string_view is not an option.

Let's move on by just taking std::string since if it's used properly, it's as efficient as std::string_view.

Agreed.

I think the moral of the story is that C++ inherited a lot of cruft from C, and parts of ROOT very much rely on that cruft. (If the language was designed from the start, there would be no reason why the literal "foo" should not be a string_view or why int a[5] should not be equivalent to std::array<int, 5> a, or why we would need raw pointers, references and smart pointers, or why 010 should not be the same literal as 10, or why we have both parenthesis and curlies for constructor arguments, or why char and int8_t are the same type or why templates reuse comparison symbols for their parameters, or why we are stuck with that horrible excuse for a preprocessor instead of what rust offers.)

Of course, this legacy support is also the reason for C++'s success.

YanzhaoW commented 1 year ago

Ok, I change to const std::string&, because if I use std::string, clang-tidy detects the string is not moved and can be totally changed into const std::string& to avoid unnecessary copying.

jose-luis-rs commented 1 year ago

Hello @YanzhaoW is this ready? Thanks in advance

ajedele commented 1 year ago

What is the plan for implementing this template?

Will you take all current code, replace the current standard with the new template?

Or is the expectation that all 'new' code needs to be in the new template. If that's the case, this is forcing the people who are nice enough to expand upon the code to do the copying and pasting. I object to this. This would be forcing a template upon us and washing you hands clean of the responsibility that comes with it.

Best regards, Andrea

YanzhaoW commented 1 year ago

@jose-luis-rs Yes.

@ajedele

What is the plan for implementing this template?

The plan is same as when it was created in the first place.

Will you take all current code, replace the current standard with the new template?

No, we don't have neither the manpower nor expertise.

Or is the expectation that all 'new' code needs to be in the new template.

I would say it's suggested to do that. In my opinion, it's more like an example instead of a rule. If you create a new class, rather than copying a low-quality code, copy this. If you can't do exactly the same thing, then do it in your own way. But we should always hold a higher standard to the new code in folders like r3bdata and r3bsource because they are used by everyone.

This would be forcing a template upon us and washing you hands clean of the responsibility that comes with it.

I don't quite understand this. This isn't such "clean of the responsibility". You should be always and solely responsible for the code you create. I created this template. I'm responsible for this template. You create a Cal2Hit class. You are responsible for the Cal2Hit class.

klenze commented 1 year ago

Agreed, the template is meant as an acceptable basis on which to build new classes. It is clearly not the only acceptable basis.

For example, I would be ok with a FairTask which:

Still uses const char* name in the constructor
Does the silly TCA-handling by hand or through another mechanism instead of the IOconnector

I think that we should only strictly enforce standards which are already enforced on our existing code (and perhaps easily fixed stuff such as "initialize all variables and members in-line"). If a PR is above the ~60th percentile of the code base in terms of maintainability, that is good enough for me.

hapol commented 1 year ago

I also agree. A template should be updated for improvements or corrected. We should have the best possible example for new classes. If a developer uses the last version, he/she could profit from these corrections. If not, we should fix more things during the PR phase, but as @YanzhaoW said, it is suggested to use the new, not obliged. It cannot be objected based on the effects on the code already written but not yet committed, ... then it will never be a right moment for correcting or updating.