Adjusting string interface

Night-walker commented 10 years ago

I think string interface should change a bit so as to adjust to the new invar feature. The methods like convert(), replace(), expand() and change() should not modify the original string, returning a new string instead. They all already use a temporary string internally (don't miss my comment at #174 regarding lowercase/uppercase conversion), so this won't change their efficiency. Even trim() and chop() could be changed that way, simply because these methods are usually not used on large strings, so the possible performance reduction is certainly negligible.

As for the interface itself, I see the following advantages.

1) Compatibility with invar. That's essential to avoid breaking functional code style for such trivial operations as replacing, as well as to eliminate the need to write invar wrappers for non-invar methods.

2) Better flexibility. I already stumbled upon the inconvenience of returning the result of replace(): var new = old; new.replace(...); return new. Apparently, doing an in-place replacing with an invar replace() is still trivial: str = str.replace(). The number of occurrences replaced is a rather useless information most of the time, so I would definitely prefer an invar replace().

3) Elimination of references for primitive data. DaoProcess_PutReference() is basically a small hack in the Dao typing system, and I'd rather discourage its use or exclude it completely from the VM. When you see a f() => string, you should be sure that it returns a "normal" string, rather then a reference to something somewhere. This is far more important then subtle, virtually insignificant increase in performance of methods like trim() and chop().

dumblob commented 10 years ago

Well, I'm really not convinced that always making a copy is a good idea. Actually quite the opposite. If I imagine how such an internationalized array of characters look like in latin-based languages, the non-ASCII characters take up around 30% of them. Therefore we could even introduce "lazy" string copying in implementation of convert() replace() expand() such that these will work in-situ until they come across non-ASCII character (i.e. 2+ bytes long one) when a new string will be allocated and the already-walked-through characters copyied to it.

All the 3 cases @Night-walker mentioned doesn't seem that common to me, so I'm not convinced this change to "always make a copy" approach will be useful (and now I'm talking primarily about performance).

But yes, we could provide another String type with proposed invar-friendliness to match needs of purists :) who don't care that much about performance and simplicity.

Night-walker commented 10 years ago

Again, I don't quite understand you :) convert(), replace(), expand() and change() already work with an internal string buffer. A copy if you like, though it isn't a correct word. Adjusting the interface does not lead to "making a copy", it only changes the data flow so as to make it more streamlined.

dumblob commented 10 years ago

I didn't look into the code..., sorry. Now I did :). Could you please elaborate on how the change could look like e.g. in DString_FindReplace? Anyway, I agree with DaoProcess_PutReference removal if possible.

Night-walker commented 10 years ago

Could you please elaborate on how the change could look like e.g. in DString_FindReplace?

DString_FindReplace() currently may

detach the original string (which is quite likely to make its copy) and then change it (additional partial copying of data)
create a new string, use it to form the result and then assign it to the original string

The modified function would simply create a new string and return it. The average-case performance might even become better, depending on how often detaching may happen with the present implementation.

It seems most of languages and frameworks use non-modifying replace(), trim(), etc. Even Qt does it, so performance is certainly not a problem. It's just a matter of the interface: I find that methods which return new string are more "clean" and flexible. And utilizing them like str = str.replace() when you want an in-place change is not a problem, unlike using the current methods for invar-intensive code.

daokoder commented 10 years ago

Actually expand() already returns a new string (convert() was also intended so, as shown by the invar in the signature, but looking into the code, I found I forgot to update the code itself). And replace() and change() do not return a string, instead, they return a integer for the number of replaces and changes that have been made. So no confusion here.

But method such as chop() and trim() do return a reference of itself, the original rationale for this is that this will allow some methods to be chained in consecutive calls. But this is really not important. So I agree to change this. But returning a new copy of the string does not seem a convincingly good idea, because the names of these do imply modification of the string itself, just like erase(), replace() and change() etc. So I will change them to return nothing (none by default actually).

1) Compatibility with invar. 2) Better flexibility. 3) Elimination of references for primitive data.

You have mentioned very tempting argument for the changes you are proposing. I agree DaoProcess_PutReference() must go. But unless we remove clear(), and change insert() and erase(), change/remove resize() etc. as well, your changes will still leave some inconsistency. I have considered something similar before, I was just not very comfortable to such big changes. So now the question is, is it really better to disable string modification entirely by methods?

Night-walker commented 10 years ago

And replace() and change() do not return a string, instead, they return a integer for the number of replaces and changes that have been made.

Which is rather inflexible and ineffective:

routine f(str: string) => string
    str.replace(...) # detach, full copy and then replacing -- absurd
    return str
}

I don't need the number of substitutions -- it is rarely a useful information. I need a convenient and effective way to do an invar replace, which Dao does not provide and which cannot be implemented in Dao itself. Making replace() return result in a new string is a much more flexible solution, and much more efficient if you care about invar.

But method such as chop() and trim() do return a reference of itself, the original rationale for this is that this will allow some methods to be chained in consecutive calls. But this is really not important. So I agree to change this. But returning a new copy of the string does not seem a convincingly good idea, because the names of these do imply modification of the string itself, just like erase(), replace() and change() etc. So I will change them to return nothing (none by default actually).

That would be even worse because of lack of rvalue. Why not simply rename them as trimmed() and chopped()? As I said, trimming and chopping into a new string should not visibly affect performance taking into account how such operations are usually used.

You have mentioned very tempting argument for the changes you are proposing. I agree DaoProcess_PutReference() must go. But unless we remove clear(), and change insert() and erase(), change/remove resize() etc. as well, your changes will still leave some inconsistency. I have considered something similar before, I was just not very comfortable to such big changes. So now the question is, is it really better to disable string modification entirely by methods?

There is no need to change insert(), erase(), etc. -- they don't hinder the use of invar strings. These operations can be done with slicing plus concatenation, if one wants it that way, and it's not very cumbersome. Also, it's easy to create invar wrappers for such methods, and they will be effective -- unlike for the current replace() or change(), for instance.

I suggest taking a close look at Qt. I highly appreciate how simple and convenient it handles various things -- and effectively, no doubts. It happens that QString provides exactly the interface I proposed, even despite Qt is definitely not build around functional programming or immutability or anything. And there is nothing inconsistent there.

dumblob commented 10 years ago

I have a different opinion especially on the trim and chop methods. In most cases I've seen they're used (and I do it so as well) on large strings (e.g. read whole file/stdin/... and immediately run trim/chop).

Second, QString is not so fast as other string libraries (especially those in C) and therefore I don't consider it as an argument for a good invar interface.

Third, I have no idea why you in most cases where you change/replace something in a string need the original string as well? Any examples?

Fourth, I agree that if I want to substitute something in a string, I'll definitely use regex methods as I want to work with characters and not bytes.

Night-walker commented 10 years ago

I have a different opinion especially on the trim and chop methods. In most cases I've seen they're used (and I do it so as well) on large strings (e.g. read whole file/stdin/... and immediately run trim/chop).

Perhaps, though in-place trimming is only fast when no leading whitespace is present. Most modern languages seem to provide only non-mutating string trimming/chopping. Apparently, performance regression is not deemed to be an issue.

Second, QString is not so fast as other string libraries (especially those in C) and therefore I don't consider it as an argument for a good invar interface.

Really? Dao is around two orders of magnitude slower then C, and its runtime performance is mostly affected by various stuff around the bytecode. Most other things which don't deal with file access or so has rather low impact on the overall performance.

Moreover, comparing QString to some simplistic third-party C strings which don't even use implicit sharing is rather questionable. I can bet that QString is faster then the internal DString used in Dao due to the use of atomic operations, UTF-16 and sometimes more refined algorithms. But it's not really that important.

Third, I have no idea why you in most cases where you change/replace something in a string need the original string as well? Any examples?

It's not really about keeping the original string. I would always use invar as much as possible in a sufficiently large code, and current replace() and change() probably create the most drag with strings here. And I often find the need to chain or compose string processing, as it's simply convenient.

Fourth, I agree that if I want to substitute something in a string, I'll definitely use regex methods as I want to work with characters and not bytes.

Not at all. If you don't wan't to substitute a pattern, there is absolutely no reason in using regex methods. Characters and bytes don't matter here, I don't understand why you persist to believe that you have to "work with characters" in some special way.

daokoder commented 10 years ago

I can understand why you want replace() to return a new string. For replace(), there may be the cases where you want to replace the same (possibly invar) string in different ways to generate different strings. But trim() and chop() are different, why would one want to pass a invar string to these methods? Also there are only a few ways to trim and two ways chop a string. Calling them on the same string to produce new strings does not seem reasonable to me. For me, trim() and chop() are the same type of methods as erase() and insert(), so I don't see reasons to handle them differently. In some way, replace() and change() are also the same type of methods as erase() and insert(). I am still not convinced that they should be handled differently.

Of course, one solution is to provide overloaded methods for invar strings to return new strings. Then another question is, do we really need more methods?

Night-walker commented 10 years ago

The problem with all those methods is that they are not well-compatible with invar and are not flexible. As I showed, their use often leads to needless detaching which may essentially eliminate any performance advantage over in-place operations -- even if you don't use invar at all. And they don't have any real convenience benefits either.

On the contrary, emulating an in-place changes with invar methods is trivial and obviously does not have negative effect on the performance.

To summarize, mutating methods comparing to their invar alternatives are:

slightly faster when no aliasing takes place (only tail-trimming and chopping may have a significant bonus here)
definitely slower otherwise (up to two times with most methods due to detaching!)
not usable as rvalues (replace(), change()) or
use hidden, non-intuitive referencing (trim(), convert())
require wrappers for invar-intensive code

I think invar methods would generally be more flexible and robust, particularly in case of replace() and change(). Perhaps we could provide both mutating and invar methods for certain cases, but the latter should definitely be the preferred choice. Virtually all modern languages default to non-mutating string handling, even for insert() and erase().

Maybe we could provide replace() and replaced(), trim() and trimmed() and etc. If that's too much, at least replace() and change() could be turned into invar methods, maybe adding trimmed() and chopped() as well.

However, I'm starting to think that making all string methods invar is not such a radical idea after all, looking at other languages. It may be the most simple, consistent and balanced option among the others, and I prefer it at the moment.

dumblob commented 10 years ago

@Night-walker, those are strong words and I'm going to deny commenting on this issue until we find few good string benchmarks and run them against the current string methods implementation and the proposed "invar & copy" implementation.

Btw take a look at http://www.ucw.cz/libucw/ - this is one of best C libraries I've ever seen to handle strings and also much more high-level stuff.

Night-walker commented 10 years ago

@Night-walker, those are strong words and I'm going to deny commenting on this issue until we find few good string benchmarks and run them against the current string methods implementation and the proposed "invar & copy" implementation.

I suppose the fact that most high-level languages emphasize non-mutating string handling speaks for itself. Unless they employ some tricky detection of temporary values (doubtful, but can't say for sure), performance is apparently not considered to be of concern here.

Still, if you care about squeezing as much speed as possible, the performance of invar methods vs non-invar ones basically depends on aliasing. More aliasing -- more needless detaching for non-invar methods. Less aliasing -- more needless copying for invar methods. This is not easy to benchmark, as the results will depend on the organization of code rather then on its logic.

daokoder commented 10 years ago

slightly faster when no aliasing takes place (only tail-trimming and chopping may have a significant bonus here) definitely slower otherwise (up to two times with most methods due to detaching!)

These in place operations are not about performance, but about semantics. As I argued in my previous reply, trimming, replacing or changing normally means processing the self string itself. Making them to support invar string and return new strings just don't sounds very consistent with the method names, and also not consistent with how insert() and erase() is handle.

Also improvements can be made to avoid unnecessary detaching.

not usable as rvalues (replace(), change()) or

The returned value can be changed to return a shallow copy of the self string. But I really don't see it as a serious issue.

use hidden, non-intuitive referencing (trim(), convert())

Referencing has been disable.

require wrappers for invar-intensive code

I have to see examples, where invar strings and these trim(), chop(), replace() and change() are mixed extensively. It's just logically unlike to be common cases.

However, I'm starting to think that making all string methods invar is not such a radical idea after all

That's what I mean by "disable string modification entirely by methods". We could even remove insert() and erase(), which are rarely used (by myself at least), and are not included in Java String as well. Of course, Java string is just string, it has separated type for byte array. But Dao string is intended as both text string and byte array. If we make Dao string immutable by methods, it will become much less useful as byte array. So I am less convinced it is a good idea.

Night-walker commented 10 years ago

These in place operations are not about performance, but about semantics. As I argued in my previous reply, trimming, replacing or changing normally means processing the self string itself. Making them to support invar string and return new strings just don't sounds very consistent with the method names, and also not consistent with how insert() and erase() is handle.

Semantics is just a matter of name and convention. trimmed(), for instance, is the name QString uses, obviously to distinguish it from mutating methods. C# and Java, however, simply use trim() since their string don't have mutating methods at all.

I have to see examples, where invar strings and these trim(), chop(), replace() and change() are mixed extensively. It's just logically unlike to be common cases.

There is nothing uncommon in this code:

routine f() => string {
    invar str = some_func()
    ...
    return str.replace(...) # => string
}

# or simply

invar str = some_func().replace(...)

Anywhere a string is obtained into an invar (or exists in the form of temporary rvalue), it isn't possible to use these methods without a wrapper. Of course, one can omit invar -- but a simple string operation ruining the whole idiom of immutability just seems inadequate.

That's what I mean by "disable string modification entirely by methods". We could even remove insert() and erase(), which are rarely used (by myself at least), and are not included in Java String as well. Of course, Java string is just string, it has separated type for byte array. But Dao string is intended as both text string and byte array. If we make Dao string immutable by methods, it will become much less useful as byte array. So I am less convinced it is a good idea.

First, insert() and erase() are rather unlikely to be used with byte arrays and, more importantly, they can be achieved via [slice]= operator. The need to use string as a byte array is rather uncommon by itself -- for binary data, for instance, array and sys::buffer are usually better alternatives. But even if string is more fitting for a particular case, having only invar methods for it still does not limit its functionality as a byte array at all.

Second, if all string methods are invar, insert() and erase() can be preserved -- remade into returning new strings, of course. There would be no inconsistency as all methods would work in a similar manner. That is how C# handles it.

daokoder commented 10 years ago

Still not convinced, but I am planning to do the following to make it easier to use those methods with invar strings:

Fix string[] so that [] on invar string will produce mutable string;
Change the return value of chop(), trim(), replace() and change() to a shallow copy of the self string.

With these changes, your code can simply rewritten as,

routine f() => string {
    invar str = some_func()
    ...
    return str[].replace(...) # => string
}
# or simply
invar str = some_func()[].replace(...)

Night-walker commented 10 years ago

Too ritualistic. I don't want to care about some magic [] when I just handle strings -- even wrappers seem better. The syntax should not get in the programmer's way like like that.

Why do you think that in-place methods are essential? Taking into account implicit sharing and invar, non-mutating methods definitely seem like the most optimal choice to me.

daokoder commented 10 years ago

Too ritualistic. I don't want to care about some magic [] when I just handle strings

Really? You think that I make this fix just for this issue, don't you? [...] is a standard operator of string, array and list etc. Currently [...] on an invar string results in an invar string, which is clearly not a reasonable design since the resulting string or substring is no longer part of the original string. An empty subindex [] normally means all the bytes of a string or items of a list etc., which effectively means copying the value if it is supported.

Why do you think that in-place methods are essential?

Well, not that I think in-place methods are essential, it's just that I am not comfortable with the inconsistency between method names and what they do. Though such inconsistency is not an issue if we make all the methods invar. But I just don't think anyone would find it reasonable that insert() will return a new string.

The only acceptable way to make methods such as replace() invar, is to make all methods invar and kick out methods such as resize(), clear(), insert() and erase(). A new method such as new( length: int ) may be necessary to make it convenient to create string of certain length. But I wonder if such changes is a bit too radical.

Night-walker commented 10 years ago

Really? You think that I make this fix just for this issue, don't you? [...] is a standard operator of string, array and list etc. Currently [...] on an invar string results in an invar string, which is clearly not a reasonable design since the resulting string or substring is no longer part of the original string. An empty subindex [] normally means all the bytes of a string or items of a list etc., which effectively means copying the value if it is supported.

I knew that [] copies the container. I have no objection for it to return a mutable string. But using it to work with invar strings is weird.

Well, not that I think in-place methods are essential, it's just that I am not comfortable with the inconsistency between method names and what they do. Though such inconsistency is not an issue if we make all the methods invar. But I just don't think anyone would find it reasonable that insert() will return a new string.

Exactly. Making all methods invar will remove any inconsistency. Also, cases like insert() returning a new string is a common sight in modern languages -- I don't think it can cause any confusion, particularly taking into account that the typing system will highlight any related errors early on.

The only acceptable way to make methods such as replace() invar, is to make all methods invar and kick out methods such as resize(), clear(), insert() and erase(). A new method such as new( length: int ) may be necessary to make it convenient to create string of certain length. But I wonder if such changes is a bit too radical.

I would rather remake insert() and erase() into invar -- for the reasons stated above. But resize() and clear() would certainly need to be removed.

I think you shouldn't put so much emphasis on names in prejudice of everything else. There are far more confusing things then an imprecise mood (lingvistic), on which I don't normally pay attention at all.

dumblob commented 10 years ago

I agree with @daokoder on everything he said. Actually, I have 3 main priorities in the following order (from the most important one): performance, low typing burden, appropriate naming.

I really dislike long method names and also don't like staying with some badly-named methods just because of their long-term historical usage across all existing languages. Please be free when designing a new thing.

It's also true, that I'd expect each method (disregarding the type/class it operates on) return something useful (if the return value doesn't make sense, just return self) to allow chaining. But this is a low-priority issue - performance is by far the most important for me.

daokoder commented 10 years ago

Also, cases like insert() returning a new string is a common sight in modern languages I would rather remake insert() and erase() into invar

Another reason I would like to remove insert() and erase() is that, they may not as useful as they appear to be. For example, among 6K lines of Dao code I have written for text processing, there is not a even single use of them, which render me to question their usefulness. Then I checked languages such Python, Scala and Java, I found they don't provide these methods either (Ruby does insert(), but no erase(). However, Ruby has over 100 string methods! So it is not a surprise that it supports insert()).

What's more, insert() and erase() can be easily done with sub-indexing operation. So, to sum it up, it is probably a good idea to remove these two functions.

Night-walker commented 10 years ago

I agree with @daokoder on everything he said. Actually, I have 3 main priorities in the following order (from the most important one): performance, low typing burden, appropriate naming.

invar string methods have all of these :)

What's more, insert() and erase() can be easily done with sub-indexing operation. So, to sum it up, it is probably a good idea to remove these two functions.

OK, that's fine. I can also admit that erase() and insert() are indeed not very useful and can be dropped. Then, I suppose, there is no obstacles left for making the remaining methods invar.

daokoder commented 10 years ago

Then, I suppose, there is no obstacles left for making the remaining methods invar.

Of course, it is already done. I am just trying to move std.string(), std.list(), std.map() and std.array() to the corresponding types as constructors.

daokoder commented 10 years ago

Done.

Night-walker commented 10 years ago

Good. Maybe it's even worth to disable passing by reference for self in string methods (if it's handled in some special way) as it seems no longer necessary. That would signify that strings are always passed by value anywhere, and would allow to dispose of invar for self in the methods. It's not very important, just an occasional thought.

daokoder commented 10 years ago

Yes, I have also been thinking about this too. Of course, we can disable passing by reference for self parameter of primitive types. But I just want to point out that passing by reference will still exist for invar variables, which shall be considered as part of implementation details (optimization).

Night-walker commented 10 years ago

But I just want to point out that passing by reference will still exist for invar variables, which shall be considered as part of implementation details (optimization).

Reasonable. Curious to see what impact can a single feature like invar make on the language :)

dumblob commented 10 years ago

I'm also curious - but mainly about performance :) @daokoder, could you please run some tests on those 6k lines of string handling with old Dao strings and these new strings? I really am eager to see some numbers, especially from real usage.

daokoder commented 10 years ago

@daokoder, could you please run some tests on those 6k lines of string handling with old Dao strings and these new strings? I really am eager to see some numbers, especially from real usage.

Don't get you expectation too high. This optimization affects only a tiny fraction of the operations, it is very hard to see significant impact.

dumblob commented 10 years ago

Yep, I saw the implementation changes and agree that the difference shouldn't be that significant as I initially expected.

daokoder commented 10 years ago

I think this issue can be closed.

daokoder / dao

Adjusting string interface #187