Closed Night-walker closed 10 years ago
Well, I'm really not convinced that always making a copy is a good idea. Actually quite the opposite. If I imagine how such an internationalized array of characters look like in latin-based languages, the non-ASCII characters take up around 30% of them. Therefore we could even introduce "lazy" string copying in implementation of convert()
replace()
expand()
such that these will work in-situ
until they come across non-ASCII character (i.e. 2+ bytes long one) when a new string will be allocated and the already-walked-through characters copyied to it.
All the 3 cases @Night-walker mentioned doesn't seem that common to me, so I'm not convinced this change to "always make a copy" approach will be useful (and now I'm talking primarily about performance).
But yes, we could provide another String
type with proposed invar-friendliness to match needs of purists :) who don't care that much about performance and simplicity.
Again, I don't quite understand you :) convert()
, replace()
, expand()
and change()
already work with an internal string buffer. A copy if you like, though it isn't a correct word. Adjusting the interface does not lead to "making a copy", it only changes the data flow so as to make it more streamlined.
I didn't look into the code..., sorry. Now I did :). Could you please elaborate on how the change could look like e.g. in DString_FindReplace
? Anyway, I agree with DaoProcess_PutReference
removal if possible.
Could you please elaborate on how the change could look like e.g. in DString_FindReplace?
DString_FindReplace()
currently may
The modified function would simply create a new string and return it. The average-case performance might even become better, depending on how often detaching may happen with the present implementation.
It seems most of languages and frameworks use non-modifying replace()
, trim()
, etc. Even Qt does it, so performance is certainly not a problem. It's just a matter of the interface: I find that methods which return new string are more "clean" and flexible. And utilizing them like str = str.replace()
when you want an in-place change is not a problem, unlike using the current methods for invar
-intensive code.
Actually expand()
already returns a new string (convert()
was also intended so, as shown by the invar
in the signature, but looking into the code, I found I forgot to update the code itself). And replace()
and change()
do not return a string, instead, they return a integer for the number of replaces and changes that have been made. So no confusion here.
But method such as chop()
and trim()
do return a reference of itself, the original rationale for this is that this will allow some methods to be chained in consecutive calls. But this is really not important. So I agree to change this. But returning a new copy of the string does not seem a convincingly good idea, because the names of these do imply modification of the string itself, just like erase()
, replace()
and change()
etc. So I will change them to return nothing (none
by default actually).
1) Compatibility with
invar
. 2) Better flexibility. 3) Elimination of references for primitive data.
You have mentioned very tempting argument for the changes you are proposing. I agree DaoProcess_PutReference()
must go. But unless we remove clear()
, and change insert()
and erase()
, change/remove resize()
etc. as well, your changes will still leave some inconsistency. I have considered something similar before, I was just not very comfortable to such big changes. So now the question is, is it really better to disable string modification entirely by methods?
And replace() and change() do not return a string, instead, they return a integer for the number of replaces and changes that have been made.
Which is rather inflexible and ineffective:
routine f(str: string) => string
str.replace(...) # detach, full copy and then replacing -- absurd
return str
}
I don't need the number of substitutions -- it is rarely a useful information. I need a convenient and effective way to do an invar
replace, which Dao does not provide and which cannot be implemented in Dao itself. Making replace()
return result in a new string is a much more flexible solution, and much more efficient if you care about invar
.
But method such as chop() and trim() do return a reference of itself, the original rationale for this is that this will allow some methods to be chained in consecutive calls. But this is really not important. So I agree to change this. But returning a new copy of the string does not seem a convincingly good idea, because the names of these do imply modification of the string itself, just like erase(), replace() and change() etc. So I will change them to return nothing (none by default actually).
That would be even worse because of lack of rvalue. Why not simply rename them as trimmed()
and chopped()
? As I said, trimming and chopping into a new string should not visibly affect performance taking into account how such operations are usually used.
You have mentioned very tempting argument for the changes you are proposing. I agree DaoProcess_PutReference() must go. But unless we remove clear(), and change insert() and erase(), change/remove resize() etc. as well, your changes will still leave some inconsistency. I have considered something similar before, I was just not very comfortable to such big changes. So now the question is, is it really better to disable string modification entirely by methods?
There is no need to change insert()
, erase()
, etc. -- they don't hinder the use of invar
strings. These operations can be done with slicing plus concatenation, if one wants it that way, and it's not very cumbersome. Also, it's easy to create invar
wrappers for such methods, and they will be effective -- unlike for the current replace()
or change()
, for instance.
I suggest taking a close look at Qt. I highly appreciate how simple and convenient it handles various things -- and effectively, no doubts. It happens that QString
provides exactly the interface I proposed, even despite Qt is definitely not build around functional programming or immutability or anything. And there is nothing inconsistent there.
I have a different opinion especially on the trim
and chop
methods. In most cases I've seen they're used (and I do it so as well) on large strings (e.g. read whole file/stdin/... and immediately run trim/chop).
Second, QString
is not so fast as other string libraries (especially those in C) and therefore I don't consider it as an argument for a good invar interface.
Third, I have no idea why you in most cases where you change/replace something in a string need the original string as well? Any examples?
Fourth, I agree that if I want to substitute something in a string, I'll definitely use regex methods as I want to work with characters and not bytes.
I have a different opinion especially on the trim and chop methods. In most cases I've seen they're used (and I do it so as well) on large strings (e.g. read whole file/stdin/... and immediately run trim/chop).
Perhaps, though in-place trimming is only fast when no leading whitespace is present. Most modern languages seem to provide only non-mutating string trimming/chopping. Apparently, performance regression is not deemed to be an issue.
Second, QString is not so fast as other string libraries (especially those in C) and therefore I don't consider it as an argument for a good invar interface.
Really? Dao is around two orders of magnitude slower then C, and its runtime performance is mostly affected by various stuff around the bytecode. Most other things which don't deal with file access or so has rather low impact on the overall performance.
Moreover, comparing QString
to some simplistic third-party C strings which don't even use implicit sharing is rather questionable. I can bet that QString
is faster then the internal DString
used in Dao due to the use of atomic operations, UTF-16 and sometimes more refined algorithms. But it's not really that important.
Third, I have no idea why you in most cases where you change/replace something in a string need the original string as well? Any examples?
It's not really about keeping the original string. I would always use invar
as much as possible in a sufficiently large code, and current replace()
and change()
probably create the most drag with strings here. And I often find the need to chain or compose string processing, as it's simply convenient.
Fourth, I agree that if I want to substitute something in a string, I'll definitely use regex methods as I want to work with characters and not bytes.
Not at all. If you don't wan't to substitute a pattern, there is absolutely no reason in using regex methods. Characters and bytes don't matter here, I don't understand why you persist to believe that you have to "work with characters" in some special way.
I can understand why you want replace()
to return a new string. For replace()
, there may be the cases where you want to replace the same (possibly invar
) string in different ways to generate different strings. But trim()
and chop()
are different, why would one want to pass a invar
string to these methods? Also there are only a few ways to trim and two ways chop a string. Calling them on the same string to produce new strings does not seem reasonable to me. For me, trim()
and chop()
are the same type of methods as erase()
and insert()
, so I don't see reasons to handle them differently. In some way, replace()
and change()
are also the same type of methods as erase()
and insert()
. I am still not convinced that they should be handled differently.
Of course, one solution is to provide overloaded methods for invar
strings to return new strings. Then another question is, do we really need more methods?
The problem with all those methods is that they are not well-compatible with invar
and are not flexible. As I showed, their use often leads to needless detaching which may essentially eliminate any performance advantage over in-place operations -- even if you don't use invar
at all. And they don't have any real convenience benefits either.
On the contrary, emulating an in-place changes with invar
methods is trivial and obviously does not have negative effect on the performance.
To summarize, mutating methods comparing to their invar
alternatives are:
replace()
, change()
) ortrim()
, convert()
)invar
-intensive codeI think invar
methods would generally be more flexible and robust, particularly in case of replace()
and change()
. Perhaps we could provide both mutating and invar
methods for certain cases, but the latter should definitely be the preferred choice. Virtually all modern languages default to non-mutating string handling, even for insert()
and erase()
.
Maybe we could provide replace()
and replaced()
, trim()
and trimmed()
and etc. If that's too much, at least replace()
and change()
could be turned into invar
methods, maybe adding trimmed()
and chopped()
as well.
However, I'm starting to think that making all string methods invar
is not such a radical idea after all, looking at other languages. It may be the most simple, consistent and balanced option among the others, and I prefer it at the moment.
@Night-walker, those are strong words and I'm going to deny commenting on this issue until we find few good string benchmarks and run them against the current string methods implementation and the proposed "invar & copy" implementation.
Btw take a look at http://www.ucw.cz/libucw/ - this is one of best C libraries I've ever seen to handle strings and also much more high-level stuff.
@Night-walker, those are strong words and I'm going to deny commenting on this issue until we find few good string benchmarks and run them against the current string methods implementation and the proposed "invar & copy" implementation.
I suppose the fact that most high-level languages emphasize non-mutating string handling speaks for itself. Unless they employ some tricky detection of temporary values (doubtful, but can't say for sure), performance is apparently not considered to be of concern here.
Still, if you care about squeezing as much speed as possible, the performance of invar
methods vs non-invar
ones basically depends on aliasing. More aliasing -- more needless detaching for non-invar
methods. Less aliasing -- more needless copying for invar
methods. This is not easy to benchmark, as the results will depend on the organization of code rather then on its logic.
slightly faster when no aliasing takes place (only tail-trimming and chopping may have a significant bonus here) definitely slower otherwise (up to two times with most methods due to detaching!)
These in place operations are not about performance, but about semantics. As I argued in my previous reply, trimming, replacing or changing normally means processing the self string itself. Making them to support invar
string and return new strings just don't sounds very consistent with the method names, and also not consistent with how insert()
and erase()
is handle.
Also improvements can be made to avoid unnecessary detaching.
not usable as rvalues (replace(), change()) or
The returned value can be changed to return a shallow copy of the self string. But I really don't see it as a serious issue.
use hidden, non-intuitive referencing (trim(), convert())
Referencing has been disable.
require wrappers for invar-intensive code
I have to see examples, where invar
strings and these trim()
, chop()
, replace()
and change()
are mixed extensively. It's just logically unlike to be common cases.
However, I'm starting to think that making all string methods invar is not such a radical idea after all
That's what I mean by "disable string modification entirely by methods". We could even remove insert()
and erase()
, which are rarely used (by myself at least), and are not included in Java String as well. Of course, Java string is just string, it has separated type for byte array. But Dao string is intended as both text string and byte array. If we make Dao string immutable by methods, it will become much less useful as byte array. So I am less convinced it is a good idea.
These in place operations are not about performance, but about semantics. As I argued in my previous reply, trimming, replacing or changing normally means processing the self string itself. Making them to support invar string and return new strings just don't sounds very consistent with the method names, and also not consistent with how insert() and erase() is handle.
Semantics is just a matter of name and convention. trimmed()
, for instance, is the name QString
uses, obviously to distinguish it from mutating methods. C# and Java, however, simply use trim()
since their string don't have mutating methods at all.
I have to see examples, where invar strings and these trim(), chop(), replace() and change() are mixed extensively. It's just logically unlike to be common cases.
There is nothing uncommon in this code:
routine f() => string {
invar str = some_func()
...
return str.replace(...) # => string
}
# or simply
invar str = some_func().replace(...)
Anywhere a string is obtained into an invar
(or exists in the form of temporary rvalue), it isn't possible to use these methods without a wrapper. Of course, one can omit invar
-- but a simple string operation ruining the whole idiom of immutability just seems inadequate.
That's what I mean by "disable string modification entirely by methods". We could even remove insert() and erase(), which are rarely used (by myself at least), and are not included in Java String as well. Of course, Java string is just string, it has separated type for byte array. But Dao string is intended as both text string and byte array. If we make Dao string immutable by methods, it will become much less useful as byte array. So I am less convinced it is a good idea.
First, insert()
and erase()
are rather unlikely to be used with byte arrays and, more importantly, they can be achieved via [slice]=
operator. The need to use string as a byte array is rather uncommon by itself -- for binary data, for instance, array
and sys::buffer
are usually better alternatives. But even if string is more fitting for a particular case, having only invar
methods for it still does not limit its functionality as a byte array at all.
Second, if all string methods are invar
, insert()
and erase()
can be preserved -- remade into returning new strings, of course. There would be no inconsistency as all methods would work in a similar manner. That is how C# handles it.
Still not convinced, but I am planning to do the following to make it easier to use those methods with invar strings:
string[]
so that []
on invar string will produce mutable string;chop()
, trim()
, replace()
and change()
to a shallow copy of the self string.With these changes, your code can simply rewritten as,
routine f() => string {
invar str = some_func()
...
return str[].replace(...) # => string
}
# or simply
invar str = some_func()[].replace(...)
Too ritualistic. I don't want to care about some magic []
when I just handle strings -- even wrappers seem better. The syntax should not get in the programmer's way like like that.
Why do you think that in-place methods are essential? Taking into account implicit sharing and invar
, non-mutating methods definitely seem like the most optimal choice to me.
Too ritualistic. I don't want to care about some magic [] when I just handle strings
Really? You think that I make this fix just for this issue, don't you? [...]
is a standard operator of string, array and list etc. Currently [...]
on an invar string results in an invar string, which is clearly not a reasonable design since the resulting string or substring is no longer part of the original string. An empty subindex []
normally means all the bytes of a string or items of a list etc., which effectively means copying the value if it is supported.
Why do you think that in-place methods are essential?
Well, not that I think in-place methods are essential, it's just that I am not comfortable with the inconsistency between method names and what they do. Though such inconsistency is not an issue if we make all the methods invar
. But I just don't think anyone would find it reasonable that insert()
will return a new string.
The only acceptable way to make methods such as replace()
invar, is to make all methods invar and kick out methods such as resize()
, clear()
, insert()
and erase()
. A new method such as new( length: int )
may be necessary to make it convenient to create string of certain length. But I wonder if such changes is a bit too radical.
Really? You think that I make this fix just for this issue, don't you? [...] is a standard operator of string, array and list etc. Currently [...] on an invar string results in an invar string, which is clearly not a reasonable design since the resulting string or substring is no longer part of the original string. An empty subindex [] normally means all the bytes of a string or items of a list etc., which effectively means copying the value if it is supported.
I knew that []
copies the container. I have no objection for it to return a mutable string. But using it to work with invar
strings is weird.
Well, not that I think in-place methods are essential, it's just that I am not comfortable with the inconsistency between method names and what they do. Though such inconsistency is not an issue if we make all the methods invar. But I just don't think anyone would find it reasonable that insert() will return a new string.
Exactly. Making all methods invar
will remove any inconsistency. Also, cases like insert()
returning a new string is a common sight in modern languages -- I don't think it can cause any confusion, particularly taking into account that the typing system will highlight any related errors early on.
The only acceptable way to make methods such as replace() invar, is to make all methods invar and kick out methods such as resize(), clear(), insert() and erase(). A new method such as new( length: int ) may be necessary to make it convenient to create string of certain length. But I wonder if such changes is a bit too radical.
I would rather remake insert()
and erase()
into invar
-- for the reasons stated above. But resize()
and clear()
would certainly need to be removed.
I think you shouldn't put so much emphasis on names in prejudice of everything else. There are far more confusing things then an imprecise mood (lingvistic), on which I don't normally pay attention at all.
I agree with @daokoder on everything he said. Actually, I have 3 main priorities in the following order (from the most important one): performance, low typing burden, appropriate naming.
I really dislike long method names and also don't like staying with some badly-named methods just because of their long-term historical usage across all existing languages. Please be free when designing a new thing.
It's also true, that I'd expect each method (disregarding the type/class it operates on) return something useful (if the return value doesn't make sense, just return self) to allow chaining. But this is a low-priority issue - performance is by far the most important for me.
Also, cases like
insert()
returning a new string is a common sight in modern languages I would rather remakeinsert()
anderase()
into invar
Another reason I would like to remove insert()
and erase()
is that, they may not as useful as they appear to be. For example, among 6K lines of Dao code I have written for text processing, there is not a even single use of them, which render me to question their usefulness. Then I checked languages such Python, Scala and Java, I found they don't provide these methods either (Ruby does insert()
, but no erase()
. However, Ruby has over 100 string methods! So it is not a surprise that it supports insert()
).
What's more, insert()
and erase()
can be easily done with sub-indexing operation. So, to sum it up, it is probably a good idea to remove these two functions.
I agree with @daokoder on everything he said. Actually, I have 3 main priorities in the following order (from the most important one): performance, low typing burden, appropriate naming.
invar
string methods have all of these :)
What's more, insert() and erase() can be easily done with sub-indexing operation. So, to sum it up, it is probably a good idea to remove these two functions.
OK, that's fine. I can also admit that erase()
and insert()
are indeed not very useful and can be dropped. Then, I suppose, there is no obstacles left for making the remaining methods invar
.
Then, I suppose, there is no obstacles left for making the remaining methods
invar
.
Of course, it is already done. I am just trying to move std.string()
, std.list()
, std.map()
and std.array()
to the corresponding types as constructors.
Done.
Good. Maybe it's even worth to disable passing by reference for self
in string methods (if it's handled in some special way) as it seems no longer necessary. That would signify that strings are always passed by value anywhere, and would allow to dispose of invar
for self
in the methods. It's not very important, just an occasional thought.
Yes, I have also been thinking about this too. Of course, we can disable passing by reference for self
parameter of primitive types. But I just want to point out that passing by reference will still exist for invar
variables, which shall be considered as part of implementation details (optimization).
But I just want to point out that passing by reference will still exist for invar variables, which shall be considered as part of implementation details (optimization).
Reasonable. Curious to see what impact can a single feature like invar
make on the language :)
I'm also curious - but mainly about performance :) @daokoder, could you please run some tests on those 6k lines of string handling with old Dao strings and these new strings? I really am eager to see some numbers, especially from real usage.
@daokoder, could you please run some tests on those 6k lines of string handling with old Dao strings and these new strings? I really am eager to see some numbers, especially from real usage.
Don't get you expectation too high. This optimization affects only a tiny fraction of the operations, it is very hard to see significant impact.
Yep, I saw the implementation changes and agree that the difference shouldn't be that significant as I initially expected.
I think this issue can be closed.
I think
string
interface should change a bit so as to adjust to the newinvar
feature. The methods likeconvert()
,replace()
,expand()
andchange()
should not modify the original string, returning a new string instead. They all already use a temporary string internally (don't miss my comment at #174 regarding lowercase/uppercase conversion), so this won't change their efficiency. Eventrim()
andchop()
could be changed that way, simply because these methods are usually not used on large strings, so the possible performance reduction is certainly negligible.As for the interface itself, I see the following advantages.
1) Compatibility with
invar
. That's essential to avoid breaking functional code style for such trivial operations as replacing, as well as to eliminate the need to writeinvar
wrappers for non-invar
methods.2) Better flexibility. I already stumbled upon the inconvenience of returning the result of
replace()
:var new = old; new.replace(...); return new
. Apparently, doing an in-place replacing with aninvar
replace()
is still trivial:str = str.replace()
. The number of occurrences replaced is a rather useless information most of the time, so I would definitely prefer aninvar
replace()
.3) Elimination of references for primitive data.
DaoProcess_PutReference()
is basically a small hack in the Dao typing system, and I'd rather discourage its use or exclude it completely from the VM. When you see af() => string
, you should be sure that it returns a "normal" string, rather then a reference to something somewhere. This is far more important then subtle, virtually insignificant increase in performance of methods liketrim()
andchop()
.