Open VBAndCs opened 5 years ago
Note that you could also do this with a regular expression, calling Match
and then iterating over the captures.
Your example code iterates over the source string multiple times. That's almost certainly not going to happen - it's far more likely we iterate through the source exactly once, something along these lines:
find start of next match:
copy after end of previous match to before next match
copy replacement of next match
update index to start next find after replacement
if no replacements found, return original string
...note that, unlike your example code, this also prevents replacements from being considered for subsequent matches (which can be the cause of subtle errors).
The big question then is whether we want to check the match list for duplicate or subsequence matches? I'm kinda thinking "no", for performance reasons if nothing else. I'd class it as a programmer error, not an "expected" error condition, though, which makes me want to squawk about it.
@Clockwork-Muse I am not saying this is the perfect implementation, but it is a fast solution for me and I gave it as a proof of concept. I first thought of building a new string with replacements, but this needs to sort the result positions first befor doing the replacment(rebuild), which is not a right this to do becuase changes the replacemnet order can give differnet results, if some matches overlap or if one replacement can create a mach for a next replacement! So, the idea here is to preserve the replacement order, which I done by using StringBuilder. Here we have two cases:
note that, unlike your example code, this also prevents replacements from being considered for subsequent matches (which can be the cause of subtle errors).
As I said before, this can be a desired feature, such in this case:
dim y = x.Replace((vbCrLf, vbLf), (vbCr, vbLf), (vbLf & vbLf, vbLf))
So, maybe we need a new overload: Which is more compact and more efficient. It can be implemented with this extension method:
Function Replace(Overlapping As Boolean ,ParamArray repPairs() As (repStr As String, repWithStr As String))
Overlapping can be considered False by default in the first overload.
Note: To achieve the overlapping, you should not advance the current position after performing a replacement. Instead you check next repStr starting from: 'Pos = Max(0, Pos-repStr.length)'
Note: It will be better if we implemented a multi replacement method in the StringBuilder class, then used it to implement the String multi replacement method.
I am going to guess the scenario is something like data cleansing, and you almost never want "chained" replacements, and if you do, you can stick with chaining string.Replace
or some custom approach.
Assuming you do not, then string.Replace
is not necessarily going to work correctly, as well as being less than ideally efficient (depending on the sizes of your data). Copying into a StringBuilder
does not help avoid the chained replacement problem, and you also have that initial copy. Also StringBuilder is optimized for Append
, it is not the ideal way to repeatedly Replace
(if you presized so it's not chunky, it will be less efficient even than string, and if it is chunky, each chunk with replacements will get copied, possibly to arbitrary size).
The ideal implemention (assuming you start with a string) would be to iterate over it (probably using IndexOf
with an advancing index, since that is highly optimized internally), append to a StringBuilder
(maybe actually ValueStringBuilder
internally), then convert that to a string. As an optimization, you could search for a match before bothering with the builder (if not certainly at least it should return the original if there are no matches, as string.Replace
does today). Possibly you could find one or more common prefixes if any in order to avoid so many IndexOf
calls to find the "nearest".
It would seem not unreasonable to me to add that to string
, in the same way we offer an overload of string.Split
that takes an array of separators since it's more efficient than calling it repeatedly for each.
However if you have the data in a large string, and want to obtain a large string, you may not be particularly concerned with efficiency in the first place. If you were, you would be manipulating buffers already. And if your strings are not so large, the allocations are not so large either. @VBAndCs how large are your buffers and why are you using strings?
@danmosemsft
I am creating what can be called a ZML compiler to generate cshtml pages from zml pages. ZML is a tag representaion of C# statements. So, teh zml string can be as large as the Razor page can be. I don't expect this to be a problem in this case, especially that I generate the cshtml files at the srartup of the app, and only compile pages that have changed. Of course this can take long time if the no of pages is large and their size is huge. It is always safer to optimize.
But, on athoer hand, It will be nice if the String class shorten the commonly used codes. For instance, I alwys wonder why I have to write: S.Replace("test","")
not S.Remove("test")
?
By the way, XElement behaves bad with xml nodes that contains literal strings beside xml elements, as it deletes all white spaces including line separators in this case, regardless of the reader ane writer settings!
So, I hade to replace each c# statement with a placeholder tag in the form
Dim cshtml = <cshtml>
<p>Test</p>
@foreach (var i in items)
{
<h>item: </h>
<div>
<p>i</p>
</div>
}
<p>End Test</p>
</cshtml>
Dim S = cshtml.ToString()
This is the resulting string :
<cshtml>
<p>Test</p>
@foreach (var i in items)
{
<h>item: </h><div><p>i</p></div>
}
<p>End Test</p></cshtml>
The format is damaged because the c# string!. If you omitted it, every thing will work fine! Why?
@VBAndCs
XElement behaves bad with xml nodes that contains literal strings beside xml elements
If you believe there is an issue in XElement, please open a separate issue.
Of course this can take long time if the no of pages is large and their size is huge. It is always safer to optimize.
Not necessarily, particularly when that change impacts everyone (even if they don't use it - binaries get larger, there is more to maintain, etc). The question is whether there are common cases where this proposed API is worth it. I think there probably is but it's not certainly true.
But, on athoer hand, It will be nice if the String class shorten the commonly used codes. For instance, I alwys wonder why I have to write: S.Replace("test","") not S.Remove("test")?
That would need API review, you can open a separate proposal (see format). Again one of the questions will be: is this common enough and useful enough to be worth the cost of implementation and maintenance, binary size, documentation, and potential user confusion (which should I use?). In this case I think probably not as it's trivial to replace with empty string. In the original case for this issue, the best argument is that we could do it more efficiently than the user can easily do.
@danmosemsft
I implemented the Replace function using MyLinkedList
public static string Replace(string s, params (string findStr, string repWith)[] replacements)
{
if (string.IsNullOrEmpty(s) || replacements == null || replacements.Length == 0)
return s;
var newStr = new MyLinkedList<char>(s);
int strLenght = s.Length;
foreach (var rep in replacements)
{
int start = 0;
while (start < strLenght)
{
var L = rep.findStr.Length;
if (L + start > strLenght)
break;
int pos = 0;
for (; pos < L; pos++)
{
if (rep.findStr[pos] != newStr[start + pos])
break;
}
if (pos == L) // A match found
{
newStr.Replace(start, L, rep.repWith);
strLenght = newStr.Count;
start += rep.repWith.Length -1;
}
start += 1;
}
}
return new string(newStr.ToArray());
}
using System;
using System.Collections.Generic;
class MyLinkedList<ListType> where ListType : IComparable
{
public class Cell
{
public ListType Value;
public Cell NextCell;
public Cell()
{
}
public Cell(ListType Value, Cell NextCell)
{
this.Value = Value;
this.NextCell = NextCell;
}
}
public MyLinkedList()
{
}
public MyLinkedList(IEnumerable<ListType> items)
{
Add(items);
}
private Cell FirstCell, LastCell;
public Cell Add(ListType Item)
{
Cell C = new Cell() { Value = Item };
if (pCount == 0)
// يجب إنشاء نسخة جديدة من خانة البداية
FirstCell = C;
else
LastCell.NextCell = C;
LastCell = C;
pCount += 1;
return C;
}
private int pCount = 0;
public int Count => pCount;
private int lastReachedCellIndex = -1;
private Cell lastReachedCell = null;
private Cell GetCell(int Index)
{
try
{
if (Index < 0 || Index >= pCount)
throw new Exception("رقم العنصر غير صالح");
Cell c; int i;
var step = Index - lastReachedCellIndex;
if (lastReachedCell != null && step>-1 && step < Index)
{
c = lastReachedCell;
i = lastReachedCellIndex;
}
else
{
c = FirstCell;
i = 0;
}
while (i != Index)
{
c = c.NextCell;
i += 1;
}
lastReachedCell = c;
lastReachedCellIndex = i;
return c;
}
finally
{
}
}
public ListType this[int Index]
{
get
{
try
{
return GetCell(Index).Value;
}
finally { }
}
set
{
try
{
if (Index == pCount)
Add(value);
else
GetCell(Index).Value = value;
}
finally { }
}
}
public void Add(IEnumerable <ListType> Items)
{
foreach (var item in Items)
Add(item);
}
public void Add(MyLinkedList<ListType> NewList)
{
if (pCount == 0)
{
FirstCell = NewList.FirstCell;
LastCell = NewList.LastCell;
}
else
{
LastCell.NextCell = NewList.FirstCell;
LastCell = NewList.LastCell;
}
pCount += NewList.pCount;
}
public void CopyList(MyLinkedList<ListType> NewList)
{
Cell C = NewList.FirstCell;
while (C != null)
{
this.Add(C.Value);
C = C.NextCell;
}
}
public MyLinkedList<ListType> SubList(int St)
{
try
{
var L = new MyLinkedList<ListType>();
L.FirstCell = GetCell(St);
L.LastCell = LastCell;
L.pCount = pCount - St;
return L;
}
finally { }
}
public void Insert(ListType Item, int Index)
{
try
{
if (Index == pCount)
{
lastReachedCell = Add(Item);
lastReachedCellIndex = Index;
}
else
{
Cell C = new Cell(Item, GetCell(Index));
if (Index == 0)
FirstCell = C;
else
GetCell(Index - 1).NextCell = C;
pCount += 1;
lastReachedCell = C;
lastReachedCellIndex = Index;
}
}
finally { }
}
public void InsertList(MyLinkedList<ListType> NewList, int Index)
{
try
{
if (Index == pCount)
Add(NewList);
else
{
if (Index == 0)
FirstCell = NewList.FirstCell;
else
GetCell(Index - 1).NextCell = NewList.FirstCell;
}
NewList.LastCell = GetCell(Index);
pCount += NewList.Count;
}
finally
{
lastReachedCell = null;
lastReachedCellIndex = -1;
}
}
public void CopyList(MyLinkedList<ListType> NewList, int Index)
{
Cell C = NewList.FirstCell;
while (C != null)
{
Insert(C.Value, Index);
C = C.NextCell;
}
}
public void InsertArray(ListType[] Items, int Index)
{
var L = new MyLinkedList<ListType>();
L.Add(Items);
InsertList(L, Index);
}
public void Clear()
{
FirstCell = null;
LastCell = null;
pCount = 0;
lastReachedCell = null;
lastReachedCellIndex = -1;
}
public void Replace(int atStart, int length, IEnumerable<ListType> items )
{
var c = GetCell(atStart);
var i = 0;
foreach (var item in items)
{
if (i < length)
{
c.Value = item;
c = c.NextCell;
}
else
Insert(item, atStart + i);
i++;
}
if (i < length)
RemoveRange(atStart+ i, length - i );
}
public void RemoveAt(int Index)
{
try
{
if (Index == 0) // حذف أول عنصر
if (pCount == 1)
{
// العنصر الأول هو العنصر الأخير وسيتم حذفه
FirstCell = null;
LastCell = null;
}
else // جعل العنصر الثاني هو العنصر الأول
FirstCell = FirstCell.NextCell;
else if (Index == pCount - 1)
{
// حذف آخر عنصر بجعل العنصر السابق له هو العنصر الأخير
LastCell = GetCell(Index - 1);
LastCell.NextCell = null;
}
else
{//حذف العنصر الحالي، بجعل العنصر السابق يشير إلى العنصر التالي
var prev = GetCell(Index - 1); // العنصر السابق
prev.NextCell = prev.NextCell.NextCell;
}
pCount -= 1;
}
finally
{
lastReachedCell = null;
lastReachedCellIndex = -1;
}
}
public void RemoveLastItem()
{
try
{
if (pCount == 1)
{
FirstCell = null;
LastCell = null;
lastReachedCell = null;
lastReachedCellIndex = -1;
}
else
{
LastCell = GetCell(pCount - 2);
LastCell.NextCell = null;
}
pCount -= 1;
}
finally { }
}
public int RemoveRange(int St, int DelCount)
{
try
{
if (St + DelCount > pCount)
DelCount = pCount - St;
int nxtId = St + DelCount;
if (St == 0)
if (pCount == DelCount)
{
FirstCell = null;
LastCell = null;
}
else
FirstCell = GetCell(DelCount);
else if (nxtId == pCount)
{
LastCell = GetCell(St - 1);
LastCell.NextCell = null;
}
else
GetCell(St - 1).NextCell = GetCell(nxtId);
pCount -= DelCount;
return DelCount;
}
finally
{
lastReachedCell = null;
lastReachedCellIndex = -1;
}
}
private Cell pCurCell;
public ListType CurItem
{
get
{
try
{
if (pCurCell == null)
throw new Exception("لا توجد خانة حالية");
return pCurCell.Value;
}
finally { }
}
}
public bool MoveFirst()
{
if (FirstCell == null) return false;
pCurCell = FirstCell;
return true;
}
public bool MoveNext()
{
if (pCurCell == null)
throw new Exception("لا توجد خانة حالية");
if (pCurCell.NextCell == null) return false;
pCurCell = pCurCell.NextCell;
return true;
}
public bool MoveLast()
{
if (LastCell == null) return false;
pCurCell = LastCell;
return true;
}
public bool MoveTo(int Index)
{
try
{
pCurCell = GetCell(Index);
return true;
}
catch
{
return false;
}
}
public ListType FirstItem
{
get
{
if (FirstCell == null)
throw new Exception("القائمة فارغة");
return FirstCell.Value;
}
set
{
if (FirstCell == null)
throw new Exception("القائمة فارغة");
FirstCell.Value = value;
}
}
public ListType LastItem
{
get
{
if (LastCell == null)
throw new Exception("القائمة فارغة");
return LastCell.Value;
}
set
{
if (LastCell == null)
throw new Exception("القائمة فارغة");
LastCell.Value = value;
}
}
public ListType[] ToArray(int St = 0, int En = -1)
{
try
{
if (En == -1 || En >= pCount)
En = pCount - 1;
ListType[] A = new ListType[En - St + 1];
Cell C = GetCell(St);
for (int i = 0; i <= En - St; i++)
{
A[i] = C.Value;
C = C.NextCell;
}
return A;
}
finally { }
}
}
@VBAndCs what is the "best" solution of course depends on the use case. Your approach looks like it would allocate a lot (a new reference type for every char) and I think would be quite slow in general.
I am confident that a reasonable implementation could be created, there is no need to create one unless and until the API is approved. We have a formal process for that. You would want to edit your top post to follow the format and then discussion can happen.
So, we are going back to square 1: Using String.Replace is the best solution as it uses the SB to build the new string. There is no easy way to do multiple replacements in one iteration. I tried it and give me unwanted results, because I expect to apply the first replace first then the second. The firs t replace can affect the result of the second one, not only by providing new matches, but also by eliminating possible ones! For ex: "abcabc".Replace("ab", "x", "ca", "y") It should yield "xcxc". If the two replacements are applied simultaneously at each pos, the result would be: "xybc"! So, this is best implementation:
Function Replace(s As String, ParamArray repPairs() As (repStr As String, repWithStr As String)) As String
For Each x In repPairs
s = s.Replace(x.repStr, x.repWithStr)
Next
Return s
End Function
It just offers a new compact syntax for the Replace method, with no performance improvements!
...I feel like what you really want to make for your code is a separate parser, then build (and write) the HTML DOM. Which would save you trying to do string replacements (which are not only slow, but can bite you in these kinds of situations, as you're discovering), among other things.
Otherwise, you might have better (overall) results if you did a regex match, then iterated over those results (being careful to look for overlapping matches).
@Clockwork-Muse Maybe you are right. I started this as a proof of concept, but it is getting more complicated as I move forward. I will stuck to the my ZML proof of concept until I cover all necessary code tags, and make a full zml version of the MS eShopOnWeb app, then if the community likes the idea of using xml razor instead of C# and VB.NET Razor, we can create a reliable zml implementation. Note: Things works fine for me right now, but I am dealing with small no of pages with small sizes.
@VBAndCs to make progress on this issue, we will need the top post edited to match the API review guidelines, as linked above.
After this discussion and trials, String.Replace seems the best choice for now, so I am closing this. Thanks.
...I mean, even if it doesn't work for your use case, it's probably something we should look at.
Ok, But I made a fast test and it turned out that the successive calls to the String.Replace (which uses SB to build the result) is faster than using successive StringBuilder.Replace or LinkiedList
I wrote this code:
Using successive replacements creates many visions of the string, which is less efficient, so I wrote it using tuples like this:
Which is more compact and more efficient. It can be implemented with this extension method:
I suggest to add this Replace overload to the string class.