dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.25k stars 4.73k forks source link

String.Replace overload to do multiple replacements with an array of tuples #29249

Open VBAndCs opened 5 years ago

VBAndCs commented 5 years ago

I wrote this code:

Return xml.ToString(SaveOptions.DisableFormatting).
            Replace("<zml>", "").Replace("</zml>", "").
            Replace("__ltn__", "<").Replace("__gtn__", ">").
            Replace("__amp__", "&")

Using successive replacements creates many visions of the string, which is less efficient, so I wrote it using tuples like this:

Return xml.ToString(SaveOptions.DisableFormatting).
            Replace(
                     ("<zml>", ""), ("</zml>", ""),
                     ("__ltn__", "<"), ("__gtn__", ">"),
                     ("__amp__", "&")
                   )

Which is more compact and more efficient. It can be implemented with this extension method:

<Extension>
Function Replace(s As String, ParamArray repPairs() As (repStr As String, repWithStr As String))
        Dim sb As New Text.StringBuilder(s)
        For Each x In repPairs
            sb.Replace(x.repStr, x.repWithStr)
        Next
        Return sb.ToString()
End Function

I suggest to add this Replace overload to the string class.

Clockwork-Muse commented 5 years ago

Note that you could also do this with a regular expression, calling Match and then iterating over the captures.

Clockwork-Muse commented 5 years ago

Your example code iterates over the source string multiple times. That's almost certainly not going to happen - it's far more likely we iterate through the source exactly once, something along these lines:

find start of next match:
    copy after end of previous match to before next match
    copy replacement of next match
    update index to start next find after replacement
if no replacements found, return original string

...note that, unlike your example code, this also prevents replacements from being considered for subsequent matches (which can be the cause of subtle errors).

The big question then is whether we want to check the match list for duplicate or subsequence matches? I'm kinda thinking "no", for performance reasons if nothing else. I'd class it as a programmer error, not an "expected" error condition, though, which makes me want to squawk about it.

VBAndCs commented 5 years ago

@Clockwork-Muse I am not saying this is the perfect implementation, but it is a fast solution for me and I gave it as a proof of concept. I first thought of building a new string with replacements, but this needs to sort the result positions first befor doing the replacment(rebuild), which is not a right this to do becuase changes the replacemnet order can give differnet results, if some matches overlap or if one replacement can create a mach for a next replacement! So, the idea here is to preserve the replacement order, which I done by using StringBuilder. Here we have two cases:

  1. For short strings, the StringBuilder performance will not be better than the Stringm but we strill have an advantage of using a compact "Replace" syntax.
  2. For long strings, the StringBuilder performance is better, becuase the replacements will affect only small no of internal chuncks, unless in worest case, every chunck has one or more matches!
VBAndCs commented 5 years ago

note that, unlike your example code, this also prevents replacements from being considered for subsequent matches (which can be the cause of subtle errors).

As I said before, this can be a desired feature, such in this case: dim y = x.Replace((vbCrLf, vbLf), (vbCr, vbLf), (vbLf & vbLf, vbLf))

So, maybe we need a new overload: Which is more compact and more efficient. It can be implemented with this extension method:

Function Replace(Overlapping As Boolean ,ParamArray repPairs() As (repStr As String, repWithStr As String))

Overlapping can be considered False by default in the first overload.

Note: To achieve the overlapping, you should not advance the current position after performing a replacement. Instead you check next repStr starting from: 'Pos = Max(0, Pos-repStr.length)'

VBAndCs commented 5 years ago

Note: It will be better if we implemented a multi replacement method in the StringBuilder class, then used it to implement the String multi replacement method.

danmoseley commented 5 years ago

I am going to guess the scenario is something like data cleansing, and you almost never want "chained" replacements, and if you do, you can stick with chaining string.Replace or some custom approach.

Assuming you do not, then string.Replace is not necessarily going to work correctly, as well as being less than ideally efficient (depending on the sizes of your data). Copying into a StringBuilder does not help avoid the chained replacement problem, and you also have that initial copy. Also StringBuilder is optimized for Append, it is not the ideal way to repeatedly Replace (if you presized so it's not chunky, it will be less efficient even than string, and if it is chunky, each chunk with replacements will get copied, possibly to arbitrary size).

The ideal implemention (assuming you start with a string) would be to iterate over it (probably using IndexOf with an advancing index, since that is highly optimized internally), append to a StringBuilder (maybe actually ValueStringBuilder internally), then convert that to a string. As an optimization, you could search for a match before bothering with the builder (if not certainly at least it should return the original if there are no matches, as string.Replace does today). Possibly you could find one or more common prefixes if any in order to avoid so many IndexOf calls to find the "nearest".

It would seem not unreasonable to me to add that to string, in the same way we offer an overload of string.Split that takes an array of separators since it's more efficient than calling it repeatedly for each.

However if you have the data in a large string, and want to obtain a large string, you may not be particularly concerned with efficiency in the first place. If you were, you would be manipulating buffers already. And if your strings are not so large, the allocations are not so large either. @VBAndCs how large are your buffers and why are you using strings?

VBAndCs commented 5 years ago

@danmosemsft I am creating what can be called a ZML compiler to generate cshtml pages from zml pages. ZML is a tag representaion of C# statements. So, teh zml string can be as large as the Razor page can be. I don't expect this to be a problem in this case, especially that I generate the cshtml files at the srartup of the app, and only compile pages that have changed. Of course this can take long time if the no of pages is large and their size is huge. It is always safer to optimize. But, on athoer hand, It will be nice if the String class shorten the commonly used codes. For instance, I alwys wonder why I have to write: S.Replace("test","") not S.Remove("test")?

VBAndCs commented 5 years ago

By the way, XElement behaves bad with xml nodes that contains literal strings beside xml elements, as it deletes all white spaces including line separators in this case, regardless of the reader ane writer settings! So, I hade to replace each c# statement with a placeholder tag in the form where N is an integer. At the end, replace all the tags with there values. This can be a lot of replacements to do, and in this case they are done one replacement at a time in a for loop! I am sure this will cause problems later, but I am busy with completing ZML tags now. Is ther any thing you can do to solve the XElemnt issue? Try test it with this:

Dim cshtml = <cshtml>
                             <p>Test</p>
                             @foreach (var i in items)
                             {
                             <h>item: </h>
                             <div>
                                 <p>i</p>
                             </div>

                             }
                                 <p>End Test</p>
                         </cshtml>

            Dim S = cshtml.ToString()

This is the resulting string :

<cshtml>
  <p>Test</p>
                             @foreach (var i in items)
                             {
                             <h>item: </h><div><p>i</p></div>

                             }
                                 <p>End Test</p></cshtml>

The format is damaged because the c# string!. If you omitted it, every thing will work fine! Why?

danmoseley commented 5 years ago

@VBAndCs

XElement behaves bad with xml nodes that contains literal strings beside xml elements

If you believe there is an issue in XElement, please open a separate issue.

Of course this can take long time if the no of pages is large and their size is huge. It is always safer to optimize.

Not necessarily, particularly when that change impacts everyone (even if they don't use it - binaries get larger, there is more to maintain, etc). The question is whether there are common cases where this proposed API is worth it. I think there probably is but it's not certainly true.

But, on athoer hand, It will be nice if the String class shorten the commonly used codes. For instance, I alwys wonder why I have to write: S.Replace("test","") not S.Remove("test")?

That would need API review, you can open a separate proposal (see format). Again one of the questions will be: is this common enough and useful enough to be worth the cost of implementation and maintenance, binary size, documentation, and potential user confusion (which should I use?). In this case I think probably not as it's trivial to replace with empty string. In the original case for this issue, the best argument is that we could do it more efficiently than the user can easily do.

VBAndCs commented 5 years ago

@danmosemsft I implemented the Replace function using MyLinkedList. I optimized this linked for successive calls to the indexer, and added a replace method to replace a range of items with another range. I think this is the best solution. The linked list cell holds only a ref to the next cell beside the char value. This list is single linked, which has a down side: it needs to traverse the list from start obnce for each replacement to do. Otherwise, moving to the next index if optimized.

        public static string Replace(string s, params (string findStr, string repWith)[] replacements)
        {
            if (string.IsNullOrEmpty(s) || replacements == null || replacements.Length == 0)
                return s;

            var newStr = new MyLinkedList<char>(s);
            int strLenght = s.Length;

            foreach (var rep in replacements)
            {
                int start = 0;
                while (start < strLenght)
                {
                    var L = rep.findStr.Length;
                    if (L + start > strLenght)
                        break;

                    int pos = 0;
                    for (; pos < L; pos++)
                    {
                        if (rep.findStr[pos] != newStr[start + pos])
                            break;
                    }

                    if (pos == L) // A match found
                    {
                        newStr.Replace(start, L, rep.repWith);
                        strLenght = newStr.Count;
                        start += rep.repWith.Length -1;
                    }
                    start += 1;
                }
            }

            return new string(newStr.ToArray());
        }
using System;
using System.Collections.Generic;

class MyLinkedList<ListType> where ListType : IComparable
{
    public class Cell
    {
        public ListType Value;
        public Cell NextCell;
        public Cell()
        {

        }

        public Cell(ListType Value, Cell NextCell)
        {
            this.Value = Value;
            this.NextCell = NextCell;
        }

    }

    public MyLinkedList()
    {
    }

    public MyLinkedList(IEnumerable<ListType> items)
    {
        Add(items);
    }

    private Cell FirstCell, LastCell;

    public Cell Add(ListType Item)
    {
        Cell C = new Cell() { Value = Item };
        if (pCount == 0)
            // يجب إنشاء نسخة جديدة من خانة البداية
            FirstCell = C;
        else
            LastCell.NextCell = C;

        LastCell = C;
        pCount += 1;
        return C;
    }

    private int pCount = 0;
    public int Count => pCount;

    private int lastReachedCellIndex = -1;
    private Cell lastReachedCell = null;

    private Cell GetCell(int Index)
    {
        try
        {
            if (Index < 0 || Index >= pCount)
                throw new Exception("رقم العنصر غير صالح");

            Cell c; int i;
            var step = Index - lastReachedCellIndex;
            if (lastReachedCell != null && step>-1 && step < Index)
            {
                c = lastReachedCell;
                i = lastReachedCellIndex;
            }
            else
            {
                c = FirstCell;
                i = 0;
            }

            while (i != Index)
            {
                c = c.NextCell;
                i += 1;
            }
            lastReachedCell = c;
            lastReachedCellIndex = i;
            return c;
        }
        finally
        {
        }
    }

    public ListType this[int Index]
    {
        get
        {
            try
            {
                return GetCell(Index).Value;
            }
            finally { }
        }
        set
        {
            try
            {
                if (Index == pCount)
                    Add(value);
                else
                    GetCell(Index).Value = value;
            }
            finally { }
        }
    }

    public void Add(IEnumerable <ListType> Items)
    {
        foreach (var item in Items)
            Add(item);
    }

    public void Add(MyLinkedList<ListType> NewList)
    {
        if (pCount == 0)
        {
            FirstCell = NewList.FirstCell;
            LastCell = NewList.LastCell;
        }
        else
        {
            LastCell.NextCell = NewList.FirstCell;
            LastCell = NewList.LastCell;
        }
        pCount += NewList.pCount;
    }

    public void CopyList(MyLinkedList<ListType> NewList)
    {
        Cell C = NewList.FirstCell;
        while (C != null)
        {
            this.Add(C.Value);
            C = C.NextCell;
        }
    }

    public MyLinkedList<ListType> SubList(int St)
    {
        try
        {
            var L = new MyLinkedList<ListType>();
            L.FirstCell = GetCell(St);
            L.LastCell = LastCell;
            L.pCount = pCount - St;
            return L;
        }
        finally { }
    }

    public void Insert(ListType Item, int Index)
    {
        try
        {
            if (Index == pCount)
            {
                lastReachedCell = Add(Item);
                lastReachedCellIndex = Index;
            }
            else
            {
                Cell C = new Cell(Item, GetCell(Index));
                if (Index == 0)
                    FirstCell = C;
                else
                    GetCell(Index - 1).NextCell = C;

                pCount += 1;
                lastReachedCell = C;
                lastReachedCellIndex = Index;
            }
        }
        finally { }
    }

    public void InsertList(MyLinkedList<ListType> NewList, int Index)
    {
        try
        {
            if (Index == pCount)
                Add(NewList);
            else
            {
                if (Index == 0)
                    FirstCell = NewList.FirstCell;
                else
                    GetCell(Index - 1).NextCell = NewList.FirstCell;
            }

            NewList.LastCell = GetCell(Index);
            pCount += NewList.Count;
        }
        finally
        {
            lastReachedCell = null;
            lastReachedCellIndex = -1;
        }
    }

    public void CopyList(MyLinkedList<ListType> NewList, int Index)
    {
        Cell C = NewList.FirstCell;
        while (C != null)
        {
            Insert(C.Value, Index);
            C = C.NextCell;
        }
    }

    public void InsertArray(ListType[] Items, int Index)
    {
        var L = new MyLinkedList<ListType>();
        L.Add(Items);
        InsertList(L, Index);
    }

    public void Clear()
    {
        FirstCell = null;
        LastCell = null;
        pCount = 0;
        lastReachedCell = null;
        lastReachedCellIndex = -1;
    }

    public void Replace(int atStart, int length, IEnumerable<ListType> items )
    {
        var c = GetCell(atStart);
        var i = 0;
        foreach (var item in items)
        {
            if (i < length)
            {
                c.Value = item;
                c = c.NextCell;
            }
            else
                Insert(item, atStart + i);

            i++;
        }

        if (i < length)
            RemoveRange(atStart+ i, length - i );

    }

    public void RemoveAt(int Index)
    {
        try
        {
            if (Index == 0) // حذف أول عنصر
                if (pCount == 1)
                {
                    // العنصر الأول هو العنصر الأخير وسيتم حذفه
                    FirstCell = null;
                    LastCell = null;
                }
                else // جعل العنصر الثاني هو العنصر الأول
                    FirstCell = FirstCell.NextCell;
            else if (Index == pCount - 1)
            {
                // حذف آخر عنصر بجعل العنصر السابق له هو العنصر الأخير
                LastCell = GetCell(Index - 1);
                LastCell.NextCell = null;
            }
            else
            {//حذف العنصر الحالي، بجعل العنصر السابق يشير إلى العنصر التالي
                var prev = GetCell(Index - 1); // العنصر السابق
                prev.NextCell = prev.NextCell.NextCell;
            }
            pCount -= 1;
        }
        finally
        {
            lastReachedCell = null;
            lastReachedCellIndex = -1;
        }
    }

    public void RemoveLastItem()
    {
        try
        {
            if (pCount == 1)
            {
                FirstCell = null;
                LastCell = null;
                lastReachedCell = null;
                lastReachedCellIndex = -1;
            }
            else
            {
                LastCell = GetCell(pCount - 2);
                LastCell.NextCell = null;
            }
            pCount -= 1;
        }
        finally { }
    }

    public int RemoveRange(int St, int DelCount)
    {
        try
        {
            if (St + DelCount > pCount)
                DelCount = pCount - St;
            int nxtId = St + DelCount;
            if (St == 0)
                if (pCount == DelCount)
                {
                    FirstCell = null;
                    LastCell = null;
                }
                else
                    FirstCell = GetCell(DelCount);

            else if (nxtId == pCount)
            {
                LastCell = GetCell(St - 1);
                LastCell.NextCell = null;
            }
            else
                GetCell(St - 1).NextCell = GetCell(nxtId);
            pCount -= DelCount;
            return DelCount;
        }
        finally
        {
            lastReachedCell = null;
            lastReachedCellIndex = -1;
        }
    }

    private Cell pCurCell;
    public ListType CurItem
    {
        get
        {
            try
            {
                if (pCurCell == null)
                    throw new Exception("لا توجد خانة حالية");
                return pCurCell.Value;
            }
            finally { }
        }
    }

    public bool MoveFirst()
    {
        if (FirstCell == null) return false;
        pCurCell = FirstCell;
        return true;
    }

    public bool MoveNext()
    {
        if (pCurCell == null)
            throw new Exception("لا توجد خانة حالية");
        if (pCurCell.NextCell == null) return false;
        pCurCell = pCurCell.NextCell;
        return true;
    }

    public bool MoveLast()
    {
        if (LastCell == null) return false;
        pCurCell = LastCell;
        return true;
    }

    public bool MoveTo(int Index)
    {
        try
        {
            pCurCell = GetCell(Index);
            return true;
        }
        catch
        {
            return false;
        }
    }

    public ListType FirstItem
    {
        get
        {
            if (FirstCell == null)
                throw new Exception("القائمة فارغة");
            return FirstCell.Value;
        }
        set
        {
            if (FirstCell == null)
                throw new Exception("القائمة فارغة");
            FirstCell.Value = value;
        }
    }

    public ListType LastItem
    {
        get
        {
            if (LastCell == null)
                throw new Exception("القائمة فارغة");
            return LastCell.Value;
        }
        set
        {
            if (LastCell == null)
                throw new Exception("القائمة فارغة");
            LastCell.Value = value;
        }
    }

    public ListType[] ToArray(int St = 0, int En = -1)
    {
        try
        {
            if (En == -1 || En >= pCount)
                En = pCount - 1;
            ListType[] A = new ListType[En - St + 1];
            Cell C = GetCell(St);
            for (int i = 0; i <= En - St; i++)
            {
                A[i] = C.Value;
                C = C.NextCell;
            }
            return A;
        }
        finally { }
    }
}
danmoseley commented 5 years ago

@VBAndCs what is the "best" solution of course depends on the use case. Your approach looks like it would allocate a lot (a new reference type for every char) and I think would be quite slow in general.

I am confident that a reasonable implementation could be created, there is no need to create one unless and until the API is approved. We have a formal process for that. You would want to edit your top post to follow the format and then discussion can happen.

VBAndCs commented 5 years ago

So, we are going back to square 1: Using String.Replace is the best solution as it uses the SB to build the new string. There is no easy way to do multiple replacements in one iteration. I tried it and give me unwanted results, because I expect to apply the first replace first then the second. The firs t replace can affect the result of the second one, not only by providing new matches, but also by eliminating possible ones! For ex: "abcabc".Replace("ab", "x", "ca", "y") It should yield "xcxc". If the two replacements are applied simultaneously at each pos, the result would be: "xybc"! So, this is best implementation:

    Function Replace(s As String, ParamArray repPairs() As (repStr As String, repWithStr As String)) As String
        For Each x In repPairs
            s = s.Replace(x.repStr, x.repWithStr)
        Next
        Return s
    End Function

It just offers a new compact syntax for the Replace method, with no performance improvements!

Clockwork-Muse commented 5 years ago

...I feel like what you really want to make for your code is a separate parser, then build (and write) the HTML DOM. Which would save you trying to do string replacements (which are not only slow, but can bite you in these kinds of situations, as you're discovering), among other things.

Otherwise, you might have better (overall) results if you did a regex match, then iterated over those results (being careful to look for overlapping matches).

VBAndCs commented 5 years ago

@Clockwork-Muse Maybe you are right. I started this as a proof of concept, but it is getting more complicated as I move forward. I will stuck to the my ZML proof of concept until I cover all necessary code tags, and make a full zml version of the MS eShopOnWeb app, then if the community likes the idea of using xml razor instead of C# and VB.NET Razor, we can create a reliable zml implementation. Note: Things works fine for me right now, but I am dealing with small no of pages with small sizes.

danmoseley commented 5 years ago

@VBAndCs to make progress on this issue, we will need the top post edited to match the API review guidelines, as linked above.

VBAndCs commented 5 years ago

After this discussion and trials, String.Replace seems the best choice for now, so I am closing this. Thanks.

Clockwork-Muse commented 5 years ago

...I mean, even if it doesn't work for your use case, it's probably something we should look at.

VBAndCs commented 5 years ago

Ok, But I made a fast test and it turned out that the successive calls to the String.Replace (which uses SB to build the result) is faster than using successive StringBuilder.Replace or LinkiedList appraoch!