dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.96k stars 4.65k forks source link

System.Xml.Linq.XElement.Parse very low performance with incorrect XML #68850

Open IgorMenshikov opened 2 years ago

IgorMenshikov commented 2 years ago

Description

I am not sure if this is a correct repo but System.Xml.Linq is a part of NET Core, so I post it here.

System.Xml.Linq.XElement.Parse has very low performance if pass incorrect XML. With correct XML the code can do 5000+ parse calls per second. Incorrect XML results in 10s per second.

Maybe it is because it throws an exception on each call.

I have 1000s of XML coming from external data source and cannot control if they are correct. So, low performance just make it impossible to process data.

My test code is below. I run it on NET Core 6.0.4.

        [Fact(Timeout = 1000)]
        public static void ParseFail()
        {
            var xml = "<d /><d><C1>01</C1></d>";
            const int iterCount = 5000;

            for (int i = 0; i < iterCount; i++)
            {
                try
                {
                    System.Xml.Linq.XElement.Parse(xml);
                }
                catch (Exception)
                {
                }
            }
        }
ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-xml See info in area-owners.md if you want to be subscribed.

Issue Details
### Description I am not sure if this is a correct repo but System.Xml.Linq is a part of NET Core, so I post it here. System.Xml.Linq.XElement.Parse has very low performance if pass incorrect XML. With correct XML the code can do 5000+ parse calls per second. Incorrect XML results in 10s per second. Maybe it is because it throws an exception on each call. I have 1000s of XML coming from external data source and cannot control if they are correct. So, low performance just make it impossible to process data. My test code is below. I run it on NET Core 6.0.4. ```C# [Fact(Timeout = 1000)] public static void ParseFail() { var xml = "01"; const int iterCount = 5000; for (int i = 0; i < iterCount; i++) { try { System.Xml.Linq.XElement.Parse(xml); } catch (Exception) { } } } ```
Author: IgorMenshikov
Assignees: -
Labels: `area-System.Xml`, `tenet-performance`, `untriaged`
Milestone: -
krwq commented 2 years ago

@IgorMenshikov we usually don't optimize for fail paths but I can see this being problematic in a web app. We currently don't have plans to improve this but I'd be happy to review any contributions. Would you be interested in digging a bit to find out what causes that to be slow? You can try to start searching for workarounds, i.e.: pass input to XmlReader and simply do read loop first for initial validation (I'm not sure if problem is specific to XmlReader or XLinq) and see if that one can find errors quicker - assuming XmlReader can do this much faster it would decrease perf slightly overall but at least you wouldn't suffer heavy hit for bad actors.

krwq commented 2 years ago

i.e. calling simply method like this before calling XElement.Parse reduces validation time by 30% for me, not sure if that would be enough for you (that's around 3-4 times slower than for a good case with Validate and around 5-6 times slower than original version without Validate - tested on similarly long XML):

void Validate(string xml)
{
    using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
    {
        while (reader.Read()) ;
    }
}

also consider blocking or add some throttling for bad actors

IgorMenshikov commented 2 years ago

My test with 1000 iterations:

  1. XElement.Parse with good XML = 40 ms
  2. XElement.Parse with bad XML = 3.9 sec
  3. XmlReader.Create with bad XML = 3.5/4.3 sec

So, I do not see a big difference between XElement.Parse and XmlReader.Create.

But difference between success and fail paths is 100 times.

I have to process millions of XMLs. Usually quite fast but with many wrong XML it may take many minutes.

stephentoub commented 2 years ago

Profiling it shows that ~85% of the time is spent throwing the failure exception. So you would need to come up with a way that communicated failure via a means other than exceptions.

IgorMenshikov commented 2 years ago

If exceptions take 85%, it would be perfect to have something like TryParse as we have with other C# stuff. 100 times is too big difference.

BhaaLseN commented 2 years ago

A TryParse method sounds interresting, but I wonder what the 85% actually is. Because I'd also expect TryParse to give me information on what didn't parse; which at the very least means IXmlLineInfo when called with LoadOptions.SetLineInfo if not the full info the thrown XmlException would have otherwise.

The main validation scenarios I have (that currently use the try/Parse/catch pattern) also provides feedback using the values obtained from the XmlException. If that ends up being the larger chunk of the 85%, providing a TryParse method wouldn't immediately bring much benefit there. (Then again, I dont have the large number of files combined with a high error rate; so I don't immediately see this large performance degradation.)

krwq commented 2 years ago

Honestly in order to provide TryParse methods which return useful error info across entire XML API surface would be relatively large amount of work which would probably be better spent trying to optimize exception handling in general and adding it just to this API would be really weird. I think it's very unlikely we're going to do such changes.

I think if you're getting lots of bad XMLs you should consider adding different protections against that (i.e. blocking users who are sending such XMLs etc)