Open IgorMenshikov opened 2 years ago
Tagging subscribers to this area: @dotnet/area-system-xml See info in area-owners.md if you want to be subscribed.
Author: | IgorMenshikov |
---|---|
Assignees: | - |
Labels: | `area-System.Xml`, `tenet-performance`, `untriaged` |
Milestone: | - |
@IgorMenshikov we usually don't optimize for fail paths but I can see this being problematic in a web app. We currently don't have plans to improve this but I'd be happy to review any contributions. Would you be interested in digging a bit to find out what causes that to be slow? You can try to start searching for workarounds, i.e.: pass input to XmlReader and simply do read loop first for initial validation (I'm not sure if problem is specific to XmlReader or XLinq) and see if that one can find errors quicker - assuming XmlReader can do this much faster it would decrease perf slightly overall but at least you wouldn't suffer heavy hit for bad actors.
i.e. calling simply method like this before calling XElement.Parse reduces validation time by 30% for me, not sure if that would be enough for you (that's around 3-4 times slower than for a good case with Validate and around 5-6 times slower than original version without Validate - tested on similarly long XML):
void Validate(string xml)
{
using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
{
while (reader.Read()) ;
}
}
also consider blocking or add some throttling for bad actors
My test with 1000 iterations:
So, I do not see a big difference between XElement.Parse and XmlReader.Create.
But difference between success and fail paths is 100 times.
I have to process millions of XMLs. Usually quite fast but with many wrong XML it may take many minutes.
Profiling it shows that ~85% of the time is spent throwing the failure exception. So you would need to come up with a way that communicated failure via a means other than exceptions.
If exceptions take 85%, it would be perfect to have something like TryParse as we have with other C# stuff. 100 times is too big difference.
A TryParse
method sounds interresting, but I wonder what the 85% actually is. Because I'd also expect TryParse
to give me information on what didn't parse; which at the very least means IXmlLineInfo
when called with LoadOptions.SetLineInfo
if not the full info the thrown XmlException
would have otherwise.
The main validation scenarios I have (that currently use the try
/Parse
/catch
pattern) also provides feedback using the values obtained from the XmlException
. If that ends up being the larger chunk of the 85%, providing a TryParse
method wouldn't immediately bring much benefit there.
(Then again, I dont have the large number of files combined with a high error rate; so I don't immediately see this large performance degradation.)
Honestly in order to provide TryParse methods which return useful error info across entire XML API surface would be relatively large amount of work which would probably be better spent trying to optimize exception handling in general and adding it just to this API would be really weird. I think it's very unlikely we're going to do such changes.
I think if you're getting lots of bad XMLs you should consider adding different protections against that (i.e. blocking users who are sending such XMLs etc)
Description
I am not sure if this is a correct repo but System.Xml.Linq is a part of NET Core, so I post it here.
System.Xml.Linq.XElement.Parse has very low performance if pass incorrect XML. With correct XML the code can do 5000+ parse calls per second. Incorrect XML results in 10s per second.
Maybe it is because it throws an exception on each call.
I have 1000s of XML coming from external data source and cannot control if they are correct. So, low performance just make it impossible to process data.
My test code is below. I run it on NET Core 6.0.4.