dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.39k stars 4.75k forks source link

[API Proposal]: Add EntityHandling property to XmlReaderSettings #86107

Open rampaa opened 1 year ago

rampaa commented 1 year ago

Background and motivation

I need to read an XML file with XmlReader without expanding its entities. I can do this by setting XmlTextReader's EntityHandling property to EntityHandling.ExpandCharEntities. But same cannot be done with an XmlReader instance created by the XmlReader.Create method, because XmlReaderSettings does not expose any settings for it.

The problem with using XmlTextReader is that XmlTextReader does not let me modify the MaxCharactersFromEntities property and not being able to modify that property causes me to get the following exception while reading the aforementioned XML file: System.Xml.XmlException: The input document has exceeded a limit set by MaxCharactersFromEntities.

XmlTextReader also doesn't support async methods (e.g., ReadElementContentAsStringAsync()) and Microsoft documentation seems to suggest we should prefer using XmlReader.Create over XmlTextReader. That's why instead of requesting XmlTextReader to allow me to modify the MaxCharactersFromEntities, I am requesting XmlReader to let me modify how it handles entities through its settings.

API Proposal

namespace System.Xml;

public sealed class XmlReaderSettings
{
    private EntityHandling _entityHandling;

    public EntityHandling EntityHandling
    {
        get
        {
            return _entityHandling;
        }
        set
        {
            CheckReadOnly(nameof(EntityHandling));
            _entityHandling = value;
        }
    }

    /* Rest of the XmlReaderSettings class */
}

API Usage

XmlReaderSettings xmlReaderSettings = new()
{
    DtdProcessing = DtdProcessing.Parse,
    IgnoreWhitespace = true,
    MaxCharactersFromEntities = 0,

    // Newly added property
    EntityHandling = EntityHandling.ExpandCharEntities
};

using XmlReader xmlReader = XmlReader.Create(path, readerSettings);

/* XML reading logic goes here */

Alternative Designs

Instead of XmlReaderSettings having a setting for EntityHandling, XmlTextReader letting me modify the MaxCharactersFromEntities would also solve the problem for me.

Risks

No response

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-xml See info in area-owners.md if you want to be subscribed.

Issue Details
### Background and motivation I need to read an XML file with `XmlReader` without expanding its entities. I can do this by setting `XmlTextReader`'s [EntityHandling](https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmltextreader.entityhandling) property to `EntityHandling.ExpandCharEntities`. But same cannot be done with the `XmlReader.Create` method, because `XmlReaderSettings` does not expose any settings for it. The problem with using XmlTextReader is that XmlTextReader does not let me modify the `MaxCharactersFromEntities` property and not being able to modify that property causes me to get the following exception while reading the aforementioned XML file: `System.Xml.XmlException: The input document has exceeded a limit set by MaxCharactersFromEntities`. XmlTextReader also doesn't support async methods (e.g., `ReadElementContentAsStringAsync()`) and Microsoft documentation seems to suggest we should prefer `XmlReader.Create` over `XmlTextReader`. That's why instead of requesting `XmlTextReader` to allow me to modify the `MaxCharactersFromEntities`, I am requesting XmlReader to let us modify how it handles entities through its settings. ### API Proposal ```csharp namespace System.Xml; public sealed class XmlReaderSettings { private EntityHandling _entityHandling; public EntityHandling EntityHandling { get { return _entityHandling; } set { CheckReadOnly(nameof(EntityHandling)); _entityHandling = value; } } /* Rest of the XmlReaderSettings class */ } ``` ### API Usage ```csharp XmlReaderSettings xmlReaderSettings = new() { DtdProcessing = DtdProcessing.Parse, IgnoreWhitespace = true, MaxCharactersFromEntities = 0, EntityHandling = EntityHandling.ExpandCharEntities }; using XmlReader xmlReader = XmlReader.Create(path, readerSettings); /* XML reading logic goes here */ ``` ### Alternative Designs Instead of `XmlReaderSettings` having a setting for `EntityHandling`, `XmlTextReader` letting me modify the `MaxCharactersFromEntities` would also solve the problem for me. ### Risks _No response_
Author: rampaa
Assignees: -
Labels: `api-suggestion`, `area-System.Xml`
Milestone: -
krwq commented 1 year ago

Current limit for XmlTextReader is at 10^7 characters from entities. I haven't found a way to workaround that other than reflection. Just curious, what's the scenario where you need more?

I'm not quite sure I understand the problem you're seeing with using XmlReader directly and why new API is needed? Is it because you need to access XmlNodeType.EntityReference directly? (XmlTextReader let's you access that node type while directly using XmlReader is expanding that earlier). Can you provide end to end example of what you're trying to achieve? (code, sample XML and expected output)

ghost commented 1 year ago

This issue has been marked needs-author-action and may be missing some important information.

rampaa commented 1 year ago

Just curious, what's the scenario where you need more?

I need to parse JMdict and JMnedict for the purposes of my program and those files make heavy use of entity references. JMnedict always exceeded that limit but recently JMdict started to exceed it as well.

I'm not quite sure I understand the problem you're seeing with using XmlReader directly and why new API is needed? Is it because you need to access XmlNodeType.EntityReference directly?

Yes.

Can you provide end to end example of what you're trying to achieve? (code, sample XML and expected output)

Sure.

XML: JMdict_e.gz Code: https://github.com/rampaa/JL/blob/46ab831c07d9129e82ee657fe98d9379e97c9253/JL.Core/Dicts/EDICT/JMdict/JmdictLoader.cs#L16 https://github.com/rampaa/JL/blob/46ab831c07d9129e82ee657fe98d9379e97c9253/JL.Core/Dicts/EDICT/JMdict/JmdictLoader.cs#L325 Output: The ReadEntity method returns v5u-s (i.e., the unexpanded entity name) and it also adds its expanded value (i.e., Godan verb with 'u' ending (special class)) to a dictionary.

The problem with XmlTextReader is that I get the System.Xml.XmlException: The input document has exceeded a limit set by MaxCharactersFromEntities exception because XmlTextReader does not let me modify the MaxCharactersFromEntities property.

The problem with XmlReader instance created by the XmlReader.Create method is that it expands the entity beforehand and therefore I cannot get the unexpanded entity name (e.g., v5u-s) even though it's crucial for my program to do so, I can only access to its expanded value (e.g., Godan verb with 'u' ending (special class)).

rampaa commented 1 year ago

@krwq Did you have the time to have a look at the code examples I've provided? Is there anything unclear about them that I can help with?

gespyla commented 1 month ago

@rampaa regarding the JMdict's "MaxCharactersFromEntities" error, could you help a non-programmer? Wanted to load jmdict into Excel or Power BI and see this error. How were you able to fix the issue? Is it maybe that some specific few lines are too long and you can shorten that in a notepad (or cut the whole entry and re-add it manually)? That's what I "understood" from your posts there (I just thought it's because of too many entries, i.e. file being too long). Are you able to tell me what to do with the jmdict file (or perhaps link me a "cured" file)? You're the only thing that I could google about that error.

rampaa commented 1 month ago

@gespyla

How were you able to fix the issue?

By keeping a cache for entity names and their expanded values and not calling the ResolveEntity method if the cache has an entry for it. Something like the following should work:

    public static readonly Dictionary<string, string> JmdictEntities = new();

    public static (string entityName, string expandedEntityValue) ReadEntity(XmlTextReader xmlReader)
    {
        _ = xmlReader.Read();

        string entityName = xmlReader.Name; // unexpanded entity name e.g., "v5u-s"

        if (!JmdictEntities.TryGetValue(entityName, out string? expandedEntityValue))
        {
            xmlReader.ResolveEntity();
            _ = xmlReader.Read();

            expandedEntityValue = xmlReader.Value; // e.g., "Godan verb with 'u' ending (special class)"
            JmdictEntities.Add(entityName, expandedEntityValue);
        }

        _ = xmlReader.Read();

        return (entityName, expandedEntityValue);
    }
gespyla commented 1 month ago

thanks @rampaa ! is it just opening XMLReader, opening the XML file in it and (1) pasting this code and then save, export or something like that? And the file should hopefully not scream this error when opened by programs like Excel? Or alternatively (2) is it to be swapped with a section in a file? Or (3) am I out of my league and need to learn this stuff first?

rampaa commented 1 month ago

need to learn this stuff first?

I don't know what kind of output you are trying to create exactly but yeah, you probably should learn about how to read an XML file with XmlReader/XmlTextReader first. The JMdict XML file itself is somewhat complex, so parsing it for your own purposes might not be as straight forward as you might hope. You can see my code (https://github.com/rampaa/JL/blob/master/JL.Core/Dicts/JMdict/JmdictLoader.cs) as a reference if you wish to, it parses the JMdict file and keeps it in memory for fast look-ups. Alternatively, there's a simplified version of JMdict in JSON format, which you may find easier to work with/repurpose for your needs compared to the original XML format.