Open rampaa opened 1 year ago
Tagging subscribers to this area: @dotnet/area-system-xml See info in area-owners.md if you want to be subscribed.
Author: | rampaa |
---|---|
Assignees: | - |
Labels: | `api-suggestion`, `area-System.Xml` |
Milestone: | - |
Current limit for XmlTextReader is at 10^7
characters from entities. I haven't found a way to workaround that other than reflection. Just curious, what's the scenario where you need more?
I'm not quite sure I understand the problem you're seeing with using XmlReader directly and why new API is needed? Is it because you need to access XmlNodeType.EntityReference
directly? (XmlTextReader let's you access that node type while directly using XmlReader is expanding that earlier). Can you provide end to end example of what you're trying to achieve? (code, sample XML and expected output)
This issue has been marked needs-author-action
and may be missing some important information.
Just curious, what's the scenario where you need more?
I need to parse JMdict and JMnedict for the purposes of my program and those files make heavy use of entity references. JMnedict always exceeded that limit but recently JMdict started to exceed it as well.
I'm not quite sure I understand the problem you're seeing with using XmlReader directly and why new API is needed? Is it because you need to access XmlNodeType.EntityReference directly?
Yes.
Can you provide end to end example of what you're trying to achieve? (code, sample XML and expected output)
Sure.
XML: JMdict_e.gz
Code: https://github.com/rampaa/JL/blob/46ab831c07d9129e82ee657fe98d9379e97c9253/JL.Core/Dicts/EDICT/JMdict/JmdictLoader.cs#L16
https://github.com/rampaa/JL/blob/46ab831c07d9129e82ee657fe98d9379e97c9253/JL.Core/Dicts/EDICT/JMdict/JmdictLoader.cs#L325
Output: The ReadEntity
method returns v5u-s
(i.e., the unexpanded entity name) and it also adds its expanded value (i.e., Godan verb with 'u' ending (special class)
) to a dictionary.
The problem with XmlTextReader
is that I get the System.Xml.XmlException: The input document has exceeded a limit set by MaxCharactersFromEntities
exception because XmlTextReader
does not let me modify the MaxCharactersFromEntities
property.
The problem with XmlReader
instance created by the XmlReader.Create
method is that it expands the entity beforehand and therefore I cannot get the unexpanded entity name (e.g., v5u-s
) even though it's crucial for my program to do so, I can only access to its expanded value (e.g., Godan verb with 'u' ending (special class)
).
@krwq Did you have the time to have a look at the code examples I've provided? Is there anything unclear about them that I can help with?
@rampaa regarding the JMdict's "MaxCharactersFromEntities" error, could you help a non-programmer? Wanted to load jmdict into Excel or Power BI and see this error. How were you able to fix the issue? Is it maybe that some specific few lines are too long and you can shorten that in a notepad (or cut the whole entry and re-add it manually)? That's what I "understood" from your posts there (I just thought it's because of too many entries, i.e. file being too long). Are you able to tell me what to do with the jmdict file (or perhaps link me a "cured" file)? You're the only thing that I could google about that error.
@gespyla
How were you able to fix the issue?
By keeping a cache for entity names and their expanded values and not calling the ResolveEntity
method if the cache has an entry for it. Something like the following should work:
public static readonly Dictionary<string, string> JmdictEntities = new();
public static (string entityName, string expandedEntityValue) ReadEntity(XmlTextReader xmlReader)
{
_ = xmlReader.Read();
string entityName = xmlReader.Name; // unexpanded entity name e.g., "v5u-s"
if (!JmdictEntities.TryGetValue(entityName, out string? expandedEntityValue))
{
xmlReader.ResolveEntity();
_ = xmlReader.Read();
expandedEntityValue = xmlReader.Value; // e.g., "Godan verb with 'u' ending (special class)"
JmdictEntities.Add(entityName, expandedEntityValue);
}
_ = xmlReader.Read();
return (entityName, expandedEntityValue);
}
thanks @rampaa ! is it just opening XMLReader, opening the XML file in it and (1) pasting this code and then save, export or something like that? And the file should hopefully not scream this error when opened by programs like Excel? Or alternatively (2) is it to be swapped with a section in a file? Or (3) am I out of my league and need to learn this stuff first?
need to learn this stuff first?
I don't know what kind of output you are trying to create exactly but yeah, you probably should learn about how to read an XML file with XmlReader
/XmlTextReader
first. The JMdict XML file itself is somewhat complex, so parsing it for your own purposes might not be as straight forward as you might hope. You can see my code (https://github.com/rampaa/JL/blob/master/JL.Core/Dicts/JMdict/JmdictLoader.cs) as a reference if you wish to, it parses the JMdict file and keeps it in memory for fast look-ups. Alternatively, there's a simplified version of JMdict in JSON format, which you may find easier to work with/repurpose for your needs compared to the original XML format.
Background and motivation
I need to read an XML file with
XmlReader
without expanding its entities. I can do this by settingXmlTextReader
's EntityHandling property toEntityHandling.ExpandCharEntities
. But same cannot be done with an XmlReader instance created by theXmlReader.Create
method, becauseXmlReaderSettings
does not expose any settings for it.The problem with using
XmlTextReader
is thatXmlTextReader
does not let me modify theMaxCharactersFromEntities
property and not being able to modify that property causes me to get the following exception while reading the aforementioned XML file:System.Xml.XmlException: The input document has exceeded a limit set by MaxCharactersFromEntities
.XmlTextReader
also doesn't support async methods (e.g.,ReadElementContentAsStringAsync()
) and Microsoft documentation seems to suggest we should prefer usingXmlReader.Create
overXmlTextReader
. That's why instead of requestingXmlTextReader
to allow me to modify theMaxCharactersFromEntities
, I am requestingXmlReader
to let me modify how it handles entities through its settings.API Proposal
API Usage
Alternative Designs
Instead of
XmlReaderSettings
having a setting forEntityHandling
,XmlTextReader
letting me modify theMaxCharactersFromEntities
would also solve the problem for me.Risks
No response