Closed by GoogleCodeExporter 8 years ago
Can you confirm that replacing the condition in this block (pugixml.cpp, line
2612 in trunk):
    if (cursor->parent)
    {
        PUGI__PUSHNODE(node_pcdata); // Append a new node on the tree.
        cursor->value = s; // Save the offset.
        s = strconv_pcdata(s);
        PUGI__POPNODE(); // Pop since this is a standalone.
        if (!*s) break;
    }
with "if (true)" gives you the desired behavior? I.e. that the only thing
required for fragment parsing that pugixml does not do right now is preserving
PCDATA nodes that don't have a parent node.
Original comment by arseny.k...@gmail.com
on 10 Feb 2013 at 7:54
Yep, looks like that's it.
Original comment by philly.d...@gmail.com
on 11 Feb 2013 at 9:50
One last thing. Since this is likely to be called frequently and on small
fragments, it would be better for performance if fragment parsing had its own
method that bypasses encoding detection and conversion (native pugixml
encoding assumed) but still skips the BOM.
Original comment by philly.d...@gmail.com
on 14 Feb 2013 at 1:59
I looked into making this change in pugixml with a special parsing option flag,
and unfortunately there are some issues.
pugixml has a major assumption that the last character of the XML buffer is
never required by the user. This is needed to allow in-place parsing of a non
null-terminated buffer (see load_buffer) but to return data using
null-terminated strings. Obviously, if you have a 4-byte "text" buffer, it's
impossible to create a document that will return null-terminated "text" string
without extra string copies.
This assumption holds for all well-formed XML documents, but does not hold for
document fragments.
The only way to implement this, then, is to make sure that whenever parsing
runs in fragment mode, the buffer being parsed is zero-terminated.
This means that all APIs will have to do a string copy for fragment parsing
(currently load_buffer_inplace avoids that copy when possible), making fragment
parsing more expensive than full document parsing.
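
To make the constraint concrete, here is a minimal sketch in plain C++. It is
not pugixml code; terminate_in_place, raw_bytes and in_place_demo are
hypothetical helpers invented for this illustration. It shows why a full
document always has a spare byte for the terminator while a bare 4-byte "text"
fragment does not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not pugixml code): end a string inside `buf` by
// writing '\0' at index `end`. Fails when `end` lies outside the buffer,
// because there is then no byte available to hold the terminator.
inline bool terminate_in_place(std::vector<char>& buf, std::size_t end) {
    if (end >= buf.size()) return false;
    buf[end] = '\0';
    return true;
}

// Copy the characters of `s` WITHOUT a trailing '\0', mimicking a raw,
// non null-terminated input buffer.
inline std::vector<char> raw_bytes(const std::string& s) {
    return std::vector<char>(s.begin(), s.end());
}

// Walks through both cases; returns true when they behave as described.
inline bool in_place_demo() {
    // Full document: "text" (indices 3..6) is followed by "</a>", so the
    // '<' at index 7 is no longer needed and can become the terminator.
    std::vector<char> doc = raw_bytes("<a>text</a>");
    bool doc_ok = terminate_in_place(doc, 7) &&
                  std::strcmp(&doc[3], "text") == 0;

    // Fragment: all 4 bytes are payload, so index 4 is past the end and
    // the terminator can only be added by copying into a larger buffer.
    std::vector<char> frag = raw_bytes("text");
    bool frag_ok = terminate_in_place(frag, 4);

    return doc_ok && !frag_ok;
}
```

In the document case the closing tag donates its first byte to the terminator.
In the fragment case nothing follows the text, which is why a copy (or a
caller-supplied terminator) is needed.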
Original comment by arseny.k...@gmail.com
on 21 Aug 2013 at 6:38
In the scenario I had in mind, the input is a null-terminated string. Basically
what I need this for is to parse a pre-indexed fragment of a very large XML
document. The fragment is guaranteed not to end inside a tag. Example input:
"My sentence.<bold> My bold sentence.</bold> My other sentence."
Original comment by philly.d...@gmail.com
on 22 Aug 2013 at 1:26
This is now fixed as of r980 (phew). The parse_fragment flag enables
fragment-mode parsing.
The only caveat is that if you use load_buffer_inplace or
load_buffer_inplace_own, you have to provide a null-terminated buffer (i.e. for
"test" invoke it on a buffer that has a null terminator and pass the size 5)
for all contents to be preserved.
If you don't use load_buffer_inplace_* then pugixml figures everything out
internally.
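
As a rough sketch of the caveat above, the helper below builds a mutable buffer
that includes the null terminator, so that "test" yields a size of 5.
make_inplace_buffer is a hypothetical name and not part of pugixml. The pugixml
call appears only in a comment and assumes a build that includes r980 or later.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of pugixml): copy the fragment into a
// mutable buffer that also holds the trailing '\0', so the size passed to
// the in-place loader covers the terminator.
inline std::vector<char> make_inplace_buffer(const std::string& fragment) {
    const char* p = fragment.c_str();  // c_str() is guaranteed to end in '\0'
    return std::vector<char>(p, p + fragment.size() + 1);
}

// Intended use (assumes pugixml.hpp is included; shown as a comment so the
// sketch compiles without the library):
//
//   std::vector<char> buf = make_inplace_buffer("test");  // buf.size() == 5
//   pugi::xml_document doc;
//   doc.load_buffer_inplace(buf.data(), buf.size(),
//                           pugi::parse_default | pugi::parse_fragment);
//
// With load_buffer_inplace the buffer must stay alive for as long as the
// document is in use.
```

If the input is already a writable, null-terminated char array, no helper is
needed: pass the array and its length plus one.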
Original comment by arseny.k...@gmail.com
on 11 Feb 2014 at 6:51
Original issue reported on code.google.com by
philly.d...@gmail.com
on 8 Feb 2013 at 8:45