Closed Phu2 closed 6 years ago
... Document authors are encouraged to avoid "compatibility characters" ...
I always recommend sanitizing your data before processing it with Catmandu, or using Catmandu::Fixes for this.
PICA::Writer::XML uses simple `print` statements to generate the XML document, so that it can handle really large XML files. I will check whether we could switch to an XML library like XML::Writer, which would catch these errors.
I just came across a problem with characters below 0x20, which are present in my original plain PICA files. Catmandu converts them to PICA XML without complaint, but the result is invalid XML that XSLT processors cannot parse.
Example file (one record in plain PICA; field 233P contains the Unicode character U+0002): record.txt
Command
catmandu convert -v PICA --type binary to PICA --type XML < record.txt > record.xml
The resulting XML cannot be transformed (e.g. to Solr XML) with xmlstarlet, xsltproc or Saxon HE. For now I delete all characters that the XML specification does not allow by using
tr -d '\000-\010\013\014\016-\037'
I wonder if Catmandu should handle this problem and only generate valid xml. What do you think?
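For reference, the `tr` workaround above could be wrapped in a small shell function and applied to the XML output. This is only a sketch: the function name `strip_xml_invalid` is made up for illustration and is not part of Catmandu. It removes the control characters below 0x20 but keeps tab, line feed and carriage return, which XML 1.0 allows. `LC_ALL=C` makes `tr` work on bytes, so multi-byte UTF-8 sequences (all bytes 0x80 and above) pass through untouched.

```shell
#!/bin/sh
# Hypothetical helper (not part of Catmandu): delete the control
# characters below 0x20 that XML 1.0 forbids, keeping tab (\011),
# line feed (\012) and carriage return (\015).
strip_xml_invalid() {
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Possible usage with the command from this report. The filter is
# placed after the conversion, on the assumption that the binary
# PICA input itself relies on control characters as separators:
#
#   catmandu convert -v PICA --type binary to PICA --type XML \
#       < record.txt | strip_xml_invalid > record.xml
```

Note that this drops the offending characters silently, so it hides the problem in the source data rather than reporting it.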