acoustid / acoustid-server

AcoustID's web site and API
MIT License

Some acoustid-server replication files corrupted #22

Closed: adamansky closed this issue 1 year ago

adamansky commented 12 years ago

Hi Lukáš! I'm working on an AcoustID replication script and found an invalid replication dump: http://data.acoustid.org/replication/acoustid-update-4620.xml.bz2

xml.sax.parse fails on this particular replication set, and xmllint reports invalid characters in it:

xmllint --format ./acoustid-update-4620.xml  
./acoustid-update-4620.xml:2: parser error : PCDATA invalid Char value 31
">Séries</column><column name="artist">Television</column><column name="track">
                                                                               ^
./acoustid-update-4620.xml:2: parser error : PCDATA invalid Char value 4
column name="artist">Television</column><column name="track">xœcpJLOILQHNLKUÀ
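
A minimal sketch of how a downloaded dump could be checked for well-formedness from Python before it is applied. It uses only the standard library (bz2 and xml.sax). The function name check_dump is hypothetical and not part of acoustid-server, and the sketch only tests whether the file parses, not whether its contents are valid replication data.

```python
import bz2
import xml.sax
from xml.sax.handler import ContentHandler


def check_dump(path):
    """Return None if the bz2-compressed XML file parses, else an error string.

    Hypothetical helper for illustration; not part of acoustid-server.
    """
    try:
        with bz2.open(path, "rb") as stream:
            # A bare ContentHandler ignores all events; we only want to
            # know whether the parser reaches the end without an error.
            xml.sax.parse(stream, ContentHandler())
    except xml.sax.SAXParseException as exc:
        return "line %d, column %d: %s" % (
            exc.getLineNumber(), exc.getColumnNumber(), exc.getMessage())
    return None
```

Calling check_dump("acoustid-update-4620.xml.bz2") on a local copy of the file would be expected to return an error string rather than None, given the failure reported above.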

It may be a bug in xml.etree.cElementTree (used in export_tables.py), but XML escaping should be performed correctly during XML generation, as this sample shows:

>>> r = etree.Element('test')
>>> r.text = u'bla &'
>>> etree.tostring(r, encoding="UTF-8")
"<?xml version='1.0' encoding='UTF-8'?>\n<test>bla &amp;</test>"

Simple repair solution:

tidy -xml -o ./acoustid-update-4620-fixed.xml ./acoustid-update-4620.xml

lalinsky commented 12 years ago

The problem is that there are some weird characters in the meta table, including 0x04 and 0x06 ASCII control characters. Those are obviously not XML compatible, but xml.etree accepted them and printed directly to the output. It seems that lxml.etree would raise an exception.

acoustid=> select * from meta where id=3454685;
-[ RECORD 1 ]+---------------------------------
id           | 3454685
track        | \x1FxœcpJLOILQHNLKUÀ\x04 xƒ\x06C
artist       | Television
album        | Séries
album_artist | 
track_no     | 
disc_no      | 
year         | 2000
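
The behaviour described above can be reproduced with a short sketch. It uses Python 3's xml.etree.ElementTree, on the assumption that it treats these characters the same way as the cElementTree module used by the export script at the time. The element and the shortened text value are made up for illustration; only the 0x1F and 0x04 control characters are taken from the record.

```python
import xml.etree.ElementTree as ET

# Hypothetical element; the text mixes a markup character with two of the
# control characters (0x1F and 0x04) found in the meta record.
elem = ET.Element("column", {"name": "track"})
elem.text = "\x1fa & b\x04"

data = ET.tostring(elem, encoding="UTF-8")

# '&' is escaped on output, but the control characters are written as-is.
ampersand_escaped = b"&amp;" in data
control_chars_kept = b"\x1f" in data and b"\x04" in data

# Reading the serialized bytes back fails, because these characters are
# not allowed in an XML 1.0 document.
try:
    ET.fromstring(data)
    reparsed = True
except ET.ParseError:
    reparsed = False

print(ampersand_escaped, control_chars_kept, reparsed)
```

In other words, serialization succeeds and the problem only surfaces on the consumer's side, when the dump is parsed.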

There is no reason why such characters should be there, so I guess I'll have to add some extra validation and I'll also switch to lxml. It seems more reliable.
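
One possible shape for such validation, as a sketch only: the thread does not show the check that was eventually added, and the names is_xml_safe and strip_invalid_xml_chars are hypothetical. The character class follows the Char production of the XML 1.0 specification, which excludes everything below 0x20 except tab, line feed and carriage return, as well as surrogates, U+FFFE and U+FFFF.

```python
import re

# Characters that may not appear in an XML 1.0 document.
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")


def is_xml_safe(text):
    """Return True if text contains only characters allowed in XML 1.0."""
    return _INVALID_XML_CHARS.search(text) is None


def strip_invalid_xml_chars(text):
    """Return text with all characters that XML 1.0 disallows removed."""
    return _INVALID_XML_CHARS.sub("", text)


# The track value from the record above, written with explicit escapes.
track = "\x1fx\u0153cpJLOILQHNLKU\u00c0\x04 x\u0192\x06C"
print(is_xml_safe(track))
print(strip_invalid_xml_chars(track))
```

Whether to reject such values when they are submitted or to strip the offending characters on export is a design choice. Rejecting keeps bad data out of the meta table in the first place, while stripping would also cover rows that are already stored.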