CCExtractor / ccextractor

CCExtractor - Official version maintained by the core team
https://www.ccextractor.org
GNU General Public License v2.0
682 stars 419 forks source link

[QUESTION] Valid xml document #1235

Open jamoore5 opened 4 years ago

jamoore5 commented 4 years ago

Should the XML generator be updated to guarantee that it is generating valid XML?

Example output from one of my video that I exacted subtitles from

1029 | <!--
1030 | And that's exactly it--   that
1031 | human beings perceive patterns
1032 | -->

I was getting an invalid token error from 3 different parsers that I tried to read it with. When I validated the XML in a validator I found

1030: | 25 | The string "--" is not permitted within comments.

Is this a possible issue in the code base?

cfsmp3 commented 4 years ago

Yes, I don't think we're escaping anything - just happily concatenating strings without a care in the world :-)

jamoore5 commented 4 years ago

Is there a reason that the subtitles are a comments instead of text within the element?

Sorry, I do not know much about subtitles and formats.

cfsmp3 commented 4 years ago

Is there a reason that the subtitles are a comments instead of text within the element?

If you mean when writing spupng, because they're not supposed there at all - the XML is just an index to the image files that contain the subtitles. We do add them to the XML for debugging purposes, and only if the OCR subsystem is present. spupng doesn't need a text representation of the subtitles.

jamoore5 commented 4 years ago

@cfsmp3 thanks that was a helpful explanation.

I got introduced to this library and the concept of spupng when following this raspberry pi tutorial. https://magpi.raspberrypi.org/articles/make-comics-from-tv-recordings

The funny thing is that the content in the XML is a lot cleaner then what I exact out of my pngs using the above tutorial. On my raspberry pi, I get boxes instead of text in the png, but the right text in the xml. On my mac, I get the text in the pngs but it is not clear as the text in the xml.

For this project I was going to take advantage of the helpful debug addition and parse the text from the xml.

PulkitMishra commented 4 years ago

hey @cfsmp3 do we need the -- in the xml file . I mean clearly -- is not allowed in XML files. So do we remove it or add a space between the two hyphens?

cfsmp3 commented 4 years ago

@PulkitMishra if we're doing any work on this it should be to generate proper XML. But honestly I don't think this is high priority (based on number of users), but it's low hanging fruit so if you want to do it to get started by all means go ahead.

Just remember that the goal is always to improve quality of output -just removing things we don't like is not a good idea :-)

dak-x commented 3 years ago

@PulkitMishra From your discussion I conclude that subtitle text is provided inside the spu-xml only for debugging purposes. I also assume that the source for invalid xml tokens is from the subtitle itself (implying this is not a decoding bug). I suggest the solutions:

  1. Remove the subtitle text comments from the spu-xml for production. Only keep it for debug builds. We can display a warning message while using spu-xml on debug builds for the case when someone might accidentally generate it.

  2. Filter the text written to the spu-xml to avoid having invalid token. Might escape the strings or update/remove some tokens.

Please suggest what you think is appropriate. And also confirm my above assumption.

cfsmp3 commented 3 years ago

@PulkitMishra From your discussion I conclude that subtitle text is provided inside the spu-xml only for debugging purposes. I also assume that the source for invalid xml tokens is from the subtitle itself (implying this is not a decoding bug). I suggest the solutions:

The invalid XML is caused by our writer not escaping some characters at all to comply with XML specifications. It's not a problem with the subtitles (which don't care at all about XML) or the format itself.

It's our own code being very sloppy here.

  1. Filter the text written to the spu-xml to avoid having invalid token. Might escape the strings or update/remove some tokens.

Yes, this is the right approach.