KanjiVG / kanjivg

Kanji vector graphics
http://kanjivg.tagaini.net
Other
1.06k stars 181 forks source link

replace embedded DTD with namespace declaration #456

Closed praschke closed 3 months ago

praschke commented 3 months ago

Closes #451.

praschke commented 3 months ago

after doing some further research, i've found that ID attributes can contain colons in XML 1.0, SVG 1.0, and HTML5, and that the reason people advise against it is that IDs containing colons have to be escaped in CSS selectors. I haven't found any other reason to avoid colons.

this is fortunate as changing the IDs would require many more changes to downstream tools, compared to removing the identical and uninformative DTD embedded in thousands of files. (almost 8MB!)

benkasminbullock commented 3 months ago

The current KanjiVG SVG is already fully compliant with the SVG 1.0 standard.

praschke commented 3 months ago

it is not valid if embedded DTDs are not parsed, which is now common security practice.

praschke commented 3 months ago

i have renamed it to be less offensive.

francoisthire commented 3 months ago

I am facing the same issue, I can't parse the current SVG files, and the changes made by this MR allows me to parse them (I am using an XML parser: https://erratique.ch/software/xmlm/doc/Xmlm/index.html).

I think it would be cool to have this MR to make the SVG files parsed more easily by many tools out there.

I am not sure about the cons: Maybe it is hard to make sure that there were no other changes than the one suggested in this PR description? If so, it could be helpful to have some commands (I guess sed command?) so that we can replay the changes locally and check we get the same output.

praschke commented 3 months ago

the two commands i used were, in the kanji directory:

fd . -e svg -x perl -i -0pe 's$ \[\n<!ATTLIST g\nxmlns:kvg CDATA #FIXED "http://kanjivg.tagaini.net"\nkvg:element CDATA #IMPLIED\nkvg:variant CDATA #IMPLIED\nkvg:partial CDATA #IMPLIED\nkvg:original CDATA #IMPLIED\nkvg:part CDATA #IMPLIED\nkvg:number CDATA #IMPLIED\nkvg:tradForm CDATA #IMPLIED\nkvg:radicalForm CDATA #IMPLIED\nkvg:position CDATA #IMPLIED\nkvg:radical CDATA #IMPLIED\nkvg:phon CDATA #IMPLIED >\n<!ATTLIST path\nxmlns:kvg CDATA #FIXED "http://kanjivg.tagaini.net"\nkvg:type CDATA #IMPLIED >\n\]$$g' {}

fd . -e svg -x sed -i 's$xmlns="http://www.w3.org/2000/svg"$xmlns="http://www.w3.org/2000/svg" xmlns:kvg="http://kanjivg.tagaini.net"$' {}

you can rewrite it to use find instead of fd if you do not have fd installed. it is not very difficult to verify that these were the only changes made in the large commit.

benkasminbullock commented 3 months ago

I understand that some people are having trouble with the KanjiVG files with particular software applications. However, as I have repeatedly pointed out, according to the W3C SVG validator, the KanjiVG files are valid SVG 1.0, and it is easy and permitted for users to alter the files using, for example, your script, with the exception of the copyright notice, which should not be altered.

benkasminbullock commented 3 months ago

I am facing the same issue, I can't parse the current SVG files, and the changes made by this MR allows me to parse them (I am using an XML parser: https://erratique.ch/software/xmlm/doc/Xmlm/index.html).

I'm not prepared to go to this web page and try to download and install and then try out the XML parser that you are using. If there is a specific error from the XML parser that you are using then please inform what it is. Ideally this should be done under the "issues" page rather than as a comment on this pull request. The current KanjiVG files, as far as I know, are valid SVG 1.0.

I think it would be cool to have this MR to make the SVG files parsed more easily by many tools out there.

The one bug report about the problem with parsing the current files was so vague and non-specific and verbose as to be completely useless. The person behind that travesty seemed to think that I was much more interested in reading his endlessly-edited comments across several different websites than I am. If there is a problem with the current format of the files, please report that as a specific issue. The current KanjiVG is SVG 1.0 and I brought up the topic of upgrading to SVG 1.1 on the mailing list, but this resulted in yet more of the useless verbiage and nothing which I can take action on, let alone a consensus to do something.

I am not sure about the cons: Maybe it is hard to make sure that there were no other changes than the one suggested in this PR description? If so, it could be helpful to have some commands (I guess sed command?) so that we can replay the changes locally and check we get the same output.

The cons in this case are that we end up making unnecessary changes to a large number of files on the say-so of random people on the internet who cannot back up what they are saying with documentation in the SVG standards as to how the current files are incorrect.

francoisthire commented 3 months ago

After checking with the W3C validator here: https://validator.w3.org/check , it seems this MR produces invalid SVG 1.0 standard files. Maybe, one should just consider adding the attribute: xmlns:kvg="http://kanjivg.tagaini.net" in the SVG tag and not removing the ATTLIST tag?

The difference with adding this attribute is that it makes the file XML compliant according to this XML validator: https://jsonformatter.org/xml-validator

praschke commented 3 months ago

@benkasminbullock i explained the issue in my earlier comments on this PR, and corrected the first reporter's misinformation regarding the validity of colons in ids, which would be a much more disruptive change.

you may be correct that this is not an issue of strict compliance, but embedded DTDs are now commonly not parsed for security and performance reasons. because of this, these files will now fail to validate in many parsers. this is a usability issue.

if you are worried about breaking unknown projects that depend on the embedded DTD for parsing, then would a PR that only moves the namespace declarations out of the DTD and into the root element be acceptable?

@francoisthire the w3c validator requires files to be valid under specific recognized schemas with no allowance for extensibility. unrecognized namespaces are incorrectly treated as invalid.

benkasminbullock commented 3 months ago

This is the result of trying to parse the SVG in KanjiVG using the SVG 1.1 standard. The suggested fix in this pull request is not adequate in my opinion. We should either change the entire repository to SVG 1.1 or at least offer a release with SVG 1.1 versions of the files.

benkasminbullock commented 3 months ago

Unfortunately my comment above is mistaken, this is nothing to do with the SVG 1.1 standard. The ATTLIST and the kvg: prefix information are perfectly valid SVG 1.1. Removing the ATTLIST and the kvg: prefixed information on the groups and strokes is not an acceptable solution to the failures of people's XML parsing software. The only thing which I could do to help would be to offer a release of KanjiVG with the attributes removed for the use of people working with these kinds of XML parsers.