RazrFalcon / svgcleaner

svgcleaner could help you to clean up your SVG files from the unnecessary data.
GNU General Public License v2.0
1.62k stars 93 forks source link

Parsing seems to fail on innocuous attributes #216

Open technicaltitch opened 4 years ago

technicaltitch commented 4 years ago

I have a lot of files to parse, generated by pdf.js. As far as I can tell, and as far as I've found parsing them myself into a database, and rendering in Firefox and Chrome, the SVG is OK. However the majority of my files get the error:

Error: invalid attribute value at 1:29465.

Here are some examples of the characters at the associated positions:

The second 5 in fill="rgb(255,242,0)" line 1 pos 20350:

page00004.svg.zip

The t in width in stroke-width="1.5px" line 1 pos 2224:

page00008.svg.zip

Others were stroke-dashoffset="0px" and stroke-linejoin="".

All pages except one fail in this book:

Tigrigna Student Textbook - Grade 11.zip

Is there anything I can do to relax the parsing, eg, so it just ignores attributes it doesn't like rather than fails the file? I am only SAX parsing as XML but the documents seem well formed (if undoubtedly pretty complex). I can't see a pattern, but any hints on what might be going wrong so I can preprocess?

Huge thanks

RazrFalcon commented 4 years ago

It doesn't like empty attributes. So you should try removing them.

As for errors, the is a bug, so it prints line/column in bytes, not in characters. And your files has some multibyte characters, so it produces an invalid position.

JoKalliauer commented 4 years ago

@technicaltitch

You might think of running svgomg first, that resolves the problem.

Comparision between optimizers for page00004.svg.zip -svgo produces: SVGO.svg.txt -scour produces: scour.svg.txt -svgcleaner produces: svgcleaner.svg.txt

daviddeutsch commented 4 years ago

@RazrFalcon I found another issue and I think I've narrowed it down to this example:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg xmlns="http://www.w3.org/2000/svg" version="1.1" width="250" height="250" viewbox="0 0 950 250">
  <defs>
    <clipPath id="1">
      <path style="opacity:0" d="M 0,157.57797 H 250 V -92.422028 H 0 Z" transform="translate(700,92.422028)"/>
    </clipPath>
  </defs>
        <g clip-path="url(#1)">
      </g>
</svg>

The error seems to be: Any ID that starts with a number. I understand that this is not valid XML, but perhaps the error could be a bit more helpful, here?

RazrFalcon commented 4 years ago

What it prints now?

daviddeutsch commented 4 years ago

@RazrFalcon not sure whether you were replying to my request, but the error was the same as in the OP message:

Error: invalid attribute value at 8:23.

so right at the url(#1) mark. When I change the id to #a, it works just fine.

RazrFalcon commented 4 years ago

@daviddeutsch Thanks, the error is definitely obscure. And yes, id cannot start from a number. But it will be relaxed in the next release, since this is just too common.

daviddeutsch commented 4 years ago

@RazrFalcon Thank you. I agree that it does make sense to relax the restriction here.

And thank you for svgcleaner! I was desparately looking for a replacement for svgo which just kept breaking in a million ways and this one has been a true blessing :bow:

JoKalliauer commented 4 years ago

@daviddeutsch Every svg optimizer has its advantages and disavantages, and correctness of svgs is not svgo's strength see https://github.com/RazrFalcon/svgcleaner#correctness

advantages of svgcleaner compared to svgo

disadvantages of svgcleaner compared to svgo

But as you already highlighted, correctness might be the most important feature.

daviddeutsch commented 4 years ago

@JoKalliauer Indeed. I need to process a lot of SVGs and even slight errors are completely unacceptable.

It's astonishing to me that svgo is as lossy/destructive as it is with its default settings and the options you have don't really get you to a lossless place. I've tried for about three years now and have finally given up when the (indeed, nice if it'd work) mergePaths completely destroyed a rendered text.

That it's also a lot faster is really the icing on the cake. I'm kicking myself for not finding it earlier.

Hey @RazrFalcon what can I do to help out? I mean on the project in general.

RazrFalcon commented 4 years ago

@daviddeutsch Nothing, really. The project has some fundamental flaws, so it has to be rewritten from scratch. It will take 2-3 months, and I simply don't have so much time at the moment.

Also, the new version will lose most of the options, since "selective cleaning" is a dead end and can't be implemented correctly. This will probably disappoint some users.

daviddeutsch commented 4 years ago

@RazrFalcon to be honest, the "cleaning" part of svgcleaner doesn't even interest me :wink:

Sorry to hear that there is so much work to be done. It would be helpful if you could provide some details on the fundamental flaws you mentioned. If somebody else decides to fork or copy your work (perhaps in another language), knowing the problems you ran into would be a great asset.

RazrFalcon commented 4 years ago

The main problem is that you have to know how SVG works very-very well. I thought that I do, but then I wrote resvg and learned that I didn't know anything. There are just too many edge-cases in SVG and svgcleaner ignores most of them.

The main flaw of svgcleaner is that it tries to preserve the input XML. This is a dead end. You just can't. XML/SVG simply aren't lossless. So in a new version, the SVG structure reconstruction will be mandatory. Which will make most of the options obsolete.

Also, svgcleaner doesn't really support filters. But I guess this can be fixed without a rewrite.

daviddeutsch commented 4 years ago

@RazrFalcon Interesting! That brings me back to my original plan. Since I really only need a very fast svg compressor, I'd be willing to trade percentage points of compression for speed and going trough your - very excellent! - documentation, maybe it'd be easy to get most of the cleaning options down so you get, say, 90% of the "perfect" compression while staying away from those pitfalls that you mentioned.

So the plan that I might try one day is to take your documentation, take all the rules that have a meaningful impact on file size (say, everything that reduces a file by more than 20%) and implement them in something fast like luajit.

RazrFalcon commented 4 years ago

What do you mean by compressor?

daviddeutsch commented 4 years ago

@RazrFalcon minifying the SVG - throwing out everything that is unnecessary to increase download speed.

JoKalliauer commented 4 years ago

@daviddeutsch. I think the save options, where you stay away from any pitfall weill be lower than 10% of the perfect optimization. CSS can break almost anything. (I don't know if there exist any optimization without any pitfall, maybe display="none".) A 100% safe optimization will have neglegtable reduction.

daviddeutsch commented 4 years ago

@JoKalliauer So far, I've only worked on HTML compression which, to my surprise, improves upon gzip compression with simple tricks like removing all the whitespace outside of text content.

From that experience, I would very much agree - you often only need a handful of tools to get 80 or 90% of the way to "perfect" file size reduction and those last 10-20% are simply not worth the effort, especially because you end up treading dangerous waters where you think your optimization works but you actually broke your content for some browser or use case. So not only do you have to invest time into programming a complex solution (for negligible benefit), you actually have to maintain that complexity which can be a tough lot and sometimes even a moving target.

'Choose your battles wisely', or so the say :wink:

RazrFalcon commented 4 years ago

@daviddeutsch I see. I thought about a gzip compression.

Essentially, this is what svgc is trying to do. Remove as much as possible, without changing the way an image is rendered. Sadly, I do not have time to work on it right now.

And svgc doesn't implement a lot of things yet. Text processing doesn't exist. Filters are resaved as is. Etc.

By the way, the hardest part of an app like svgcleaner is not to clean an SVG, but to parse it correctly in the first place. You will need a very good XML (DTD, namespaces, etc.) and CSS (parser + selector) libraries. Then you have to parse SVG value types (something like Path Data or Transform list), which is not trivial. And then you have to create an SVG tree (not an XML tree). The tree itself is pretty tricky, because you have to preprocess whitespaces correctly to get a valid text (yes, xml:space and stuff). And use/xlink resolving isn't trivial too. And only then you can actually start "cleaning" an SVG.

RazrFalcon commented 4 years ago

@daviddeutsch Also, you can use resvg-test-suite as a test polygon. As long as you can process all those files correctly - you are pretty much done. And this test suite contains most absurd SVG you could find. Sadly, I wrote it after svgcleaner. So it would probably fail pretty badly.

daviddeutsch commented 4 years ago

@RazrFalcon You are right, which is why I usually don't do parsing :wink:

The full code for the html compressor isn't in any shape that I'd be comfortable to share, but it's literally just text substitution. This, for instance, is how I compress whitespace:

local html = io.read("*all")

html = string.gsub(html, "[%s\r\n]*<", "<")

html = string.gsub(html, ">[%s\r\n]+<", "><")

I haven't done any kind of scientific comparison, but I just kept adding small optimizations like that until I came near (like within 15% of) the best "perfect compression" html tools that I found and called it a day.

Thank you very much for mentioning the test suite. I'm sure it'll be a valuable asset :bow:

daviddeutsch commented 4 years ago

@RazrFalcon Woah, I just realized that your resvg cli tool supports basically all of the inkscape CLI options that I need - exporting individual IDs (to SVGs, even! with --dump-svg), exporting to pngs, even listing all IDs with their coordinates.

I think the only thing that is missing is explicitly deciding whether the exported SVG is cropped to the size of the element or if it

inkscape is of course very slow for those tasks, like exporting an individual ID takes 300ms - and I often run tasks where I'm exporting hundreds of those… So I was already on the lookout for an alternative.

RazrFalcon commented 4 years ago

@daviddeutsch Yeap, and it has a far better SVG support.

explicitly deciding whether the exported SVG is cropped to the size of the element or if it

What do you mean by that? Automatic SVG size detection?

daviddeutsch commented 4 years ago

@RazrFalcon In inkscape CLI, exporting an ID means that a 50x50px rectangle on a 200x200px viewBox will result in a 50x50px image. If you use --export-area-page, you get a 200x200px image.

In my application, I'm taking multiple input SVGs that have the same viewBox size, export some elements of each and compose those into a new file - by copying the elements into a new SVG with the same viewBox. So for that, I need to keep the coordinates as they were in the original source SVG.

In step 2, I take that composite SVG and split it up into files (some PNG, some SVG) which I position in an HTML file with the coordinates that I read out earlier (well, a percentage value so it's somewhat 'responsive'). For this step, I need cropped images where every SVG element starts at 0,0, internally.

RazrFalcon commented 4 years ago

Yes, right now you have to compose those images manually.

I will think about adding such an option to resvg.

daviddeutsch commented 4 years ago

@RazrFalcon That'd be great! Thanks for considering! :bow: