[Bug]: Conversion from symbols to entities causes reader to crash

PhoenixIV commented 4 months ago

Bug Description

Sigil automatically converts symbols to entities (e.g. a non-breaking space to &nbsp;).

Please note that this causes some readers to crash.

Yes, calibre also warns about that, but I only noticed when I experienced it myself. One reader that crashes is https://www.epubread.com/ for example. So please be wary and reconsider this.

Also, Sigil behaves unexpectedly: When importing a book with missing doctype it adds it and converts symbols to entities. When importing a book with good doctype it does not convert symbols to entities.

I created test files for you: symbol vs entity.zip

Platform (OS)

Linux

OS Version / Specifics

-

What version of Sigil are you using?

2.2.0

Any backtraces or crash reports

No response

dougmassay commented 4 months ago

If a reader crashes over a spec-compliant entity, then that is a reader bug.

But regardless of that... remove the non breaking space entity (either named or numeric) from Sigil's Preserve Entities list (in Sigil prefs) and Sigil will no longer convert that non-breaking space character to an entity.
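As a rough illustration of the idea (not Sigil's actual code), a user-configurable preserve list can be thought of as a mapping from characters to the entities that should stand in for them. The names `PRESERVE_ENTITIES` and `apply_preserve_list` below are hypothetical, chosen only for this sketch:

```python
# Hypothetical stand-in for a user-editable "Preserve Entities" list.
# This is NOT Sigil's implementation, only a sketch of the concept.
PRESERVE_ENTITIES = {
    "\u00a0": "&#160;",  # non-breaking space
    "\u00ad": "&#173;",  # soft hyphen
}


def apply_preserve_list(text, preserve):
    """Replace each listed character with its configured entity in one pass."""
    return text.translate(str.maketrans(preserve))


if __name__ == "__main__":
    sample = "10\u00a0km"
    # With the entry present, the character is written out as an entity.
    print(apply_preserve_list(sample, PRESERVE_ENTITIES))
    # With the entry removed, the character is left untouched.
    trimmed = {k: v for k, v in PRESERVE_ENTITIES.items() if k != "\u00a0"}
    print(apply_preserve_list(sample, trimmed) == sample)
```

Under that assumption, removing the non-breaking space entry from the list is all it takes for the character to pass through unchanged.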

There is no bug that I see here, only misunderstood behavior/settings.

PhoenixIV commented 4 months ago

I see that, but in this case I appreciate calibre's goal: generate files that work around bugs in devices, so they do not have to go in the trash. I also think it is crazy that readers crash over this.

I therefore suggest reviewing the default conversion list with this in mind.

On the other hand I am not certain if the symbols mentioned might actually even be illegal by xml1.1 (epub2) standard. Which puts a whole new dimension to this.

Aside from this, there is still the inconsistency bug I mentioned.

dougmassay commented 4 months ago

Unicode characters are not illegal in the epub2 spec. And we have no default conversion list. We have user configurable choices. While we try to make sure Sigil doesn't barf all over calibre generated epubs, we make no effort to match calibre feature for feature. People who use both are always going to have hurdles and choices.

What inconsistency bug are you referring to? If you change your Sigil prefs as I described, the Unicode non-break space characters will not be changed to entities -- regardless of the presence, or lack of, doctypes.

PhoenixIV commented 4 months ago

> Unicode characters are not illegal in the epub2 spec.

Thanks, I stopped searching after a few minutes for this information.

> And we have no default conversion list

Of course you do. The setting right after standard installation is what is considered default.

I see reason to consider changing it. Actively changing something that was valid before but breaks on devices after the change is a big step.

I see beauty in both programs.

> What inconsistency bug are you referring to?

From my initial message:

> Also, Sigil behaves unexpectedly: When importing a book with missing doctype it adds it and converts symbols to entities. When importing a book with good doctype it does not convert symbols to entities.

You can try this with the files I provided.

I am surprised it only cares about character conversion when it also found a missing doctype.

Hope this helps

dougmassay commented 4 months ago

The character conversion has nothing to do with the missing doctypes. Any automated processing will result in consulting the Preserve Entities list.

I'm sorry, but everything is behaving as we have intended it to in these instances. Not liking how something works does not actually make that something a bug. Which is what this reporting system is reserved for. If it's a feature change you're after, feel free to start a thread in our user forums over at MobileRead.

dougmassay commented 4 months ago

I forgot one thing: the vast majority of folk new to Sigil and/or epub creation know nothing of invisible (x)html characters. Most like being able to see the entities where these special situations occur. THAT is why it's the default behavior, and why we've provided the more experienced users a way to override that behavior.

kevinhendricks commented 4 months ago

To make this issue clearer: in epub2, using named entities (i.e. "&nbsp;" vs. numeric entities such as "&#160;") requires the proper epub2 doctype, which is where the namespace for supporting named entities is provided.
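The difference can be seen with a stock XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a fragment that has no doctype at all; the helper name `parses` is made up for this example, and the behavior of any particular reading app may of course differ:

```python
import xml.etree.ElementTree as ET


def parses(markup):
    """Return True if the markup is accepted by a plain XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # A numeric character reference needs no entity declarations.
    print(parses("<p>10&#160;km</p>"))
    # A named entity is undefined when no DTD declares it.
    print(parses("<p>10&nbsp;km</p>"))
```

A numeric reference is resolved by the parser itself, while `&nbsp;` only has a meaning once something declares it.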

Because calibre ignores the doctype and does not require it, calibre is forced to convert all named entities to their character equivalents. That is okay, since they do that, but it is NOT spec behaviour. In epub3, only numeric entities are allowed, except for the basic XML entities required by parsing.
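For anyone who wants to normalise a file by hand, here is a minimal sketch of rewriting named entities as numeric ones while leaving the basic XML entities alone. It relies on the HTML 4 entity table in Python's standard `html.entities` module; the function name `named_to_numeric` is an assumption of this example, and this is not how Sigil or calibre do it internally:

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself; these stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup):
    """Rewrite known named entities (e.g. &nbsp;) as numeric ones (&#160;)."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML's own entities and unknown names untouched.
            return match.group(0)
        return "&#{};".format(name2codepoint[name])

    return _NAMED_ENTITY.sub(replace, markup)


if __name__ == "__main__":
    print(named_to_numeric("<p>Fish&nbsp;&amp;&nbsp;chips</p>"))
```

Running it on a copy of the XHTML first is the safer route, since a regular expression pass like this does not know about comments or CDATA sections.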

So no, this is not a bug on Sigil's part. Sigil's job is to generate epubs that can be made to meet the spec for epub2 or epub3, and to properly use and support legal entities if the epub creator so chooses.

Sigil-Ebook / Sigil