balthisar / tidy

Balthisar Tidy for macOS HTML Cleaner
https://www.balthisar.com/software/tidy

Import of html coded with unicode UTF-8 fails #8

Open ihemsen opened 4 years ago

ihemsen commented 4 years ago

I just started using Balthisar Tidy after having used PageSpinner for many years. When I open my old HTML files, Balthisar Tidy displays a message about an incorrect input-encoding and offers to convert to Mac OS Roman. Accepting this garbles the document, so Balthisar Tidy has apparently guessed wrong. Choosing "Ignore" does not import any text into the document. A sample HTML file and a screenshot are attached.

Skjermbilde 2020-03-22 kl  17 12 53

nett.html.zip

I use macOS Catalina 10.15.3 (19D76) and Balthisar Tidy version 4.2.0.

balthisar commented 4 years ago

@ihemsen, sorry I didn't notice this issue earlier. Can you post a sample file, so I can give best advice?

In the meantime, if you click ignore, it will look like nothing is imported; however, as long as you don't make any changes to the source HTML, then you can select different input-encoding settings to see if there's one that works.

Balthisar Tidy only uses macOS' guess, so for the original document even macOS is guessing wrong.

Edit: oops, I see the attachment. I'll be back.

balthisar commented 4 years ago

If I bring your sample document into BBEdit, for example, it shows me that it's Western (ISO Latin 1) with Unix line endings. It looks good, and all of the diacriticals look good, or at least not garbled (I don't really read Norwegian).

In Balthisar Tidy, using the guessed MacOS Roman screws things up severely.

If I manually choose Western (ISO Latin 1) as the input encoding, the document then displays correctly.
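For illustration, here is a minimal Python sketch of why the choice of input encoding matters for this kind of file. It is not Balthisar Tidy's or HTML Tidy's code; the sample text is made up (the attached file is not reproduced here), and `try_decode` is a hypothetical helper. It encodes some Norwegian text as ISO Latin 1 and then decodes the same bytes three ways.

```python
# Hypothetical sample text standing in for the attached document.
original = "Blåbærsyltetøy på brødskiva"

# The bytes as an ISO Latin 1 (ISO-8859-1) editor would have saved them.
raw = original.encode("iso-8859-1")


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, "utf-8")            # strict UTF-8 rejects these bytes
as_mac_roman = try_decode(raw, "mac_roman")   # decodes, but to the wrong characters
as_latin1 = try_decode(raw, "iso-8859-1")     # the encoding the file was written in

for label, text in [("utf-8", as_utf8), ("mac_roman", as_mac_roman), ("iso-8859-1", as_latin1)]:
    print(f"{label:12} {text!r}")
```

Every byte is a valid character in both Mac OS Roman and ISO Latin 1, so neither decode fails; only one of them produces the intended letters. That is why an encoding guess can be wrong without any error being raised.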

@ihemsen, I think knowing this will solve your issue. As far as fixing the bug, as indicated above, it's macOS doing the guessing, so I probably won't be able to fix the issue. I imagine that the reason BBEdit works is because they have their own character encoding libraries dating back to the System 7 era!

In any case, let me know if this works for you, and again, sorry for the delay.

ihemsen commented 4 years ago

Thanks @balthisar. This works, but the user interface was not obvious even with your guidance above. To find the input-encoding option I first looked in the Edit menu, then in the File menu, and finally found it in Preferences under the Tidy tab. After that, the document loaded correctly the next time. Having done that, I now also see the per-document input-encoding setting in the left pane (Tidy options).

Other software I have worked with has had the encoding change in the Edit menu. As a new Balthisar Tidy user, I did look in the "obvious" places in the menu system.

So when importing this HTML source, the following happens:

  1. Mac OS reports the character set as "Mac OS Roman"
  2. Balthisar Tidy has "Unicode (UTF-8)" as input encoding
  3. The source code itself states content="text/html; charset=iso-8859-1"

Balthisar Tidy detects a mismatch between 1 and 2 and ignores 3. Then it displays the following dialogue box:

"Balthisar Tidy opened your document "index.html" successfully, but it appears that the Tidy input-encoding is not properly set. Currently "Unicode (UTF-8)" is specified.

Balthisar Tidy will automatically set input encoding to "Western (Mac OS Roman)" for you (unless you choose to ignore). This guess may not be correct, so you should have a look at the Source HTML afterwards and choose the correct input-encoding for this document before making any other changes. [Allow Change] [Ignore]

Hint: you can choose a default input-encoding in Preferences if you open this type of file often."

First of all: Would it be possible for Balthisar Tidy to use the encoding declared by the source code as a third option? [Allow Change] [Use "iso-8859-1"] [Ignore]
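As a rough illustration of what "use the encoding declared by the source code" could involve, here is a minimal Python sketch. It is an assumption-laden example, not how Balthisar Tidy or HTML Tidy actually works: `declared_charset`, `decode_with_declared_charset`, and the 1024-byte scan limit are all hypothetical choices made for this sketch, and a simple regular expression stands in for real HTML parsing.

```python
import codecs
import re

# Matches charset=... in either <meta charset="..."> or
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE)


def declared_charset(data: bytes, scan_limit: int = 1024):
    """Return the codec name declared near the top of the file, or None."""
    match = _CHARSET_RE.search(data[:scan_limit])
    if match is None:
        return None
    label = match.group(1).decode("ascii")
    try:
        # Normalize the label and reject names Python does not know.
        return codecs.lookup(label).name
    except LookupError:
        return None


def decode_with_declared_charset(data: bytes, fallback: str = "utf-8"):
    """Decode using the declared charset if there is one, else the fallback."""
    encoding = declared_charset(data) or fallback
    return data.decode(encoding, errors="replace"), encoding


# Hypothetical document resembling the one described in this issue.
sample = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=iso-8859-1"></head><body>'
    + "Blåbær".encode("iso-8859-1")
    + b"</body></html>"
)

text, used = decode_with_declared_charset(sample)
print(used)
print(text)
```

A declaration can itself be wrong, so a real implementation would probably still offer it only as one option alongside the existing choices, as the proposed third button does.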

The empty source HTML window is also a bit confusing. Perhaps Balthisar Tidy could display greyed-out text that can be scrolled but not edited until the input-encoding is changed? If you click in the window, a dialogue box could tell you to set the input-encoding in the left pane.

The text in the dialogue box cited above is understandable when you are familiar with Balthisar Tidy, but importing old HTML is perhaps the first thing a new user does. I suggest changing the text to:

"Balthisar Tidy has detected a mismatch between the character set reported by Mac OS, "Mac OS Roman", and the input-encoding setting, "Unicode (UTF-8)". The input-encoding setting can be found in the left pane "TIDY OPTIONS". Try different encodings and check the text to see if it is displayed correctly. Balthisar Tidy will import the text as "Mac OS Roman". [OK] Hint: you can choose a default input-encoding in Preferences if you open this type of file often."

balthisar commented 4 years ago

Thanks for the feedback. I'll see what I can do. I might have to refer this to the upstream project (HTML Tidy proper), as Balthisar Tidy doesn't touch encoding (other than trying to set the input-encoding if it detects no document upon loading). I might consider looking at the charset definition, but I generally try to let HTML Tidy do the heavy lifting.

I'm proud of the human interface, but it's apparent that this part isn't working. I will definitely do something to improve the experience in an upcoming release. I love hearing this type of feedback and being able to improve things.

Thanks.

ihemsen commented 4 years ago

Brilliant! I am looking forward to tidying my website with Balthisar.

balthisar commented 3 years ago

@ihemsen, it's been a while, and I hope you're still using Balthisar Tidy. If so, and you've not updated to the newest version, I'd love to know your opinion on the behavior of the newest version when there are encoding mismatches!