N0taN3rd / wail

:whale2: One-Click User Instigated Preservation
http://matkelly.com/wail
GNU General Public License v3.0

Issues With Saving Chinese Language Website #103

Open milliem-3923 opened 5 years ago

milliem-3923 commented 5 years ago

Are you submitting a bug report or a feature request?

Bug report

What is the current behavior?

When opened in the Wayback viewer, the Chinese text of a certain website is muddled and unreadable, much as if I had saved a Chinese Excel file and then reopened it with the wrong encoding. Even if you can't read Chinese, you can see that the original and the saved version differ. As far as I can tell, only this webpage exhibits the issue.

STEPS: Save pages from the website below and then open them in the viewer. http://kksk.org/youji/r_812_1.html

What is the expected behavior?

The pages should be readable and the same as the original.

What's your environment?

WAIL: 1.2.0-beta3.5
OS: 64-bit Windows 10 Home, Version 10.0.17134, Build 17134
Browsers: Firefox and Chrome (the issue presents the same in both)

Thanks, hope this is in the right place. I spent a lot of time saving these pages so it would also be nice to know if the files can be salvaged.

machawk1 commented 5 years ago

I was able to replicate this, though in a slightly different environment because of which releases are available.

I used the WAIL 1.2.0-beta3 binary for macOS (the latest listed in releases for the platform). I created a new collection and ran a page-only archive of http://kksk.org/youji/r_812_1.html.

The characters on the archived page are different than those on the live Web, with some being displayed as "unknown":

[Screenshot: archived page as replayed, taken 2019-02-02 at 10:16:41 AM]

Live Web:

[Screenshot: the same page on the live Web]

Perhaps this is a change in encoding at replay time.
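As a rough illustration of how an encoding mismatch could produce both the garbled and the "unknown" characters described above, here is a minimal Python sketch. The sample string is hypothetical and is not taken from the archived page, and the sketch does not show what WAIL or its replay component actually does; it only shows what happens when GBK bytes are decoded with the wrong codec.

```python
# Illustrative only: this sample string is an assumption, not content
# from the archived page.
original = "中文网页"

# The bytes a server declaring charset=GBK would send for this text.
gbk_bytes = original.encode("gbk")

# Decoding with the matching codec round-trips cleanly.
as_gbk = gbk_bytes.decode("gbk")

# Decoding the same bytes as UTF-8 fails; with errors="replace" each
# invalid sequence becomes U+FFFD, which browsers render as "unknown".
as_utf8 = gbk_bytes.decode("utf-8", errors="replace")

# Latin-1 accepts every byte, so nothing is replaced, but the result
# is unrelated accented Latin characters instead of Chinese.
as_latin1 = gbk_bytes.decode("latin-1")

print("GBK round-trip matches:", as_gbk == original)
print("UTF-8 decode has U+FFFD:", "\ufffd" in as_utf8)
print("Latin-1 decode:", as_latin1)
```

If something like this is happening at replay time, the bytes stored in the WARC could still be intact, since only the decoding step would be wrong.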

WARC: kksk.org!youji!r_812_1.html-default-1549120546141.warc.txt

machawk1 commented 5 years ago

The header Content-Type: text/html; charset=GBK is the same for the replayed memento and the live Web (GBK is an encoding for Simplified Chinese), as verified by running curl -I against the respective URI-M and URI-R.
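The header comparison can be sketched in Python as follows. The helper name charset_of is hypothetical, and the two header values are simply the one quoted in this thread; in practice they would come from the curl -I output for the URI-R and URI-M. The sketch uses the standard library's email.message.Message to parse the charset parameter.

```python
from email.message import Message


def charset_of(content_type):
    """Return the lowercased charset parameter of a Content-Type value,
    or None if the header declares no charset. (Hypothetical helper.)"""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Header values as reported in this thread for both responses.
live_header = "text/html; charset=GBK"
memento_header = "text/html; charset=GBK"

print("live charset:", charset_of(live_header))
print("memento charset:", charset_of(memento_header))
print("declared charsets match:", charset_of(live_header) == charset_of(memento_header))
```

Matching declared charsets only show that both responses label the body the same way. They do not by themselves show that the body bytes are identical, so a difference could still arise elsewhere, for example if the body were transcoded or rewritten during replay.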

milliem-3923 commented 5 years ago

Sorry, I don't understand your second answer. Is that confirmation that the issue is with encoding at replay time? If so, does that mean the original file is uncorrupted?