HairySpoon / htlfc

Hypertext Legacy File Converter
GNU Affero General Public License v3.0

Error when converting files with unicode filenames #7

Open grapenavy opened 1 year ago

grapenavy commented 1 year ago

I have a massive collection of maffs that I've been putting off salvaging for years, so I was absolutely thrilled to find htlfc.

However, when I test-ran it on maffs with unicode filenames, I got an error and the conversion failed. Any ideas?

I've included a maff that causes the issue, and a screenshot of the error. Hope that helps!

samples.zip

HairySpoon commented 1 year ago

Although I cannot reproduce the exact same error message, the file you sent does indeed break htlfc. It's not the filename. It's a tricky problem because the content declares itself to be utf-8 yet it contains multi-byte character sequences. A fix would implement a workaround plus regression testing, so allow me a week or more to allocate some time to the exercise.

Meanwhile, could you please examine the file inside hacked_decoder.zip?

I created this by forcing the decoder BUT it does not render properly on my workstation because I lack the applicable character set. If you can read it on your screen, then I'm on the right track.

hacked_decoder.zip
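For readers wondering what "forcing the decoder" can look like in Python, here is a minimal sketch. It is not the hack inside hacked_decoder.zip (that code is not shown in this thread); `decode_lenient` is a hypothetical helper that tries the declared charset first and, if that fails, decodes again while substituting U+FFFD for the bytes that do not fit.

```python
def decode_lenient(raw: bytes, declared: str = "utf-8") -> str:
    """Decode raw bytes with the declared charset, tolerating bad bytes.

    Hypothetical helper for illustration only, not htlfc code.
    """
    try:
        # Strict decode: succeeds when the content matches its declaration.
        return raw.decode(declared)
    except UnicodeDecodeError:
        # Forced decode: invalid bytes become U+FFFD instead of raising.
        return raw.decode(declared, errors="replace")


# A made-up sample: valid UTF-8 text with one stray byte in the middle.
sample = "訓令式".encode("utf-8") + b"\xff" + b"abc"
text = decode_lenient(sample)
print(text)
```

A forced decode of this kind keeps the conversion running, but every replaced byte is content that does not survive, so it is better suited to diagnosis than to a final fix.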

grapenavy commented 1 year ago

Thank you for your reply and clarification! I tested the hacked_decoder.html, and it seemed to render fine, despite losing some content. I've included a full-page screenshot of it, along with another full-page screenshot of the unzipped maff for comparison. I also included a few more maffs that had issues with conversion for you to test out. The content is in Traditional Chinese / Japanese. Hope that helps!

samples2.zip

HairySpoon commented 1 year ago

I'm confident that the latest release will work, including that missing content. Could you please (re)install to bring htlfc up to v0.5.0 and test with your samples? My tests ran without error but I cannot be sure that the content is correct.

grapenavy commented 1 year ago

I uninstalled the previous version and installed v0.5.0 but unfortunately ran into another error (which I included below). If it helps at all, I'm running Windows 10 in Traditional Chinese.

2023_0507_2140_44

HairySpoon commented 1 year ago

Thanks for the additional feedback. It's the same error as before, the one I could not reproduce - my platform is Linux.

Although I have a theory about the bug, it would not be appropriate to issue a new version on a hunch (unlike v0.5.0, which applied legitimate fixes to another problem). So I hope you can run some tests on your Windows machine.

Are you able to run htlfc from the command line? If so, could you locate one of the htlfc files and replace it with the debugging copy attached below (remember to save the original)? Finally, capture the output and attach it to this GitHub issue.

maff.py.zip
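A quick way to compare a Linux workstation with a Traditional Chinese Windows 10 machine is to print the encodings Python picks up from the platform. This is a small standalone sketch, not the debugging copy of maff.py mentioned above; `default_encodings` is a hypothetical helper name.

```python
import locale
import sys


def default_encodings() -> dict:
    """Collect the encodings Python falls back on when none is specified."""
    return {
        # Used by open() in text mode when no encoding argument is given
        # (unless Python's UTF-8 mode is enabled).
        "preferred": locale.getpreferredencoding(False),
        # Used to translate file names to and from the operating system.
        "filesystem": sys.getfilesystemencoding(),
        # Used when printing to the console.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


info = default_encodings()
for name, value in info.items():
    print(f"{name}: {value}")
```

Running this on both machines and comparing the "preferred" line would show whether the two platforms read text files with different default codecs.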

grapenavy commented 1 year ago

Hey there, so I replaced the 'maff.py' under the agents directory (I hope that's the right one), and batch converted the sample of maffs I sent before. Here is the log:

Converting .maff files in the directory: D:\20230501 Maff Salvage\More Maffs
This may take a while, depending on the number of files to convert...
Converting: D:\20230501 Maff Salvage\More Maffs\[2015_0103_0446_16] 訓令式羅馬字 - 維基百科,自由的百科全書.maff
Extract
Metadata 0
Metadata 1
Metadata 2
Error unpacking file: D:\20230501 Maff Salvage\More Maffs\[2015_0103_0446_16] 訓令式羅馬字 - 維基百科,自由的百科全書.maff
Failed to unpack: 'cp950' codec can't decode byte 0x93 in position 392: illegal multibyte sequence
Converting: D:\20230501 Maff Salvage\More Maffs\[2015_0107_0326_11] Tumblr 前 CTO:Apple 作業系統品質大不如前 - Inside 網摘.maff
Extract
Metadata 0
Metadata 1
Metadata 2
Error unpacking file: D:\20230501 Maff Salvage\More Maffs\[2015_0107_0326_11] Tumblr 前 CTO:Apple 作業系統品質大不如前 - Inside 網摘.maff
Failed to unpack: 'cp950' codec can't decode byte 0xe5 in position 352: illegal multibyte sequence
Converting: D:\20230501 Maff Salvage\More Maffs\[2015_0108_0417_26] 3D 列印廠 MakerBot 發表新材質,可仿石灰石、楓木、金屬質感 _ TechNews 科技新報.maff
Extract
Metadata 0
Metadata 1
Metadata 2
Error unpacking file: D:\20230501 Maff Salvage\More Maffs\[2015_0108_0417_26] 3D 列印廠 MakerBot 發表新材質,可仿石灰石、楓木、金屬質感 _ TechNews 科技新報.maff
Failed to unpack: 'cp950' codec can't decode byte 0xe5 in position 406: illegal multibyte sequence
Converting: D:\20230501 Maff Salvage\More Maffs\[2015_0108_2039_10] 小心,香噴噴的「滷汁」變成「化學汁」?! _ PanSci 泛科學.maff
Extract
Metadata 0
Metadata 1
Metadata 2
Error unpacking file: D:\20230501 Maff Salvage\More Maffs\[2015_0108_2039_10] 小心,香噴噴的「滷汁」變成「化學汁」?! _ PanSci 泛科學.maff
Failed to unpack: 'cp950' codec can't decode byte 0x8f in position 340: illegal multibyte sequence
Converting: D:\20230501 Maff Salvage\More Maffs\[2015_0120_1811_34] 【注目】開発2名! 台湾デベロッパーの本気作『Hero Emblems』の完成度が匠レベル [ファミ通app].maff
Extract
Metadata 0
Metadata 1
Metadata 2
Error unpacking file: D:\20230501 Maff Salvage\More Maffs\[2015_0120_1811_34] 【注目】開発2名! 台湾デベロッパーの本気作『Hero Emblems』の完成度が匠レベル [ファミ通app].maff
Failed to unpack: 'cp950' codec can't decode byte 0xe3 in position 346: illegal multibyte sequence
Converting: D:\20230501 Maff Salvage\More Maffs\[2015_0125_0030_02] ★樹莓派專賣店★【合併免運】7吋 LCD 液晶 顯示器 螢幕 屏幕 高分辨率 1080P 車載 Raspberry Pi - 露天拍賣-台灣 NO.1 拍賣網站.maff
Extract
Metadata 0
Metadata 1
Metadata 2
Error unpacking file: D:\20230501 Maff Salvage\More Maffs\[2015_0125_0030_02] ★樹莓派專賣店★【合併免運】7吋 LCD 液晶 顯示器 螢幕 屏幕 高分辨率 1080P 車載 Raspberry Pi - 露天拍賣-台灣 NO.1 拍賣網站.maff
Failed to unpack: 'cp950' codec can't decode byte 0xe2 in position 357: illegal multibyte sequence
Converting: D:\20230501 Maff Salvage\More Maffs\[2016_1015_1634_08] ひたすら効率化を突き詰める楽しさ。ローグライクRPG『Crowntakers』登場 - 4月18日の新作ゲーム情報.maff
Extract
Metadata 0
Metadata 1
Metadata 2
Error unpacking file: D:\20230501 Maff Salvage\More Maffs\[2016_1015_1634_08] ひたすら効率化を突き詰める楽しさ。ローグライクRPG『Crowntakers』登場 - 4月18日の新作ゲーム情報.maff
Failed to unpack: 'cp950' codec can't decode byte 0xe3 in position 358: illegal multibyte sequence
Converting: D:\20230501 Maff Salvage\More Maffs\[2016_1015_1636_45] 鮮やかな3D世界を飛ぶ『PixWing』9月10日発売。『This War of Mine』などのスタッフが参加するゲーム.maff
Extract
Metadata 0
Metadata 1
Metadata 2
Error unpacking file: D:\20230501 Maff Salvage\More Maffs\[2016_1015_1636_45] 鮮やかな3D世界を飛ぶ『PixWing』9月10日発売。『This War of Mine』などのスタッフが参加するゲーム.maff
Failed to unpack: 'cp950' codec can't decode byte 0x82 in position 362: illegal multibyte sequence
Converting: D:\20230501 Maff Salvage\More Maffs\[2017_0102_1628_50] 吉田誠治 on Twitter_ _なお、ルネサンス以前には望遠圧縮が起こらないパースが使用されていて、一部で「天使の遠近法」と呼ばれています。対角線も全て一つの消失点に集.maff
Extract
Metadata 0
Metadata 1
Metadata 2
Error unpacking file: D:\20230501 Maff Salvage\More Maffs\[2017_0102_1628_50] 吉田誠治 on Twitter_ _なお、ルネサンス以前には望遠圧縮が起こらないパースが使用されていて、一部で「天使の遠近法」と呼ばれています。対角線も全て一つの消失点に集.maff
Failed to unpack: 'cp950' codec can't decode byte 0xe5 in position 366: illegal multibyte sequence
Converting: D:\20230501 Maff Salvage\More Maffs\[2017_1105_0217_00] 女神転生シリーズのWindows用アクションゲームが無料公開中、実際にプレイしてみたレビュー - GIGAZINE.maff
Extract
Metadata 0
Metadata 1
Metadata 2
Error unpacking file: D:\20230501 Maff Salvage\More Maffs\[2017_1105_0217_00] 女神転生シリーズのWindows用アクションゲームが無料公開中、実際にプレイしてみたレビュー - GIGAZINE.maff
Failed to unpack: 'cp950' codec can't decode byte 0xa5 in position 363: illegal multibyte sequence
Converting: D:\20230501 Maff Salvage\More Maffs\[2017_1106_2154_44] Acer製マイクロソフトMRヘッドセットの使用感とは?徹底レビュー _ Mogura VR - 国内外のVR_AR_MR最新情報.maff
Extract
Metadata 0
Metadata 1
Metadata 2
Error unpacking file: D:\20230501 Maff Salvage\More Maffs\[2017_1106_2154_44] Acer製マイクロソフトMRヘッドセットの使用感とは?徹底レビュー _ Mogura VR - 国内外のVR_AR_MR最新情報.maff
Failed to unpack: 'cp950' codec can't decode byte 0x83 in position 362: illegal multibyte sequence
All .maff files in the directory have been converted to .html.

HairySpoon commented 1 year ago

Thanks for running those diagnostics and yes, you correctly identified the applicable file.

This time the attached file contains (hopefully) a fix. Could you please repeat the process and report what happens? It works on my platform but that's not a valid test.

maff.py.zip
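For anyone reading along, the 'cp950' messages in the log above are consistent with a UTF-8 text file being read with the Windows locale codec instead of UTF-8. The sketch below reproduces that class of failure and one common remedy, passing `encoding="utf-8"` to `open()`. It is an illustration under assumptions, not the contents of the attached maff.py: the file name `index.rdf` and the sample text are chosen here only for the demonstration.

```python
import os
import tempfile

# Hypothetical metadata text; any UTF-8 Japanese or Chinese text would do.
text = "ひたすら効率化を突き詰める楽しさ"

with tempfile.TemporaryDirectory() as tmp:
    # "index.rdf" is an assumed file name, used only for illustration.
    path = os.path.join(tmp, "index.rdf")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Simulate a Traditional Chinese Windows default by requesting cp950.
    locale_error = None
    try:
        with open(path, encoding="cp950") as f:
            f.read()
    except UnicodeDecodeError as exc:
        locale_error = exc

    # Naming the encoding makes the read independent of the platform.
    with open(path, encoding="utf-8") as f:
        explicit = f.read()

print("cp950 read failed:", locale_error is not None)
print("utf-8 read matches:", explicit == text)
```

On Linux the default is usually UTF-8 already, which would explain why code that omits the `encoding` argument can pass there and fail on a machine whose locale codec is cp950.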

grapenavy commented 1 year ago

Thanks for the fix! The sample maffs were all successfully converted. Here's the log:

Converting .maff files in the directory: D:\20230501 Maff Salvage\More Maffs
This may take a while, depending on the number of files to convert...
Converting: D:\20230501 Maff Salvage\More Maffs\[2015_0103_0446_16] 訓令式羅馬字 - 維基百科,自由的百科全書.maff
Reading Metadata...
Converting: D:\20230501 Maff Salvage\More Maffs\[2015_0107_0326_11] Tumblr 前 CTO:Apple 作業系統品質大不如前 - Inside 網摘.maff
Reading Metadata...
Converting: D:\20230501 Maff Salvage\More Maffs\[2015_0108_0417_26] 3D 列印廠 MakerBot 發表新材質,可仿石灰石、楓木、金屬質感 _ TechNews 科技新報.maff
Reading Metadata...
Converting: D:\20230501 Maff Salvage\More Maffs\[2015_0108_2039_10] 小心,香噴噴的「滷汁」變成「化學汁」?! _ PanSci 泛科學.maff
Reading Metadata...
Converting: D:\20230501 Maff Salvage\More Maffs\[2015_0120_1811_34] 【注目】開発2名! 台湾デベロッパーの本気作『Hero Emblems』の完成度が匠レベル [ファミ通app].maff
Reading Metadata...
Converting: D:\20230501 Maff Salvage\More Maffs\[2015_0125_0030_02] ★樹莓派專賣店★【合併免運】7吋 LCD 液晶 顯示器 螢幕 屏幕 高分辨率 1080P 車載 Raspberry Pi - 露天拍賣-台灣 NO.1 拍賣網站.maff
Reading Metadata...
Converting: D:\20230501 Maff Salvage\More Maffs\[2016_1015_1634_08] ひたすら効率化を突き詰める楽しさ。ローグライクRPG『Crowntakers』登場 - 4月18日の新作ゲーム情報.maff
Reading Metadata...
Converting: D:\20230501 Maff Salvage\More Maffs\[2016_1015_1636_45] 鮮やかな3D世界を飛ぶ『PixWing』9月10日発売。『This War of Mine』などのスタッフが参加するゲーム.maff
Reading Metadata...
Converting: D:\20230501 Maff Salvage\More Maffs\[2017_0102_1628_50] 吉田誠治 on Twitter_ _なお、ルネサンス以前には望遠圧縮が起こらないパースが使用されていて、一部で「天使の遠近法」と呼ばれています。対角線も全て一つの消失点に集.maff
Reading Metadata...
Converting: D:\20230501 Maff Salvage\More Maffs\[2017_1105_0217_00] 女神転生シリーズのWindows用アクションゲームが無料公開中、実際にプレイしてみたレビュー - GIGAZINE.maff
Reading Metadata...
Converting: D:\20230501 Maff Salvage\More Maffs\[2017_1106_2154_44] Acer製マイクロソフトMRヘッドセットの使用感とは?徹底レビュー _ Mogura VR - 国内外のVR_AR_MR最新情報.maff
Reading Metadata...
All .maff files in the directory have been converted to .html.

However, I did notice some differences when compared to the originals. Some seem unavoidable, but I've listed the ones I saw below in case you can tell what might be causing them:

Sample1: The differences should be obvious here

Sample2: This is pretty subtle, but the dot in the footer is a different unicode character. The converted html uses "‧" and the maff uses "‧". It's trivial here, and I'm unfamiliar with how it works, but in my experience something like this can point to many other potential differences

Sample3: At first I thought the video thumbnail and social media buttons came from external sources, but after disconnecting my internet and clearing the browser cache, the thumbnails and buttons still appeared for the maff

Sample4: The missing images (however the original's layout seems to already be messed up)

Sample5: The info bar seems to be affecting the layout here

samples3.zip
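Look-alike characters such as the footer dot in Sample2 are easier to compare by code point than by eye. The sketch below uses Python's standard `unicodedata` module; `describe` is a hypothetical helper, and the two dots are illustrative examples of visually similar characters, not necessarily the pair found in the sample files.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"


# Two dots that look nearly identical in many fonts (illustrative choice).
first, second = "\u2027", "\u00b7"

for ch in (first, second):
    print(repr(ch), describe(ch))

print("same character:", first == second)
```

Pasting the dot from the converted html and the dot from the maff into `describe` would show whether the conversion swapped one character for another or whether only the rendering differs.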