danny0838 / webscrapbook

A browser extension that captures web pages to local device or backend server for future retrieval, organization, annotation, and edit. This project inherits from legacy Firefox add-on ScrapBook X.
Mozilla Public License 2.0
907 stars 121 forks

Garbled Chinese characters #384

Closed whistleho closed 5 months ago

whistleho commented 5 months ago

  1. Using the URL http://example.com as an example, use the "編輯分頁" (Edit Tab) function to add the Chinese text "測試" ("test") to the webpage, then click the save control (the disk-like icon) to capture the webpage. The newly captured webpage title appears in the sidebar. After clicking that title in the sidebar, the Chinese text displays normally.

Capturing (document) [1966690559] https://example.com/ ...
Saving data...
Saved to "20240530071423487"
Updating server index for item "20240530071423487"...
Done.

  2. At this point, whether or not any further editing has been done, click the save control again. When you then click that webpage title in the sidebar, the original Chinese text has turned into garbled characters.

Saving (document) [1966690568] http://localhost:8080/20240530071423487/index.html ...
Updated http://localhost:8080/20240530071423487/index.html
Updating server index for item "20240530071423487"...
Done.

  3. If you click the save control again at this point, an error message appears.

Saving (document) [1966690568] http://localhost:8080/20240530071423487/index.html ...
Fatal error: 無法儲存非 UTF-8 編碼的檔案 (http://localhost:8080/20240530071423487/index.html)。
(Translation of the error message: "Unable to save a file that is not UTF-8 encoded.")

In step 2 or 3 above, if I drag the index.html file from the captured directory into the Chrome browser, the Chinese text displays normally, but garbled characters are shown when I click that webpage title in the sidebar.
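
Editorial note: the symptoms above (text that reads correctly when the file is opened directly but is garbled when served, followed by a "non-UTF-8" error on the next save) look like an encoding mismatch. The following is a minimal Python sketch of that general mechanism, not code from WebScrapBook or PyWSB. The helper names `is_valid_utf8`, `misdecode`, and `check_file` are hypothetical, and the choice of `latin-1` as the wrong charset is only an assumption for illustration; the thread does not establish which charset was actually involved.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def misdecode(text: str, wrong_encoding: str = "latin-1") -> str:
    """Simulate garbling: encode as UTF-8, then decode with a wrong charset.

    The default "latin-1" is an illustrative assumption, not a diagnosis.
    """
    return text.encode("utf-8").decode(wrong_encoding)


def check_file(path: str) -> bool:
    """Hypothetical helper: report whether a saved file is valid UTF-8."""
    return is_valid_utf8(Path(path).read_bytes())


original = "測試"
garbled = misdecode(original)

# The garbled string differs from the original, but no bytes were lost:
# reversing the wrong decode recovers the text.
print(repr(garbled))
print(garbled.encode("latin-1").decode("utf-8") == original)

# Bytes that are not UTF-8 at all fail the validity check.
print(is_valid_utf8(original.encode("utf-8")))
print(is_valid_utf8(b"\xff\xfe"))
```

To inspect a captured page, `check_file` could be pointed at the saved `index.html` inside the item folder (for example, the `20240530071423487` folder from the logs above); the exact location depends on the backend's data directory.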

danny0838 commented 5 months ago

We cannot reproduce the issue following the steps you have provided.

Please provide:

  1. the version of your OS, browser, WSB extension, and PyWSB.
  2. the capture settings (you can copy from Capture as => Advanced)
  3. the config file of your backend server

whistleho commented 5 months ago

  1. Windows 10 Pro, WSB extension ver. 2.9.1, webscrapbook 2.3.3

  2.
{
  "tasks": [
    {
      "comment": "",
      "tabId": 1966690612,
      "title": "質權|法律百科 Legispedia",
      "url": "https://www.legis-pedia.com/dictionary/6215"
    }
  ],
  "bookId": "",
  "parentId": "root",
  "index": null,
  "mode": "",
  "delay": null,
  "options": {
    "capture.applet": "blank",
    "capture.audio": "save",
    "capture.backupForRecapture": true,
    "capture.base": "blank",
    "capture.canvas": "save",
    "capture.contentSecurityPolicy": "remove",
    "capture.deleteErasedOnCapture": true,
    "capture.deleteErasedOnSave": false,
    "capture.downLink.doc.delay": null,
    "capture.downLink.doc.depth": null,
    "capture.downLink.doc.mode": "source",
    "capture.downLink.doc.urlFilter": "",
    "capture.downLink.file.extFilter": "###image\n#bmp, gif, ico, jpg, jpeg, jpe, jp2, png, tif, tiff, svg\n###audio\n#aac, ape, flac, mid, midi, mp3, ogg, oga, ra, ram, rm, rmx, wav, wma\n###video\n#avc, avi, flv, mkv, mov, mpg, mpeg, mp4, wmv\n###archive\n#zip, rar, jar, bz2, gz, tar, rpm, 7z, 7zip, xz, jar, xpi, lzh, lha, lzma\n#/z[0-9]{2}|r[0-9]{2}/\n###document\n#pdf, doc, docx, xls, xlsx, ppt, pptx, odt, ods, odp, odg, odf, rtf, txt, csv\n###executable\n#exe, msi, dmg, bin, xpi, iso\n###any non-web-page\n#/(?!$|html?|xht(ml)?|php|py|pl|aspx?|cgi|jsp)(.*)/i",
    "capture.downLink.file.mode": "none",
    "capture.downLink.urlExtra": "",
    "capture.downLink.urlFilter": "###skip common logout URL\n/[/=]logout\b/i",
    "capture.downloadRetryCount": 3,
    "capture.downloadRetryDelay": 1000,
    "capture.downloadWorkers": 4,
    "capture.embed": "blank",
    "capture.favicon": "save",
    "capture.faviconAttrs": "",
    "capture.font": "save-used",
    "capture.formStatus": "keep",
    "capture.frame": "save",
    "capture.frameRename": true,
    "capture.helpers": "",
    "capture.helpersEnabled": false,
    "capture.image": "save",
    "capture.imageBackground": "save-used",
    "capture.insertInfoBar": false,
    "capture.linkUnsavedUri": false,
    "capture.mergeCssResources": true,
    "capture.noscript": "save",
    "capture.object": "blank",
    "capture.ping": "blank",
    "capture.prefetch": "remove",
    "capture.preload": "remove",
    "capture.prettyPrint": false,
    "capture.recordDocumentMeta": true,
    "capture.recordRewrites": false,
    "capture.referrerPolicy": "",
    "capture.referrerSpoofSource": false,
    "capture.remoteTabDelay": null,
    "capture.removeHidden": "none",
    "capture.resourceSizeLimit": null,
    "capture.rewriteCss": "url",
    "capture.saveAs": "folder",
    "capture.saveAsciiFilename": false,
    "capture.saveDataUriAsFile": true,
    "capture.saveDataUriAsSrcdoc": true,
    "capture.saveFileAsHtml": false,
    "capture.saveFilename": "%id%",
    "capture.saveFilenameMaxLenUtf16": 120,
    "capture.saveFilenameMaxLenUtf8": 240,
    "capture.saveFolder": "WebScrapBook/data",
    "capture.saveOverwrite": false,
    "capture.saveResourcesSequentially": false,
    "capture.saveTo": "server",
    "capture.script": "remove",
    "capture.serverUploadRetryCount": 3,
    "capture.serverUploadRetryDelay": 2000,
    "capture.serverUploadWorkers": 4,
    "capture.shadowDom": "save",
    "capture.style": "save",
    "capture.styleInline": "save",
    "capture.video": "save",
    "capture.zipCompressLevel": null
  }
}

  3.
; Run "wsb help config" for details

[app]
; name = WebScrapBook
; theme = default
; locale =
; root = .
; index =
; backup_dir = .wsb/backup
; content_security_policy = strict
; allowed_x_for = 0
; allowed_x_proto = 0
; allowed_x_host = 0
; allowed_x_port = 0
; allowed_x_prefix = 0

[book ""]
name = scrapbook
top_dir =
data_dir = data
tree_dir = tree
index = tree/map.html
no_tree = false
new_at_top = false
inclusive_frames = true
static_index = false
rss_root =
rss_item_count = 50

; [auth "user1"]
; user = myuser1
; pw = pbkdf2:sha256:1000$jlbk3RVDGdR6TVDvRAie3HPSMejGTedw$2f3935a508c20c63c5bcdf7d96d853fc1df7d6dcab824e43a2aa2570bfcd0bef
; permission = all

; [auth "user2"]
; user = myuser2
; pw =
; permission = read

[server]
; port = 8080
; host = localhost
; ssl_on = true
; ssl_key = ./wsb/webscrapbook.key
; ssl_cert = ./wsb/webscrapbook.crt
; browse = false

[browser]
; command =
; cache_prefix = webscrapbook.
; cache_expire = 259200
; use_jar = false

danny0838 commented 5 months ago

We still cannot reproduce the issue.

Please try creating a new, clean browser user/profile, install the WSB extension, and check whether the issue persists.

If it still persists, please also provide details about how you add the Chinese text to the page after invoking "編輯頁面" (Edit Page), e.g. a short video.

whistleho commented 5 months ago

Thank you for your reminder. After testing each extension one by one, I discovered that the pCloud extension was causing the issue. After disabling it, everything is back to normal. Thank you for your assistance.

danny0838 commented 5 months ago

Thank you for following up. Closing the issue as resolved.