danny0838 / webscrapbook

A browser extension that captures web pages to local device or backend server for future retrieval, organization, annotation, and edit. This project inherits from legacy Firefox add-on ScrapBook X.
Mozilla Public License 2.0

Merge capture broken on case-sensitive filesystems. #349

Closed iiv3 closed 6 months ago

iiv3 commented 11 months ago

To start a merge capture, you first do a "normal capture" with "Depth to capture linked pages:" set to 0 or more. This creates an index.json file in the capture folder that holds the association between already stored files and their URLs.

The problem is that while the files are stored with their original case on the filesystem, the index.json file lists the path to the file in lowercase.

  {
   "path": "yrsrrjgo_bigger.jpg",
   "url": "https://pbs.twimg.com/profile_images/[...]/yRsRRjGO_bigger.jpg",
   "role": "resource",
   "token": "7572b9496cbc2240fb195353af28194a83afe821"
  },

On a second capture/merge, that path is used to replace already captured resources. As a result, the resource cannot be found if its original URL name contained an upper-case character and the OS file lookup is case sensitive.
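The mismatch described above can be sketched in a few lines. This is only an illustration of the lookup failure, not WebScrapBook's actual code; the helper names and the `files_on_disk` set are hypothetical, and the entry mirrors the index.json record quoted above.

```python
# Hypothetical illustration; not WebScrapBook's actual code.
# File names as they exist on disk, with their original case preserved.
files_on_disk = {"yRsRRjGO_bigger.jpg", "index.html"}

# Entry shaped like the index.json record quoted above.
entry = {
    "path": "yrsrrjgo_bigger.jpg",
    "url": "https://pbs.twimg.com/profile_images/[...]/yRsRRjGO_bigger.jpg",
    "role": "resource",
}


def exists_case_sensitive(name, names):
    """Exact-match lookup, as on a case-sensitive filesystem."""
    return name in names


def exists_case_insensitive(name, names):
    """Case-folded lookup, roughly as on a case-insensitive filesystem."""
    return name.casefold() in {n.casefold() for n in names}


found_sensitive = exists_case_sensitive(entry["path"], files_on_disk)
found_insensitive = exists_case_insensitive(entry["path"], files_on_disk)
print(found_sensitive, found_insensitive)  # prints: False True
```

The lower-cased path only resolves when the lookup ignores case, which matches the symptom: the merged page works on a case-insensitive filesystem and loses resources on a case-sensitive one.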

The capture option "Save ASCII filename" doesn't seem to have any effect in this case. The same goes for "Save data URL as file".

I'm running WebScrapBook 2.0.4 extension on Chromium 114 (Linux).

danny0838 commented 11 months ago

To prevent potential conflicts among different filesystems, WSB saves all files with all-lower-case names. Just keep this in mind and search for the all-lower-case version of the filename (or search by URL in index.json and then look up the corresponding saved filename).

iiv3 commented 11 months ago

There is no lower-case version of the file saved. That's why the merge result is broken.

And please, preserve the original case of the file. It's the correct way to handle a case-sensitive filesystem. Otherwise you are going to open a much bigger can of worms. After all, URLs are case sensitive.

danny0838 commented 11 months ago

> There is no lower-case version of the file saved. That's why the merge result is broken.

This should not happen. If it does please provide a real case example.

> And please, preserve the original case of the file. It's the correct way to handle a case-sensitive filesystem. Otherwise you are going to open a much bigger can of worms. After all, URLs are case sensitive.

This is for cross-platform compatibility. First, it's not possible for the browser to detect whether the filesystem is case sensitive. Additionally, there will be a problem when files are moved to a case-insensitive filesystem if both "image.jpg" and "IMAGE.JPG" exist.
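The collision concern can be made concrete with a small check. This is a minimal sketch, assuming a plain list of saved filenames; `find_case_collisions` is a hypothetical helper and not part of WSB.

```python
from collections import defaultdict


def find_case_collisions(filenames):
    """Group names that would clash on a case-insensitive filesystem.

    Returns a dict mapping each case-folded name to the list of original
    names that share it, keeping only groups with more than one member.
    """
    groups = defaultdict(list)
    for name in filenames:
        groups[name.casefold()].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


collisions = find_case_collisions(["image.jpg", "IMAGE.JPG", "style.css"])
print(collisions)
```

Any non-empty result means the set of files cannot be copied unchanged to a case-insensitive filesystem, since only one name from each group can exist there.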

iiv3 commented 11 months ago

How should I provide a real case example? The one I've provided is from a capture of the profile page of Twitter's current owner.

You don't need to change the filename if there is no conflict. Different "image.jpg" could exist in multiple captured pages. You do handle that type of conflict, don't you?

Apparently, the current code already has both case-sensitive and lower-case filenames in different lists.
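The approach suggested above (keep the original case, rename only on a conflict) could look roughly like the following. This is a sketch under stated assumptions, not the extension's implementation; `assign_filename`, the `used_folded` set, and the `-1`, `-2` suffix scheme are all hypothetical.

```python
import os


def assign_filename(name, used_folded):
    """Pick a filename that keeps its original case but stays unique
    when compared case-insensitively.

    used_folded is a set of case-folded names already taken; it is
    updated in place with the name that gets assigned.
    """
    stem, ext = os.path.splitext(name)
    candidate = name
    counter = 1
    # Append a numeric suffix until no case-insensitive clash remains.
    while candidate.casefold() in used_folded:
        candidate = f"{stem}-{counter}{ext}"
        counter += 1
    used_folded.add(candidate.casefold())
    return candidate


used = set()
names = [assign_filename(n, used) for n in ["image.jpg", "IMAGE.JPG", "Logo.png"]]
print(names)  # prints: ['image.jpg', 'IMAGE-1.JPG', 'Logo.png']
```

With this scheme the uniqueness check is done on the folded name while the stored name keeps its case, so the result is safe on both kinds of filesystem.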

I wouldn't advise changing the case of non-ASCII filenames.

danny0838 commented 11 months ago

I cannot get the exact issue from your description.

Please provide a reproducible case and the exact steps to reproduce the issue: the source URL, how you run the capture (including the capture options), how you perform the merge capture, and what's wrong in the result.

iiv3 commented 11 months ago

I see why you can't reproduce it.

I use the "OldTwitter" extension, which always loads the same original-size image, so the images are the same. If you follow my instructions, you get multiple versions with different sizes, and they all get their own new files.

Let me find a simpler site.

iiv3 commented 11 months ago

OK, the Linux graphics server site is simple enough.

First capture, the home page:

{
 "tasks": [
  {
   "comment": "",
   "fullPage": true,
   "tabId": 526183075,
   "title": "X.Org",
   "url": "https://www.x.org/wiki/"
  }
 ],
 "bookId": "Temp",
 "parentId": "root",
 "index": null,
 "mode": "",
 "delay": null,
 "options": {
  "capture.applet": "blank",
  "capture.audio": "save-current",
  "capture.backupForRecapture": true,
  "capture.base": "blank",
  "capture.canvas": "save",
  "capture.contentSecurityPolicy": "remove",
  "capture.deleteErasedOnCapture": true,
  "capture.deleteErasedOnSave": true,
  "capture.downLink.doc.delay": null,
  "capture.downLink.doc.depth": 0,
  "capture.downLink.doc.mode": "source",
  "capture.downLink.doc.urlFilter": "",
  "capture.downLink.file.extFilter": "",
  "capture.downLink.file.mode": "none",
  "capture.downLink.urlExtra": "",
  "capture.downLink.urlFilter": "",
  "capture.downloadRetryCount": 3,
  "capture.downloadRetryDelay": 1000,
  "capture.embed": "blank",
  "capture.favicon": "save",
  "capture.faviconAttrs": "",
  "capture.font": "link",
  "capture.formStatus": "keep",
  "capture.frame": "save",
  "capture.frameRename": true,
  "capture.helpers": "",
  "capture.helpersEnabled": false,
  "capture.image": "save-current",
  "capture.imageBackground": "save-used",
  "capture.insertInfoBar": false,
  "capture.linkUnsavedUri": true,
  "capture.mergeCssResources": true,
  "capture.noscript": "save",
  "capture.object": "blank",
  "capture.ping": "blank",
  "capture.prefetch": "remove",
  "capture.preload": "remove",
  "capture.prettyPrint": false,
  "capture.recordDocumentMeta": true,
  "capture.recordRewrites": false,
  "capture.referrerPolicy": "strict-origin-when-cross-origin",
  "capture.referrerSpoofSource": false,
  "capture.remoteTabDelay": 300,
  "capture.removeHidden": "undisplayed",
  "capture.resourceSizeLimit": null,
  "capture.rewriteCss": "url",
  "capture.saveAs": "folder",
  "capture.saveAsciiFilename": false,
  "capture.saveDataUriAsFile": true,
  "capture.saveDataUriAsSrcdoc": true,
  "capture.saveFileAsHtml": false,
  "capture.saveFilename": "%create-Y%.%create-m%/%id%_%source-host%",
  "capture.saveFilenameMaxLenUtf16": 120,
  "capture.saveFilenameMaxLenUtf8": 240,
  "capture.saveFolder": "WebScrapBook/data",
  "capture.saveOverwrite": false,
  "capture.saveResourcesSequentially": false,
  "capture.saveTo": "server",
  "capture.script": "remove",
  "capture.serverUploadRetryCount": 3,
  "capture.serverUploadRetryDelay": 2000,
  "capture.serverUploadWorkers": 4,
  "capture.shadowDom": "save",
  "capture.style": "save",
  "capture.styleInline": "save",
  "capture.video": "save-current",
  "capture.zipCompressLevel": null
 }
}

Then the merge capture, on the second link in the first paragraph, "The X.Org Foundation", which leads to an "about" page:

{
 "tasks": [
  {
   "fullPage": true,
   "mergeCaptureInfo": {
    "bookId": "Temp",
    "itemId": "20230813170453252"
   },
   "tabId": 526183075,
   "url": "https://www.x.org/wiki/XorgFoundation/"
  }
 ],
 "bookId": "Temp",
 "parentId": "20230813170453252",
 "index": null,
 "mode": "",
 "delay": null,
 "options": {
  "capture.applet": "blank",
  "capture.audio": "save-current",
  "capture.backupForRecapture": true,
  "capture.base": "blank",
  "capture.canvas": "save",
  "capture.contentSecurityPolicy": "remove",
  "capture.deleteErasedOnCapture": true,
  "capture.deleteErasedOnSave": true,
  "capture.downLink.doc.delay": null,
  "capture.downLink.doc.depth": 0,
  "capture.downLink.doc.mode": "source",
  "capture.downLink.doc.urlFilter": "",
  "capture.downLink.file.extFilter": "",
  "capture.downLink.file.mode": "none",
  "capture.downLink.urlExtra": "",
  "capture.downLink.urlFilter": "",
  "capture.downloadRetryCount": 3,
  "capture.downloadRetryDelay": 1000,
  "capture.embed": "blank",
  "capture.favicon": "save",
  "capture.faviconAttrs": "",
  "capture.font": "link",
  "capture.formStatus": "keep",
  "capture.frame": "save",
  "capture.frameRename": true,
  "capture.helpers": "",
  "capture.helpersEnabled": false,
  "capture.image": "save-current",
  "capture.imageBackground": "save-used",
  "capture.insertInfoBar": false,
  "capture.linkUnsavedUri": true,
  "capture.mergeCssResources": true,
  "capture.noscript": "save",
  "capture.object": "blank",
  "capture.ping": "blank",
  "capture.prefetch": "remove",
  "capture.preload": "remove",
  "capture.prettyPrint": false,
  "capture.recordDocumentMeta": true,
  "capture.recordRewrites": false,
  "capture.referrerPolicy": "strict-origin-when-cross-origin",
  "capture.referrerSpoofSource": false,
  "capture.remoteTabDelay": 300,
  "capture.removeHidden": "undisplayed",
  "capture.resourceSizeLimit": null,
  "capture.rewriteCss": "url",
  "capture.saveAs": "folder",
  "capture.saveAsciiFilename": false,
  "capture.saveDataUriAsFile": true,
  "capture.saveDataUriAsSrcdoc": true,
  "capture.saveFileAsHtml": false,
  "capture.saveFilename": "%create-Y%.%create-m%/%id%_%source-host%",
  "capture.saveFilenameMaxLenUtf16": 120,
  "capture.saveFilenameMaxLenUtf8": 240,
  "capture.saveFolder": "WebScrapBook/data",
  "capture.saveOverwrite": false,
  "capture.saveResourcesSequentially": false,
  "capture.saveTo": "server",
  "capture.script": "remove",
  "capture.serverUploadRetryCount": 3,
  "capture.serverUploadRetryDelay": 2000,
  "capture.serverUploadWorkers": 4,
  "capture.shadowDom": "save",
  "capture.style": "save",
  "capture.styleInline": "save",
  "capture.video": "save-current",
  "capture.zipCompressLevel": null
 }
}

When you go to the merged page in the archive, the "donate" buttons have lost their images. There is a small square image placeholder and some text.

I have to remind you that if you run the WSB server on an OS that is not case-sensitive, it will manage to find the files with the changed names.
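When trying to reproduce this, it may help to confirm which kind of filesystem the server's data folder sits on. The probe below is a rough sketch that runs on the server machine, not in the browser; `is_case_sensitive` is a hypothetical helper, and probing the system temp directory here is only an assumption for the example.

```python
import os
import tempfile


def is_case_sensitive(directory):
    """Probe whether `directory` is on a case-sensitive filesystem.

    Creates a temporary file whose name contains letters, then checks
    whether the same name with swapped case also resolves to a file.
    """
    with tempfile.NamedTemporaryFile(prefix="CaseProbe_", dir=directory) as probe:
        swapped = os.path.join(directory, os.path.basename(probe.name).swapcase())
        return not os.path.exists(swapped)


result = is_case_sensitive(tempfile.gettempdir())
print("case-sensitive" if result else "case-insensitive")
```

If this reports a case-insensitive filesystem, the broken images described above are not expected to show up, because the lower-cased paths still resolve.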

danny0838 commented 11 months ago

OK. I get it. We may need further investigation for a solution, though.

danny0838 commented 11 months ago

v2.1.0 should have fixed the issue.