dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.

Preserving full file names for chapters #287

Closed iG8R closed 4 years ago

iG8R commented 4 years ago

Hi. Is it possible to preserve full file names for xhtml files in the epub file? What I mean:

dteviot commented 4 years ago

@iG8R

  1. Why do you want the original filenames kept?
  2. The original URLs are recorded in the content.opf file in the epub. (Note, you'll need to match the id attribute of the <item> elements to the id of the <dc:source> elements.)
iG8R commented 4 years ago

@dteviot

  1. Quite a few novels have chapters whose pages show no chapter number, no chapter name, or neither; that information appears only in the page's name. For instance, the above-mentioned "Grasping Evil" (http://liberspark.com/novel/grasping-evil) has no chapter numbers in the text of almost any chapter, only chapter names. It would be convenient to have the full file names of the xhtml files, so that the missing data (chapter number, name, or both) can be inserted into the text with the help of the following batch script and regular expressions in Calibre or Sigil.
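    ::==== begin batchscript
    @echo off
    rem For every .xhtml file, prepend the file's base name as a first line.
    for %%a in (*.xhtml) do (
    rem Write the base name (no extension) to a temporary file x.
    >x echo %%~na
    rem Append the original file to x, then replace the original with x.
    copy x+"%%a">nul
    move /y x "%%a"
    )
    ::==== script ends.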

    After all, you must agree, it's aesthetically more pleasing to see full names without the three dots: 00011-_Primor...ang_Locket.xhtml vs 00011-_Primordial_Dream,_Ying_Yang_Locket.xhtml

dteviot commented 4 years ago

@iG8R I'm not willing to modify WebToEpub to do what you've asked, because, in the general case, I can't guarantee the URLs for each chapter can be used. For example, there was one site where the URLs were like

http://hostname/story-xxxx/Vxx/Cxx.html

where xx was a number. So, if you tried to pack multiple volumes, the filenames would conflict. So, I add the numeric prefix to make sure the filenames are unique. Note, for many sites, the prefix is probably not needed. However, the code is much simpler if it just adds the prefix in all cases, rather than trying to figure out if it's needed or not. As regards truncating the last path element in the URL, that's a bit more complicated. The short answer is, I'm reusing a library function that has OS sensitivity.
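The effect of the numeric prefix can be illustrated with a minimal sketch. The helper names lastPathElement and prefixedName, the example URLs, and the five-digit prefix format are assumptions made for illustration (the prefix width is modeled on the "00011-" seen earlier in this thread); this is not WebToEpub's actual code.

```javascript
// Hypothetical helper: last path element of a URL, without its extension.
function lastPathElement(url) {
    const segments = new URL(url).pathname.split("/").filter(s => s !== "");
    const last = segments.length > 0 ? segments[segments.length - 1] : "";
    return last.replace(/\.[^.]+$/, "");
}

// Hypothetical helper: add a zero-padded index so every name is unique.
function prefixedName(index, base) {
    return String(index).padStart(5, "0") + "-" + base + ".xhtml";
}

// Made-up URLs following the Vxx/Cxx pattern described above.
const urls = [
    "http://hostname/story-0001/V01/C01.html",
    "http://hostname/story-0001/V02/C01.html"
];

// Without a prefix both chapters map to the same base name.
const bareNames = urls.map(lastPathElement);

// With the prefix the names differ even though the base is identical.
const prefixedNames = urls.map((u, i) => prefixedName(i, lastPathElement(u)));
console.log(bareNames, prefixedNames);
```

Adding the index unconditionally avoids having to detect collisions first, which matches the simplicity argument made above.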

Anyway, there's a couple of solutions.

  1. You could modify WebToEpub yourself to not truncate the file. The relevant function is "makeStorageFileName" in file util.js. It's called from EpubItem.js.
  2. As I said in my previous post, the full URL is included in content.opf. So, you can use that to go from the "compressed" name in the epub to the original name. Writing a small program to change the names of the files in the epub (it's a zip file) is reasonably easy. https://github.com/dteviot/EpubEditor is a project of mine using html + javascript code to unpack and repack a zip file. It should be easy to modify it to rename files. Alternatively, if you're more comfortable with C#, I've got a shell for packing & unpacking epubs in that. I can send you a copy of the code.
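The lookup from a file inside the epub back to its original URL might be sketched as follows. This is only an illustration: the function name mapHrefToSourceUrl and the sample content.opf fragment are made up, and the sketch assumes the <item> id and the <dc:source> id are identical strings. If the real file uses a different id scheme, the matching step needs adjusting. It uses regular expressions rather than a full XML parser, and does not decode entities such as &amp;amp; in URLs.

```javascript
// Sketch: map each manifest href to the URL in the <dc:source> with the same id.
function mapHrefToSourceUrl(opfText) {
    // Collect id -> URL from the <dc:source> elements.
    const sources = new Map();
    const sourceRe = /<dc:source\b[^>]*\bid="([^"]+)"[^>]*>([^<]*)<\/dc:source>/g;
    for (const m of opfText.matchAll(sourceRe)) {
        sources.set(m[1], m[2].trim());
    }
    // Walk the <item> elements and look up each id (assumed identical).
    const result = new Map();
    for (const m of opfText.matchAll(/<item\b[^>]*>/g)) {
        const id = /\bid="([^"]+)"/.exec(m[0]);
        const href = /\bhref="([^"]+)"/.exec(m[0]);
        if (id && href && sources.has(id[1])) {
            result.set(href[1], sources.get(id[1]));
        }
    }
    return result;
}

// Made-up fragment for illustration; real content.opf files contain more.
const sampleOpf = `
<metadata>
  <dc:source id="chapter0011">http://example.com/read/story/11</dc:source>
</metadata>
<manifest>
  <item href="Text/0011_Chapter.xhtml" id="chapter0011" media-type="application/xhtml+xml"/>
</manifest>`;

const hrefToUrl = mapHrefToSourceUrl(sampleOpf);
console.log(hrefToUrl.get("Text/0011_Chapter.xhtml"));
```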
iG8R commented 4 years ago

@dteviot Sorry, there is some misunderstanding, I didn't ask about keeping either URLs as filenames, or discarding prefixes, I've asked about preserving full filenames for xhtml files in the epub file, e.g. to keep full names in red rectangle without truncating to three dots. How could it make conflicts with other filenames?

GraspingEvil_2019-08-09_111430_2

As I mentioned above, it's more convenient and pleasing to have the following filename:

00011-_Primordial_Dream,Ying_Yang_Locket.xhtml

than this one: 00011-Primor...ang_Locket.xhtml

  1. Thank you! That is what I've asked for:)
  2. Thank you again! I do all my manipulation of epubs with TotalCommander; for me that is more than enough. Sorry, but I can't understand why so much effort is needed (unpacking the epub and writing a program) when it only takes modifying the "makeStorageFileName" function in file util.js. IMHO all other users will also be happy to see easy-on-the-eyes names for xhtml files rather than scrambled ones.
iG8R commented 4 years ago

Small remark: The relevant function is "safeForFileName" in file https://github.com/dteviot/WebToEpub/blob/master/plugin/js/Util.js

    var safeForFileName = function (title) {
        if(title) {
            // Allow only a-z regardless of case and numbers as well as hyphens and underscores; replace spaces with underscores
            title = title.replace(/ /gi, "_").replace(/([^a-z0-9_-]+)/gi, "");
            // There is technically a 255 character limit in windows for file paths. 
            // So we will allow files to have 20 characters and when they go over we split them 
            // we then truncate the middle so that the file name is always different
            return (title.length > 20 ? title.substr(0, 10) + "..." + title.substr(title.length - 10, title.length) : title);
        }
        return "";
    }

If there is a 255 character limit in Windows for file paths, why is 20 set as the length limit? Is it possible to allow all characters in filenames except those that are forbidden in Windows? For instance, regarding characters:

title = title.replace(/([ <>":;\/|?*])/gi, "_");

regarding filenames:

return (title.length > 150 ? title.substr(0, 137) + "..." + title.substr(title.length - 10, title.length) : title);

(IMHO, 150 would be enough for the filename length)

    var safeForFileName = function (title) {
        if(title) {
            // Allow only a-z regardless of case and numbers as well as hyphens and underscores; replace spaces with underscores
            title = title.replace(/ /gi, "_").replace(/([^a-z0-9_-]+)/gi, "");
            // There is technically a 255 character limit in windows for file paths. 
            // So we will allow files to have 150 characters and when they go over we split them 
            // we then truncate the middle so that the file name is always different
            return (title.length > 150 ? title.substr(0, 137) + "..." + title.substr(title.length - 10, title.length) : title);
        }
        return "";
    }

PS. I realized it's not a good idea to allow all characters except those that are forbidden in Windows, so let it be as it was:

title = title.replace(/ /gi, "_").replace(/([^a-z0-9_-]+)/gi, "");
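To compare the 20-character behaviour with the proposed 150-character limit, the quoted function can be parameterised. The following is a minimal sketch; the name safeForFileNameWithLimit and its parameters are hypothetical and not part of WebToEpub. The character filter is the same as in the quoted code, so punctuation such as the comma is still removed.

```javascript
// Hypothetical variant of safeForFileName with configurable lengths.
// maxLength: longest title kept as-is; headLength / tailLength: how many
// characters to keep on each side of "..." when the title is longer.
function safeForFileNameWithLimit(title, maxLength = 20, headLength = 10, tailLength = 10) {
    if (!title) {
        return "";
    }
    // Spaces become underscores; anything outside a-z, 0-9, _ and - is dropped.
    const cleaned = title.replace(/ /g, "_").replace(/[^a-z0-9_-]+/gi, "");
    if (cleaned.length <= maxLength) {
        return cleaned;
    }
    // Truncate the middle so the start and end of the name stay recognisable.
    return cleaned.slice(0, headLength) + "..." + cleaned.slice(-tailLength);
}

const chapterTitle = "Primordial Dream, Ying Yang Locket";
const shortName = safeForFileNameWithLimit(chapterTitle);              // limit of 20
const longName = safeForFileNameWithLimit(chapterTitle, 150, 137, 10); // proposed limit
console.log(shortName, longName);
```

With the defaults, a truncated name is 23 characters long (10 + 3 + 10), so the limit of 20 is a threshold for truncation rather than a hard maximum on the result.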
dteviot commented 4 years ago

@iG8R

If there is a 255 character limit in Windows for file paths, why is 20 set as the length limit?

The total length, including directories, is 255 characters. Length of 20 was somewhat arbitrary. Note, this function was originally used for setting the initial name of the epub file. It then got re-used to "massage" the names of the xhtml files in the epub.

Sorry, there is some misunderstanding, I didn't ask about keeping either URLs as filenames, or discarding prefixes, I've asked about preserving full filenames for xhtml files in the epub file, e.g. to keep full names in red rectangle without truncating to three dots. How could it make conflicts with other filenames?

The red box ISN'T where WebToEpub gets the file name from. It comes from the URL, the bit you've crossed out. Specifically, WebToEpub takes the last portion of the path. e.g. if the URL looked like http://hostname/story-11/Volume-1/Chapter-1.html, then WebToEpub will use "Chapter-1" as the start point for building the xhtml filename. The conflict comes from some sites that have multiple volumes, so you'll see URLs like

http://hostname/story-11/Volume-1/Chapter-1.html
http://hostname/story-11/Volume-1/Chapter-2.html
http://hostname/story-11/Volume-2/Chapter-1.html

i.e. There's two URLs which end with Chapter-1.html, so these would conflict. Note, the chapter title (the bit you put in the red box) is used by WebToEpub in the Table of Contents.

So, I think what you're asking is something like: "How to put the relevant title from the table of contents into the start of each chapter's XHTML file."

I'd suggest writing a Calibre extension to do that. See: https://manual.calibre-ebook.com/polish.html#module-calibre.ebooks.oeb.polish.toc

Aside, I don't know enough about Calibre scripting to fully understand what the script you provided is doing. I think it's trying to insert the name of the XHTML file into the start of each of the XHTML files?

iG8R commented 4 years ago

The red box ISN'T where WebToEpub gets the file name from. It comes from the URL, the bit you've crossed out. Specifically, WebToEpub takes the last portion of the path. e.g. if the URL looked like http://hostname/story-11/Volume-1/Chapter-1.html, then WebToEpub will use "Chapter-1" as the start point for building the xhtml filename. The conflict comes from some sites that have multiple volumes, so you'll see URLs like

http://hostname/story-11/Volume-1/Chapter-1.html
http://hostname/story-11/Volume-1/Chapter-2.html
http://hostname/story-11/Volume-2/Chapter-1.html

i.e. There's two URLs which end with Chapter-1.html, so these would conflict. Note, the chapter title (the bit you put in the red box) is used by WebToEpub in the Table of Contents.

Hm... It's somewhat weird. PS. I've already modified "safeForFileName" hence xhtml files now are with full filenames.

What epub file has inside itself:

GraspingEvil_2019-08-12_153128

What http://liberspark.com/novel/grasping-evil has in its table of contents:

GraspingEvil_2019-08-12_153327

URLs for chapters:

http://liberspark.com/read/grasping-evil/31
http://liberspark.com/read/grasping-evil/32
http://liberspark.com/read/grasping-evil/chapter-33
http://liberspark.com/read/grasping-evil/chapter-34
http://liberspark.com/read/grasping-evil/chapter-35

Titles for relevant chapters:

Grasping Evil - 31 | LiberSpark
Grasping Evil - 32 | LiberSpark
Grasping Evil - Chapter 33 | LiberSpark
Grasping Evil - Chapter 34 | LiberSpark
Grasping Evil - Chapter 35 | LiberSpark

So, how could it be the following?

The red box ISN'T where WebToEpub gets the file name from. It comes from the URL.

According to what I wrote above, WebToEpub gets the file names for the xhtml files exactly from the "red box", i.e. from the table of contents.

Regarding the batch script: it's an ordinary CMD batch file (*.bat) and yes, it inserts the name of each xhtml file at the start of the corresponding xhtml file.

Why I need to do all of this: as I mentioned in the beginning, some novels have chapters whose pages show no chapter number, no chapter name, or neither; they are mentioned only in the table of contents. As the xhtml files get their names from the table of contents, I use those names to insert the chapter number and name at the very beginning of the corresponding xhtml file (even before <?xml version='1.0' encoding='utf-8'?>), and finally I use regular expressions to move them into the places in the page where they are missing.

dteviot commented 4 years ago

@iG8R D'oh! You are correct. For images, the filename in the zip is taken from the URL. https://github.com/dteviot/WebToEpub/blob/19372c7209691d851a0b5c3da1a33d4820b8a9a1/plugin/js/EpubItem.js#L169-L173

But for chapters, the filename uses the chapter title. https://github.com/dteviot/WebToEpub/blob/19372c7209691d851a0b5c3da1a33d4820b8a9a1/plugin/js/EpubItem.js#L25-L28

Checking the spec for EPUB, http://www.idpf.org/doc_library/epub/OCF_2.0.1_draft.doc, section 3.3

File Names MUST NOT exceed 255 bytes

(It also allows UTF-8, but recommends restricting to the ASCII subset in names. And there's a bunch of forbidden characters.)
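Because the limit quoted above is expressed in bytes rather than characters, a non-ASCII name can exceed it with fewer than 255 characters. A minimal sketch of such a check, assuming names are encoded as UTF-8; the function names utf8ByteLength and fitsOcfLimit are hypothetical:

```javascript
// Number of bytes the name occupies when encoded as UTF-8.
function utf8ByteLength(name) {
    return new TextEncoder().encode(name).length;
}

// True when the name stays within the byte limit (255 by default).
function fitsOcfLimit(name, maxBytes = 255) {
    return utf8ByteLength(name) <= maxBytes;
}

console.log(fitsOcfLimit("00011-Primordial_Dream_Ying_Yang_Locket.xhtml")); // ASCII: 1 byte per character
console.log(utf8ByteLength("é")); // 2 bytes in UTF-8
```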

As I mentioned before, the safeForFileName() function was for writing a file to the host OS. It was reused to handle issues with "legal characters in a ZIP filename". But in THAT context, the length restriction is unnecessary.
So, as you've suggested, I'll adjust the length restriction for the filenames inside the zip. I'll probably do it this weekend.

iG8R commented 4 years ago

Thank you!:) Is it possible to add an option to insert the chapter title at the beginning of the xhtml files when it's needed?

dteviot commented 4 years ago

@iG8R No. Doing that results in a file that is not valid XHTML, and 2 of the epub readers I have refuse to display it.
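Text placed before the XML declaration makes the file malformed, since the declaration has to come first in the document. For post-processing outside WebToEpub, one possible alternative is to put the title in a heading element just inside <body>. This is a sketch under assumptions, not something WebToEpub does: the function names are hypothetical, the choice of <h1> is arbitrary, and a regular expression is used instead of a real XML parser.

```javascript
// Escape characters that have special meaning in XML text content.
function escapeXml(text) {
    return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical helper: insert <h1>title</h1> right after the opening <body> tag.
// Returns the input unchanged when no <body> tag is found.
function insertChapterHeading(xhtml, title) {
    const bodyOpen = /<body\b[^>]*>/i;
    if (!bodyOpen.test(xhtml)) {
        return xhtml;
    }
    // A replacer function is used so "$" in the title is not treated specially.
    return xhtml.replace(bodyOpen, match => match + "\n<h1>" + escapeXml(title) + "</h1>");
}

const chapterXhtml = "<?xml version='1.0' encoding='utf-8'?>\n" +
    "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title></title></head>" +
    "<body><p>Chapter text.</p></body></html>";

const withHeading = insertChapterHeading(chapterXhtml, "31 - Primordial Dream & Locket");
console.log(withHeading);
```

The XML declaration stays on the first line, so the result remains parseable as long as the input was.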

dteviot commented 4 years ago

@iG8R New version pushed to Chrome and Mozilla stores. Your version should automatically update within 24 hours.