Closed iG8R closed 4 years ago
@iG8R
@dteviot
::==== begin batchscript
@echo off
for %%a in (*.xhtml) do (
>x echo %%~na
copy x+"%%a">nul
move /y x "%%a"
)
::==== script ends.
After all, you'll agree it's more aesthetically pleasing to see full names without the three dots: 00011-_Primor...ang_Locket.xhtml vs 00011-_Primordial_Dream,_Ying_Yang_Locket.xhtml
@iG8R I'm not willing to modify WebToEpub to do what you've asked because, in the general case, I can't guarantee the URLs for each chapter can be used. For example, there was one site where the URLs were like
where xx was a number. So, if you tried to pack multiple volumes, the filenames would conflict. So, I add the numeric prefix to make sure the filenames are unique. Note, for many sites, the prefix is probably not needed. However, the code is much simpler if it just adds the prefix in all cases, rather than trying to figure out if it's needed or not. As regards truncating the last path element in the URL, that's a bit more complicated. The short answer is, I'm reusing a library function that has OS sensitivity.
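The prefixing idea described above can be sketched roughly as follows. This is only an illustration: `withNumericPrefix`, the four-digit width, and the `_` separator are assumptions made for the example, not WebToEpub's actual code.

```javascript
// Hypothetical helper (not taken from WebToEpub): makes names unique by
// prepending a zero-padded position index to every base name.
function withNumericPrefix(baseNames, width = 4) {
    return baseNames.map((name, index) =>
        String(index).padStart(width, "0") + "_" + name + ".xhtml");
}

// Two chapters share the base name "Chapter-1", as in the multi-volume case.
const names = withNumericPrefix(["Chapter-1", "Chapter-2", "Chapter-1"]);
console.log(names);

// The prefix is added unconditionally, so no duplicate detection is needed.
console.log(new Set(names).size === names.length);
```

Adding the prefix in every case keeps the logic to a single `map`, which matches the "simpler if it just adds the prefix in all cases" reasoning.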
Anyway, there's a couple of solutions.
@dteviot Sorry, there is some misunderstanding. I didn't ask about keeping URLs as filenames or about discarding prefixes. I asked about preserving full filenames for xhtml files in the epub file, e.g. keeping the full names shown in the red rectangle without truncating them with three dots. How could that conflict with other filenames?
As I mentioned above, it's more convenient and pleasing to have the following filename:
00011-_Primordial_Dream,Ying_Yang_Locket.xhtml
than this one: 00011-Primor...ang_Locket.xhtml
Small remark: The relevant function is "safeForFileName" in file https://github.com/dteviot/WebToEpub/blob/master/plugin/js/Util.js
var safeForFileName = function (title) {
    if(title) {
        // Allow only a-z regardless of case and numbers as well as hyphens and underscores; replace spaces with underscores
        title = title.replace(/ /gi, "_").replace(/([^a-z0-9_-]+)/gi, "");
        // There is technically a 255 character limit in windows for file paths.
        // So we will allow files to have 20 characters and when they go over we split them
        // we then truncate the middle so that the file name is always different
        return (title.length > 20 ? title.substr(0, 10) + "..." + title.substr(title.length - 10, title.length) : title);
    }
    return "";
}
If there is a 255 character limit in Windows for file paths, why is 20 set as the length limit? Is it possible to allow all characters in filenames except those that are forbidden in Windows? For instance, regarding characters:
title = title.replace(/([ <>":;\/|?*])/gi, "_");
regarding filenames:
return (title.length > 150 ? title.substr(0, 137) + "..." + title.substr(title.length - 10, title.length) : title);
(IMHO, 150 would be enough for the filename length)
var safeForFileName = function (title) {
    if(title) {
        // Allow only a-z regardless of case and numbers as well as hyphens and underscores; replace spaces with underscores
        title = title.replace(/ /gi, "_").replace(/([^a-z0-9_-]+)/gi, "");
        // There is technically a 255 character limit in windows for file paths.
        // So we will allow files to have 150 characters and when they go over we split them
        // we then truncate the middle so that the file name is always different
        return (title.length > 150 ? title.substr(0, 137) + "..." + title.substr(title.length - 10, title.length) : title);
    }
    return "";
}
PS. I realized it's not a good idea to allow all characters except those that are forbidden in Windows, so let it be as it was:
title = title.replace(/ /gi, "_").replace(/([^a-z0-9_-]+)/gi, "");
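To compare the two limits side by side, here is a minimal sketch that keeps the same character filter but takes the limit and the head/tail lengths as arguments. The name `safeForFileNameWithLimit` and its parameters are hypothetical and exist only for this illustration; the defaults mirror the original 20/10/10 values.

```javascript
// Sketch only: same filtering as safeForFileName, but the limit and the
// kept head/tail lengths are (hypothetical) parameters.
function safeForFileNameWithLimit(title, maxLength = 20, headLength = 10, tailLength = 10) {
    if (!title) {
        return "";
    }
    // Spaces become underscores; anything outside a-z, 0-9, "_" and "-" is dropped.
    const cleaned = title.replace(/ /g, "_").replace(/[^a-z0-9_-]+/gi, "");
    if (cleaned.length <= maxLength) {
        return cleaned;
    }
    // Keep the start and the end, and mark the removed middle with "...".
    return cleaned.slice(0, headLength) + "..." + cleaned.slice(cleaned.length - tailLength);
}

const title = "Primordial Dream, Ying Yang Locket";
console.log(safeForFileNameWithLimit(title));               // Primordial...ang_Locket
console.log(safeForFileNameWithLimit(title, 150, 137, 10)); // Primordial_Dream_Ying_Yang_Locket
```

Note that the comma disappears in both cases because of the character filter, independently of the length limit.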
@iG8R
If there is a 255 character limit in Windows for file paths, why is 20 set as the length limit?
The total length, including directories, is 255 characters. Length of 20 was somewhat arbitrary. Note, this function was originally used for setting the initial name of the epub file. It then got re-used to "massage" the names of the xhtml files in the epub.
Sorry, there is some misunderstanding. I didn't ask about keeping URLs as filenames or about discarding prefixes. I asked about preserving full filenames for xhtml files in the epub file, e.g. keeping the full names shown in the red rectangle without truncating them with three dots. How could that conflict with other filenames?
The red box ISN'T where WebToEpub gets the file name from. It comes from the URL, the bit you've crossed out. Specifically, WebToEpub takes the last portion of the path, e.g. if the URL looked like http://hostname/story-11/Volume-1/Chapter-1.html, then WebToEpub will use "Chapter-1" as the starting point for building the xhtml filename. The conflict comes from some sites that have multiple volumes, so you'll see URLs like
i.e. there are two URLs that end with Chapter-1.html, so these would conflict. Note, the chapter title (the bit you put in the red box) is used by WebToEpub in the Table of Contents.
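The collision described here can be reproduced with a small sketch. `lastPathSegment` is a hypothetical helper written for this example, not a WebToEpub function; it assumes the name is derived by taking the final path segment and dropping the extension.

```javascript
// Hypothetical helper: returns the last path segment of a URL without its extension.
function lastPathSegment(url) {
    const segments = new URL(url).pathname.split("/").filter(s => s !== "");
    const last = segments.length > 0 ? segments[segments.length - 1] : "";
    return last.replace(/\.[^.]+$/, "");
}

const urls = [
    "http://hostname/story-11/Volume-1/Chapter-1.html",
    "http://hostname/story-11/Volume-1/Chapter-2.html",
    "http://hostname/story-11/Volume-2/Chapter-1.html"
];

// The first and third URLs both reduce to "Chapter-1".
console.log(urls.map(lastPathSegment));
```

Because the volume is part of the directory and not of the final segment, a name built only from that segment cannot tell the two chapters apart.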
So, I think what you're asking is something like: "How to put the relevant title from the table of contents into the start of each chapter's XHTML file."
I'd suggest writing a Calibre extension to do that. See: https://manual.calibre-ebook.com/polish.html#module-calibre.ebooks.oeb.polish.toc
Aside, I don't know enough about Calibre scripting to fully understand what the script you provided is doing. I think it's trying to insert the name of the XHTML file into the start of each of the XHTML files?
The red box ISN'T where WebToEpub gets the file name from. It comes from the URL, the bit you've crossed out. Specifically, WebToEpub takes the last portion of the path, e.g. if the URL looked like http://hostname/story-11/Volume-1/Chapter-1.html, then WebToEpub will use "Chapter-1" as the starting point for building the xhtml filename. The conflict comes from some sites that have multiple volumes, so you'll see URLs like
http://hostname/story-11/Volume-1/Chapter-1.html
http://hostname/story-11/Volume-1/Chapter-2.html
http://hostname/story-11/Volume-2/Chapter-1.html
i.e. there are two URLs that end with Chapter-1.html, so these would conflict. Note, the chapter title (the bit you put in the red box) is used by WebToEpub in the Table of Contents.
Hm... That's somewhat weird. PS. I've already modified "safeForFileName", so the xhtml files now have full filenames.
What the epub file contains:
What http://liberspark.com/novel/grasping-evil has in its table of contents:
URLs for chapters:
http://liberspark.com/read/grasping-evil/31
http://liberspark.com/read/grasping-evil/32
http://liberspark.com/read/grasping-evil/chapter-33
http://liberspark.com/read/grasping-evil/chapter-34
http://liberspark.com/read/grasping-evil/chapter-35
Titles for relevant chapters:
Grasping Evil - 31 | LiberSpark
Grasping Evil - 32 | LiberSpark
Grasping Evil - Chapter 33 | LiberSpark
Grasping Evil - Chapter 34 | LiberSpark
Grasping Evil - Chapter 35 | LiberSpark
So, how could it be the following?
The red box ISN'T where WebToEpub gets the file name from. It comes from the URL.
According to what I wrote above, WebToEpub gets the file names for the xhtml files exactly from the "red box", i.e. from the table of contents.
Regarding the batch script: it's an ordinary CMD batch file (*.bat), and yes, it inserts the name of the xhtml file at the start of the corresponding xhtml file.
Why do I need all of this? As I mentioned in the beginning, some novels have chapters whose pages show neither the chapter number nor the chapter name; these appear only in the table of contents. Since the xhtml files get their names from the table of contents, I use the filenames to insert the chapter number and name at the very beginning of the corresponding xhtml file (even before <?xml version='1.0' encoding='utf-8'?>), and then I use regular expressions to move them into the places in the page where they are missing.
@iG8R D'oh! You are correct. For images, the filename in the zip is taken from the URL. https://github.com/dteviot/WebToEpub/blob/19372c7209691d851a0b5c3da1a33d4820b8a9a1/plugin/js/EpubItem.js#L169-L173
But for chapters, the filename uses the chapter title. https://github.com/dteviot/WebToEpub/blob/19372c7209691d851a0b5c3da1a33d4820b8a9a1/plugin/js/EpubItem.js#L25-L28
Checking the spec for EPUB, http://www.idpf.org/doc_library/epub/OCF_2.0.1_draft.doc, section 3.3
File Names MUST NOT exceed 255 bytes
(It also allows UTF-8, but recommends restricting to the ASCII subset in names. And there's a bunch of forbidden characters.)
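Since the limit is stated in bytes rather than characters, a length check has to measure the encoded name. A minimal sketch, assuming the name is encoded as UTF-8; `fitsOcfNameLimit` is a hypothetical helper, not part of WebToEpub.

```javascript
// Hypothetical check: true when the name fits within maxBytes once encoded as UTF-8.
function fitsOcfNameLimit(fileName, maxBytes = 255) {
    return new TextEncoder().encode(fileName).length <= maxBytes;
}

// ASCII: one byte per character, so 250 + ".xhtml" is 256 bytes.
console.log(fitsOcfNameLimit("a".repeat(250) + ".xhtml"));

// Non-ASCII: U+00E9 takes two bytes in UTF-8, so 128 characters are 256 bytes.
console.log(fitsOcfNameLimit("\u00e9".repeat(128)));
```

This is one reason why restricting names to the ASCII subset is convenient: character count and byte count then coincide.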
As I mentioned before, the safeForFileName() function was for writing files to the host OS. It was reused to handle issues with "legal characters in a ZIP filename". But in THAT context, the length restriction is unnecessary.
So, as you've suggested, I'll adjust the length restriction for the filenames inside the zip.
I'll probably do it this weekend.
Thank you! :) Is it possible to add an option to insert the chapter title at the beginning of the xhtml files when it is needed?
@iG8R No. Doing that results in a file that is not valid XHTML, and two of the EPUB readers I have refuse to display it.
@iG8R New version pushed to Chrome and Mozilla stores. Your version should automatically update within 24 hours.
Hi. Is it possible to preserve full file names for xhtml files in the epub file? What I mean:
While parsing, the chapters' names are:
After composing the epub file, the names of the xhtml files are: