borjimur / epubcheck

Automatically exported from code.google.com/p/epubcheck

Case-sensitive check for dupes in ZIP #284

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago

PROBLEM:

EpubCheck reports a "Duplicate entry in the ZIP file" error when an EPUB 
contains files whose names differ only in case.

METHOD:

1. Create an EPUB file (or use the attached file) that contains the files "Contents.html" 
and "contents.html".
2. Run EpubCheck on it.
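
For readers without the attachment, a comparable archive could be produced with a short script. This is only a sketch: the `OPS/` directory and the `.html` names follow the report, the function name `build_case_test_zip` is made up for illustration, and the result is a bare ZIP with a `mimetype` entry rather than a complete, valid EPUB (it has no container or package files).

```python
import io
import zipfile


def build_case_test_zip() -> bytes:
    """Return the bytes of a ZIP whose entries include two names differing only in case."""
    buf = io.BytesIO()
    # ZipFile defaults to ZIP_STORED, so the "mimetype" entry is written uncompressed.
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("OPS/Contents.html", "<html><body>upper</body></html>")
        zf.writestr("OPS/contents.html", "<html><body>lower</body></html>")
    return buf.getvalue()


data = build_case_test_zip()
names = zipfile.ZipFile(io.BytesIO(data)).namelist()
print(names)
```

Writing `data` to a file such as `tmp.epub` gives an archive whose ZIP directory holds both spellings, which is the condition the report describes.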

RESULT:

ERROR: tmp.epub: Duplicate entry in the ZIP file: OPS/contents.xhtml

EXPECTED:

No "Duplicate entry..." error when file names differ only in case.

PLATFORM:

Epubcheck 3.0.1 on Linux x86_64, 
java version "1.7.0_21"
Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)

ADDITIONAL INFO:

http://www.idpf.org/epub/30/spec/epub30-ocf.html#sec-container-filenames

Original issue reported on code.google.com by pm2...@gmail.com on 18 Jun 2013 at 1:08

Attachments:

case_check.epub

GoogleCodeExporter commented 8 years ago

But as noted in the referenced section:

All File Names within the same directory must be unique following case 
normalization as described in section 3.13 of [Unicode].

Original comment by mgarrish on 25 Jun 2013 at 1:00

GoogleCodeExporter commented 8 years ago

Thanks for the explanation. For some reason I overlooked this sentence :( 

The issue can be closed as "invalid".

A few notes (maybe I missed something, because I haven't read the full Unicode 
standard). 
I looked for "case normalization" in section 3.13 
(http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G33992) and didn't 
find a description of this term, but section 3.13 does contain a description of 
"Default Caseless Matching". 
Most of the time in the Unicode standard, "normalization" means applying one of the 
Normalization Forms (http://www.unicode.org/reports/tr15/), which don't change 
case. But there is a "case folding" process (used in 3.13 "Default Caseless 
Matching" and described in 5.18, 
http://www.unicode.org/versions/Unicode5.0.0/ch05.pdf#G21180) that "map strings 
to canonical form where case differences are erased".
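
The distinction between the Normalization Forms and case folding can be seen in a few lines of Python. This is an illustrative sketch, not anything taken from EpubCheck: it relies on the standard `unicodedata.normalize` and `str.casefold`, and the `caf\u00e9.html` file names are invented examples.

```python
import unicodedata

name_a = "Contents.html"
name_b = "contents.html"

# A Normalization Form (NFC here) leaves case untouched, so the names stay distinct.
nfc_equal = unicodedata.normalize("NFC", name_a) == unicodedata.normalize("NFC", name_b)

# Case folding erases the case difference, so the names compare equal.
folded_equal = name_a.casefold() == name_b.casefold()

# The reverse situation: the same letter in composed and decomposed form.
composed = "caf\u00e9.html"     # e-acute as one code point
decomposed = "cafe\u0301.html"  # "e" followed by a combining acute accent

# Case folding alone does not unify the two spellings; normalization does.
folded_only_equal = composed.casefold() == decomposed.casefold()
normalized_equal = (
    unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
)

print(nfc_equal, folded_equal, folded_only_equal, normalized_equal)
```

So the two operations are independent: one handles case, the other handles composed versus decomposed sequences, which is presumably why caseless matching is described as combining them.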

Original comment by pm2...@gmail.com on 25 Jun 2013 at 5:56

GoogleCodeExporter commented 8 years ago

Agreed, the wording of that requirement could be improved, but I believe it's 
referring to the normalization step referenced in "Default Caseless Matching":

"Caseless matching should also use normalization, which means using one of the 
following operations:"

Original comment by mgarrish on 25 Jun 2013 at 12:25

GoogleCodeExporter commented 8 years ago

Right. That's how I interpreted the spec: file names must be unique when 
compared with a caseless matching algorithm. If they are not, EpubCheck will raise an 
error.

EpubCheck will also check Unicode-normalized names and only raise a warning 
when duplicate names are found.
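
The two-level behaviour described above might look roughly like the following. This is a hedged sketch and not EpubCheck's actual code (EpubCheck is written in Java): the function name `find_duplicates`, the use of NFC, and the order of the two checks are all assumptions made for illustration. It also compares full paths rather than names within one directory, which is a simplification.

```python
import unicodedata


def find_duplicates(names):
    """Return (errors, warnings) for entry names that collide with an earlier entry.

    Assumed policy: a caseless collision is an error; a collision that only
    appears after Unicode normalization (NFC, by assumption) is a warning.
    """
    seen_folded = set()
    seen_normalized = set()
    errors, warnings = [], []
    for name in names:
        folded = name.casefold()
        normalized = unicodedata.normalize("NFC", folded)
        if folded in seen_folded:
            errors.append(name)
        elif normalized in seen_normalized:
            warnings.append(name)
        seen_folded.add(folded)
        seen_normalized.add(normalized)
    return errors, warnings


entries = [
    "OPS/Contents.html",
    "OPS/contents.html",    # differs from the first entry only in case
    "OPS/caf\u00e9.html",
    "OPS/cafe\u0301.html",  # same name as the previous entry, but decomposed
]
errors, warnings = find_duplicates(entries)
print(errors, warnings)
```

With these hypothetical entries, the second name lands in `errors` and the fourth in `warnings`.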

I'm closing the issue as invalid. Feel free to re-open if you disagree.

Original comment by rdeltour@gmail.com on 25 Jun 2013 at 1:39

Changed state: Invalid