BasioMeusPuga / Lector

Qt based ebook reader
GNU General Public License v3.0

Better toc generation for epub file #80

Closed bennyyip closed 5 years ago

bennyyip commented 5 years ago

I have an epub file (some novels in Chinese, can be downloaded from here) with a nested toc structure that Lector cannot parse correctly. Lector only shows me a toc of numbers, but the actual file structure and toc look like this:

λ tree | rg -v "(html)|(jpe?g)|css|epub"
.
├── 1
│   ├── content.opf
│   ├── images
│   ├── text
│   └── toc.ncx
├── 2
│   ├── content.opf
│   ├── text
│   └── toc.ncx
....

├── 6
│   ├── content.opf
│   └── toc.ncx
├── content.opf
├── META-INF
│   ├── calibre_bookmarks.txt
│   └── container.xml
├── mimetype
├── toc.ncx

image

I noticed that the code here (https://github.com/BasioMeusPuga/Lector/blob/master/lector/readers/read_epub.py#L213) picks the wrong toc file: it gets 1/toc.ncx, but it should get the toc.ncx in the root.
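One way to pick the intended toc.ncx rather than the first one found in the archive is to follow the chain the EPUB 2 packaging format defines: META-INF/container.xml names the package document, and that document's `<spine toc="...">` attribute names the manifest item holding the NCX. The sketch below assumes a well-formed EPUB 2 file; `find_ncx_path` is a hypothetical helper name for illustration, not a function from Lector.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def find_ncx_path(epub_path):
    """Return the archive path of the NCX named by the package spine, or None.

    Hypothetical helper: assumes an EPUB 2 layout with a single rootfile.
    """
    with zipfile.ZipFile(epub_path) as archive:
        # container.xml points at the package document (e.g. content.opf).
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            return None
        opf_path = rootfile.get("full-path")

        # The spine's "toc" attribute holds the manifest id of the NCX.
        package = ET.fromstring(archive.read(opf_path))
        spine = package.find("opf:spine", OPF_NS)
        toc_id = spine.get("toc") if spine is not None else None
        if toc_id is None:
            return None

        for item in package.iterfind("opf:manifest/opf:item", OPF_NS):
            if item.get("id") == toc_id:
                # Manifest hrefs are relative to the directory of the OPF.
                opf_dir = posixpath.dirname(opf_path)
                return posixpath.normpath(
                    posixpath.join(opf_dir, item.get("href"))
                )
    return None
```

For the file tree above, where container.xml presumably points at the root content.opf, this would resolve to the root toc.ncx and ignore the copies under 1/, 2/ and so on.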

There is an actively maintained epub parser library (https://github.com/aerkalov/ebooklib) that could handle the toc correctly. Maybe we can replace the simple parser in readers/read_epub with ebooklib.

BasioMeusPuga commented 5 years ago

I'll have a look at the files you mentioned. I'll probably have to redo the way toc.ncx files are looked at. As far as nested structures go, there's very little I can change about how they're displayed within the program. I can probably put in some additional formatting within the drop down for a sub-chapter.
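As a rough illustration of the sub-chapter formatting idea, nested `navPoint` elements in an NCX can be flattened into a single list that remembers each entry's depth, and the depth can then drive indentation in a flat drop down. This is only a sketch: `flatten_nav_points` and `format_for_dropdown` are hypothetical names, and the four-space indent is an arbitrary choice, not how Lector does it.

```python
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def flatten_nav_points(ncx_xml):
    """Return (depth, title, src) tuples for every navPoint, in document order."""
    root = ET.fromstring(ncx_xml)
    entries = []

    def walk(parent, depth):
        # Only direct children, so nesting is reflected in the depth value.
        for point in parent.findall("ncx:navPoint", NCX_NS):
            title = point.findtext(
                "ncx:navLabel/ncx:text", default="", namespaces=NCX_NS
            )
            content = point.find("ncx:content", NCX_NS)
            src = content.get("src", "") if content is not None else ""
            entries.append((depth, title.strip(), src))
            walk(point, depth + 1)

    nav_map = root.find("ncx:navMap", NCX_NS)
    if nav_map is not None:
        walk(nav_map, 0)
    return entries


def format_for_dropdown(entries, indent="    "):
    """Prefix each title with one indent unit per nesting level."""
    return [indent * depth + title for depth, title, _ in entries]
```

Keeping the depth alongside each entry leaves the display decision to the caller, so the same flattened list could feed an indented drop down or a tree view.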

Speaking of that library, I actually used it in the initial days of writing this. I'm afraid it has way too many inconsistencies. The amount of code I had to write just to make the library work with a fraction of the epub structure variants exceeded the whole parser I have currently.

bennyyip commented 5 years ago

Thank you for the timely response. I am looking forward to this feature. I've tried many epub readers on Linux and eventually stayed with Lector. Keep up the good work!

BasioMeusPuga commented 5 years ago

The current release / git should do a much better job of parsing more complex structures. I still don't look at multiple toc.ncx files, but singular ones work.