Open BradKML opened 2 years ago
Very good question. What part of the structure are you referring to? When I was importing content from EPUB and DOCX into our editing system, I had to do some analysis first. This was mostly needed because not many people used proper styles in Word, and the EPUB was usually the result of a conversion with third-party tools that produced strange output. Otherwise I would have done a proper import, knowing that h tags were used for headers and blockquote for quotes, for instance.
So after import I would clean up everything I didn't need and try to figure out whether they had used span or div with CSS to create headers and so on. For DOCX I would analyse the font size and similar properties, but for EPUB I would do something simple: if there was one short line followed by a lot of p elements or blocks of text, I would assume it was a title. If there were a lot of short blocks of text one after the other, I would assume these were not titles. It never worked perfectly, but the result was imported into the editing system, where you could always change and reformat it, so it was more than good enough for me.
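As a rough illustration of the "short line followed by long blocks" idea, here is a minimal Python sketch. The function name `guess_titles` and the thresholds (`short_len`, `long_len`, `min_body`) are made up for this example and would need tuning on real books; it is not the code used in the editing system described above.

```python
def guess_titles(blocks, short_len=60, long_len=200, min_body=2):
    """Return the indices of text blocks that look like titles.

    A block is flagged when it is short and is immediately followed by
    at least `min_body` long blocks. A short block followed by other
    short blocks is not flagged. All thresholds are assumptions.
    """
    titles = []
    for i, raw in enumerate(blocks):
        text = raw.strip()
        if not text or len(text) > short_len:
            continue  # empty or too long to be a title
        following = [b.strip() for b in blocks[i + 1 : i + 1 + min_body]]
        if len(following) == min_body and all(len(b) >= long_len for b in following):
            titles.append(i)
    return titles
```

Here `blocks` is assumed to be the list of paragraph-level text strings of one document, in order, with markup already stripped.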
What I would do is look at how, for instance, the web scraper for the instapaper.com service, Rocket Readability, or Readability in Safari works. I am sure there are Python projects that try to do that as well. They seem to do a fairly good job of cleaning up the garbage in the page and presenting only the proper content without custom CSS. I guess that would be my starting point.
There are libraries out there that do KPE, but right now I want to find a way to get a list of chapters from the EPUB so I can pipe them into KPE algorithms. Don't EPUBs store individual chapters separately? If it is a <p>, wouldn't it be a paragraph rather than a chapter?
I know this is rather old, but to get the locations of the chapters, you want to parse the OPF file. If there is still any interest in this, I can share my code for getting the URLs of the chapters. It has to do a few different conditional checks, because different EPUB creators use different naming schemes for the HTML files.
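For anyone who wants to try the OPF route, here is a minimal sketch using only the Python standard library. It relies on the standard EPUB layout: `META-INF/container.xml` points to the OPF package file, whose manifest maps ids to hrefs and whose spine lists those ids in reading order. The function name `spine_documents` is made up for this example, and spine items are content documents, which are not guaranteed to map one-to-one onto chapters.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub_file):
    """Return archive paths of the spine items, in reading order."""
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml names the OPF package file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).attrib["full-path"]
        package = ET.fromstring(zf.read(opf_path))

    # The manifest maps item ids to hrefs relative to the OPF file.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    base = posixpath.dirname(opf_path)

    paths = []
    for ref in package.findall("opf:spine/opf:itemref", NS):
        href = manifest.get(ref.attrib["idref"])
        if href is not None:
            # hrefs may be percent-encoded; archive paths are not.
            paths.append(posixpath.normpath(posixpath.join(base, unquote(href))))
    return paths
```

Calling `spine_documents("book.epub")` (the file name is a placeholder) gives the paths you can then read out of the zip archive, regardless of how the HTML files happen to be named.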
I'm interested.
I wound up abandoning that because, while I covered all of the major EPUB creation software, I was still finding weirdly bespoke solutions. I found one book from a Big 4 publisher where all of the HTML file names started with the author's name followed by a random hash. So I've taken to using Beautiful Soup to parse the HTML files in the EPUB after using ebooklib to extract them, and then looking for things that indicate the presence of a chapter title, or things that indicate that it isn't a chapter (like a copyright symbol).
You can find the code for it here: https://github.com/ashrobertsdragon/Ebook-conversion-to-Text-for-Machine-Learning/blob/main/ebook_conversion/epub_conversion.py
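Separately from the linked code, here is a small sketch of what a "chapter or not" check along these lines might look like. To stay dependency-free it swaps Beautiful Soup for the standard-library `html.parser`, and it assumes the HTML of each document has already been extracted. The marker lists, the regex, and the names `classify_document` and `_TextBlocks` are assumptions for illustration, not what the linked repository does.

```python
import re
from html.parser import HTMLParser

# Assumed signals; real books will need a longer, tuned list.
CHAPTER_RE = re.compile(r"^\s*(chapter|part|prologue|epilogue)\b", re.IGNORECASE)
NOT_CHAPTER_MARKERS = ("©", "copyright", "all rights reserved", "isbn")


class _TextBlocks(HTMLParser):
    """Collect the text of h1-h3 and p elements as (tag, text) pairs."""

    BLOCK_TAGS = {"h1", "h2", "h3", "p"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._tag = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS and self._tag is None:
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = " ".join("".join(self._buf).split())
            if text:
                self.blocks.append((tag, text))
            self._tag = None


def classify_document(html):
    """Return ("chapter", title), ("not_chapter", None) or ("unknown", None)."""
    parser = _TextBlocks()
    parser.feed(html)
    parser.close()
    # Any front-matter marker rules the document out as a chapter.
    for _, text in parser.blocks:
        lowered = text.lower()
        if any(marker in lowered for marker in NOT_CHAPTER_MARKERS):
            return ("not_chapter", None)
    # Otherwise the first heading or "Chapter ..." line is taken as the title.
    for tag, text in parser.blocks:
        if tag.startswith("h") or CHAPTER_RE.match(text):
            return ("chapter", text)
    return ("unknown", None)
```

One known weakness of this sketch: a real chapter whose body happens to mention "copyright" or "ISBN" would be rejected, so in practice you would probably only inspect the first few blocks of each document.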
Currently, I am trying to use keyword extractors to extract chapters and paragraphs to create a reading aid, but EPUB is particularly tricky in structure. What can be done?