KaniyamFoundation / ProjectIdeas

A Place to write down the project ideas and to plan them

Convert epub files to static HTML sites #70

Open tshrinivasan opened 5 years ago

tshrinivasan commented 5 years ago

We would like to host HTML versions of all the FreeTamilEbooks.com ebooks.

Take any EPUB file and convert it to a static HTML site.

Sample site - https://smoking.pressbooks.com/

We want similar HTML output from any given EPUB file, with a clickable table of contents and one HTML page for each chapter.
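A minimal sketch of the table-of-contents page described above, assuming the chapter titles and per-chapter file names are already known. The function name `build_toc_page` and the `(title, href)` input shape are hypothetical, chosen only for illustration.

```python
from html import escape


def build_toc_page(book_title, chapters):
    """Return an index page linking to one HTML file per chapter.

    chapters: list of (title, href) tuples in reading order.
    (Hypothetical input shape, not taken from any existing tool.)
    """
    items = "\n".join(
        f'    <li><a href="{escape(href, quote=True)}">{escape(title)}</a></li>'
        for title, href in chapters
    )
    return (
        "<!DOCTYPE html>\n"
        '<html>\n<head>\n<meta charset="utf-8">\n'
        f"<title>{escape(book_title)}</title>\n</head>\n<body>\n"
        f"<h1>{escape(book_title)}</h1>\n<ol>\n{items}\n</ol>\n"
        "</body>\n</html>\n"
    )
```

The returned string can be written out as `index.html` next to the chapter files. Titles are HTML-escaped and the page declares UTF-8, so Tamil titles should pass through unchanged.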

iashok22 commented 5 years ago

@tshrinivasan I tried building it in WordPress itself: https://unridable-binders.000webhostapp.com/ebook/aathichoodi Please review. By the way, it is not a static HTML site; it renders the EPUB file content in this reader.

tshrinivasan commented 5 years ago

We need to create HTML versions of all the ebooks and publish them on a website, so that they are easy to find through search engines.

Explore how to convert EPUB to HTML and publish the result as websites.

tshrinivasan commented 5 years ago

@iashok22 Will this WordPress EPUB content be visible to search engines? Can we share a link to a single chapter?

Can we host hundreds of EPUBs in a single WordPress install, or do we need one WordPress install per EPUB?

narendranss commented 5 years ago

@tshrinivasan You reported that a few converters do the conversion but produce a lot of div soup. Can you please provide a sample EPUB and its converted HTML? I went through a few HTML parsers; maybe we can pre-process the converted HTML and build a multi-stage solution.

tshrinivasan commented 5 years ago

@narendranss Download any EPUB from http://freetamilebooks.com/ebooks/thinnai_kadhaigal/

Extract it like a zip file and check the HTML files and their source.
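Since an EPUB is a zip archive, the HTML files can also be located without extracting by hand. A rough sketch using only the Python standard library: it reads `META-INF/container.xml` to find the package (OPF) file, then returns the content documents in spine (reading) order. The function name `chapter_files` is made up for this example, and it assumes a well-formed EPUB.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container file and the OPF package file.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def chapter_files(epub_path):
    """Return the archive paths of the content documents in spine order."""
    with zipfile.ZipFile(epub_path) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        opf = ET.fromstring(zf.read(opf_path))
        # The manifest maps item ids to hrefs relative to the OPF file.
        manifest = {
            item.get("id"): item.get("href")
            for item in opf.findall("opf:manifest/opf:item", NS)
        }
        base = posixpath.dirname(opf_path)
        return [
            posixpath.join(base, manifest[ref.get("idref")])
            for ref in opf.findall("opf:spine/opf:itemref", NS)
        ]
```

Each returned path can be passed to `ZipFile.read()` to inspect the markup that the converter produced.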

tshrinivasan commented 4 years ago

Met Pitchaimuthu today.

He said he would help convert the books into sites like https://thanithamizhakarathikalanjiyam.github.io/v2/kanmani_tamil_9_4

tshrinivasan commented 4 years ago

sample request

http://freetamilebooks.com/download/%e0%ae%ae%e0%ae%be%e0%ae%b0%e0%af%8d%e0%ae%95%e0%ae%b4%e0%ae%bf%e0%ae%a4%e0%af%8d-%e0%ae%a4%e0%ae%bf%e0%ae%99%e0%af%8d%e0%ae%95%e0%ae%b3%e0%af%8d-%e0%ae%ae%e0%ae%a4%e0%ae%bf-%e0%ae%a8%e0%ae%bf/

Download a sample EPUB from the link above.

Try validating it with http://validator.idpf.org/ and https://www.ebookit.com/tools/bp/Bo/eBookIt/epub-validator

We convert doc/docx/odt files to EPUB. LibreOffice adds too many tags to the HTML files, so we need to tidy them. Explore ways to tidy them automatically.

We need only headings, bold, italic, images, and tables in the result files. Remove all other unwanted tags.
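One possible way to drop everything except those elements is a tag allowlist. The sketch below uses only Python's built-in `html.parser`. The exact allowlist is an assumption: `p` is kept so that paragraphs do not run together, and only a few attributes survive. The names `TagAllowlist` and `strip_unwanted_tags` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlist: headings, bold, italic, images, tables, plus paragraphs.
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "b", "strong", "i", "em",
        "img", "table", "thead", "tbody", "tr", "th", "td"}
KEEP_ATTRS = {"img": {"src", "alt"}, "td": {"colspan", "rowspan"},
              "th": {"colspan", "rowspan"}}
VOID = {"img"}                     # elements that take no closing tag
DROP_CONTENT = {"script", "style"}  # drop these together with their text


class TagAllowlist(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0  # depth inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skip += 1
        elif tag in KEEP:
            allowed = KEEP_ATTRS.get(tag, set())
            attr_text = "".join(
                f' {name}="{escape(value or "", quote=True)}"'
                for name, value in attrs if name in allowed
            )
            self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self._skip = max(0, self._skip - 1)
        elif tag in KEEP and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data, quote=False))


def strip_unwanted_tags(html_text):
    """Keep text and allowlisted tags; unwrap every other element."""
    parser = TagAllowlist()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Elements outside the allowlist are unwrapped rather than deleted, so text inside a stray `span` or `font` tag is preserved. This is meant for body fragments; it does not try to repair badly nested markup.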

Duplicate of #24

tshrinivasan commented 4 years ago

Found that the tidy command on Linux cleans up the HTML beautifully.

sudo apt-get install tidy

-clean, -c    replace FONT, NOBR and CENTER tags with CSS (clean: yes)
-gdoc, -g     produce clean version of html exported by Google Docs (gdoc: yes)
-indent, -i   indent element content (indent: auto)

tidy -clean -indent -gdoc wikisource.html > w.html
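To run the same command over every HTML file pulled out of an EPUB, a small wrapper along these lines might work. It is only a sketch: it assumes `tidy` is on the PATH, and `tidy_command` and `tidy_folder` are hypothetical helper names. The flags are the ones from the command above.

```python
import subprocess
from pathlib import Path


def tidy_command(src):
    """Build the tidy invocation used above for a single file."""
    return ["tidy", "-clean", "-indent", "-gdoc", str(src)]


def tidy_folder(src_dir, out_dir):
    """Write a tidied copy of every *.html file in src_dir to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.html")):
        result = subprocess.run(tidy_command(src), capture_output=True)
        # tidy exits with 1 when it only has warnings and 2 on errors,
        # so a non-zero status alone does not mean failure.
        if result.returncode > 1:
            print(f"tidy reported errors in {src.name}, skipped")
            continue
        # Write the raw bytes tidy printed, like the shell redirect does.
        (out_dir / src.name).write_bytes(result.stdout)
```

Output is captured as bytes and written back unchanged, which mirrors the `>` redirect and avoids guessing at the file encoding.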

julientaq commented 4 years ago

Did you try pandoc?

Since it runs in a Linux terminal, you could script the conversion in bash and get fairly clean HTML for every EPUB. You can even adjust the output with filters written in Lua: https://pandoc.org/lua-filters.html

You could try following this guide: https://opensource.com/article/18/10/book-to-website-epub-using-pandoc

(Please note that you don’t really need the Markdown step, because the EPUB would be your source, so you should be pretty much set.)
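As a rough illustration of the batch conversion, here is a Python sketch that calls pandoc once per EPUB. It assumes `pandoc` is installed; the output layout (one folder per book containing `index.html` and a `media` directory) and the helper names `pandoc_command` and `convert_all` are assumptions made for this example.

```python
import subprocess
from pathlib import Path


def pandoc_command(epub_path):
    """Build a pandoc call that turns one EPUB into a standalone HTML page."""
    return [
        "pandoc", str(epub_path),
        "--from=epub", "--to=html",
        "--standalone",             # emit a full HTML document
        "--toc",                    # add a table of contents
        "--extract-media=media",    # save images next to the output
        "-o", "index.html",
    ]


def convert_all(epub_dir, out_dir):
    """Convert every *.epub in epub_dir into out_dir/<book name>/index.html."""
    for epub in sorted(Path(epub_dir).glob("*.epub")):
        book_dir = Path(out_dir) / epub.stem
        book_dir.mkdir(parents=True, exist_ok=True)
        # Run inside the book folder so the media links stay relative.
        subprocess.run(pandoc_command(epub.resolve()), cwd=book_dir, check=True)
```

This produces a single HTML page per book, not one page per chapter, so splitting by chapter would still need a separate step.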

BharathLenin commented 4 years ago

Working on this using Pandoc

https://pandoc.org/

BharathLenin commented 4 years ago

@tshrinivasan I have done a very high-level implementation using Pandoc that converts an EPUB file to a static HTML file, with a table of contents and a footer added as generated metadata.

https://github.com/BharathLenin/epub_to_html

nifey commented 2 years ago

Hello @tshrinivasan, at one of the ILUGC meetups you mentioned this issue of converting EPUBs to static web pages similar to Read the Docs.

During FOSSHack (last weekend), my friends and I wrote epub2sphinx, a CLI tool that converts a given EPUB into ReST files (using pandoc), which can then be converted to static HTML using Sphinx. The advantage of converting to ReST instead of directly to HTML is that we can use any Sphinx theme.

It's not perfect yet; there are still improvements to be made. This tool will be helpful for making the FTE books available to read online. Some screenshots with an FTE book:

[Screenshots, 2021-11-16, of the book எளிய தமிழில் Computer Vision: the index page, the front page, and the chapter வண்ண மாதிரிகள் (Color models)]

tshrinivasan commented 2 years ago

Wonderful.

Thanks for the great help.

Share the code link also.

nifey commented 2 years ago

The code is available at https://github.com/nifey/epub2sphinx

tshrinivasan commented 1 year ago

https://github.com/OpenBookPublishers/epublius

This tool from openbookpublishers.com seems to do the same thing. Will explore it too.

ThangaAyyanar commented 3 months ago

https://richardwong.io/post/tools/2024-04-28-sun-epubs-and-html/

This article contains code written in Python, using Beautiful Soup, to convert EPUB to HTML. We can explore this option too.

sample code: https://git.richardwong.io/richard/epub_to_html/src/branch/main