HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

Investigate using EPUB for ebooks #1390

Open j9t opened 4 years ago

j9t commented 4 years ago

The 2019 ebook refers to and depends on links, but none seem to be present, at least in the version on Google Play Books.

Screenshots attached for both mobile and desktop. I haven’t checked the whole book, but so far no link seems to be accessible. (This doesn’t seem intended; if it was, please consider links at least for author information and for anything indicated as a link, as with expressions starting with “http”.)

Screen Shot 2020-10-28 at 17 35 41

Screenshot_20201028-162047

Screenshot_20201028-162237


Also some centering issues as noted in #1391

rviscomi commented 4 years ago

@tunetheweb seems like the URLs in the footnotes were all removed by Books. Any ideas for working around that?

tunetheweb commented 4 years ago

Oh, looks like they are gone in the PDF version too: https://almanac.httparchive.org/static/pdfs/web_almanac_2019_en.pdf

The links still work (at least on PDFs) but the footnotes showing the URL are gone.

Will take a look.

tunetheweb commented 4 years ago

I tell a lie - they are still there online. Phew.

@rviscomi, I don't show footnotes when the full URL is already shown, as that seems a bit redundant.

So my profile for example at the bottom of the HTTP/2 chapter has this:

Barry Pollard profile

But only includes the hidden link to my book, at the bottom of the page:

Footnote 46

There is no footnote URL for my social media icons, nor for the Twitter account and website in the text. This was for presentational reasons, as otherwise we ended up with loads and loads of footnotes that looked really untidy. To me the URL is obvious from those links, so it felt better to hide them.
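The "skip the footnote when the full URL is already shown" rule can be sketched roughly as below. This is only an illustration, not the Almanac's actual generator code: `needs_footnote` and `_normalize` are hypothetical names, and the sketch covers only the case where the link text is itself the URL (it does not model the social media icon or Twitter handle cases).

```python
def _normalize(text):
    """Reduce a URL or link text to a comparable form.

    Simplification (assumption): scheme, a leading "www." and a trailing
    slash are ignored, and everything is lower-cased, including the path.
    """
    text = text.strip().lower()
    for prefix in ("https://", "http://"):
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    if text.startswith("www."):
        text = text[len("www."):]
    return text.rstrip("/")


def needs_footnote(link_text, href):
    """Return True when the visible link text does not already show the URL."""
    return _normalize(link_text) != _normalize(href)
```

With this rule, a link whose text is `https://www.example.com/` pointing at `https://example.com` gets no footnote, while a link with the text `my book` pointing at `https://example.com/book` does.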

It appears Google Books does include the footnotes, so that's good. @j9t I presume that is what you are showing in your 3rd screenshot with Una's links showing?

However, it looks like it removes the clickable links themselves 😞 Both from the original link and the footnotes. The PDF version has clickable links both in the text and in the footnotes, which is much nicer.

I suspect this is to do with the auto conversion of PDF to EPUB. I did look at converting our PDF to EPUB (using Calibre) but didn't get nice results. The table of contents for example is below:

Calibre EPUB conversion

So on one hand I'm impressed that Google Books did such a good job on converting so it looks nice. But unfortunately it doesn't retain links. I don't think we can solve this until we find a decent PDF -> EPUB conversion tool.

j9t commented 4 years ago

On the chance that I can contribute somehow:

  1. For PDF to EPUB conversion there must be many tools—could some other tool do the job?

  2. What’s the source material: Markdown, HTML? I can’t tell how much work it would be to switch, but Leanpub is one example where that conversion works really well, generating decently formatted books from HTML or Markdown into PDF, EPUB, and MOBI. I swear by it, and the output is definitely compatible with Google Books.

Just on the chance this can be useful. I can tell that a lot of work already went into this.

tunetheweb commented 4 years ago

Just found this note: https://support.google.com/books/partner/answer/107073?hl=en&ref_topic=3238502

Hyperlinks: If your PDF contains hyperlinks, either to other parts of the same book or to external websites, please note that the links will be disabled when your book is processed.

Looks like it is supported for EPUB though: https://support.google.com/books/partner/answer/3316879?hl=en&ref_topic=3238502

The source is HTML (https://almanac.httparchive.org/en/2019/ebook) styled with CSS Paged Media. We then use PrinceXML to convert to PDF.

I'm open to ideas on better EPUB converters. Calibre seemed to be a recommended free one the last time I looked but, as I say, the results weren't great.
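When comparing converters, one quick check is whether the external links survive in the generated EPUB. An EPUB is a ZIP archive of (X)HTML content documents, so a rough sketch using only the Python standard library could look like the following. The function name `external_links_in_epub` is hypothetical, and the sketch assumes the content documents use `.xhtml`, `.html` or `.htm` extensions and are UTF-8 encoded.

```python
import io
import zipfile
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects href values of <a> tags that point to http(s) URLs."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(("http://", "https://")):
                self.external.append(href)


def external_links_in_epub(epub_file):
    """Return all external link targets found in an EPUB's content documents.

    epub_file may be a path or a binary file-like object.
    """
    links = []
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            # Assumption: content documents are identified by extension only.
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _LinkCollector()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                parser.close()
                links.extend(parser.external)
    return links
```

Running `len(external_links_in_epub("web_almanac_2019_en.epub"))` (a hypothetical file name) on the output of each candidate converter would show at a glance whether it keeps the links or drops them.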

rviscomi commented 3 years ago

Renaming the issue to focus on the EPUB format, which should solve both the centering issue in #1391 and the clickable links issue reported here.