fatty- / daisy-pipeline

Automatically exported from code.google.com/p/daisy-pipeline

Feature request: split large text files into several smaller files #309

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Some producers will have existing content with only one really huge (3MB+) text 
content document (XHTML or DTBook format). We should have an option in our 
conversion scripts to split this into several smaller files in the EPUB output. 
Having several smaller text files improves performance dramatically in reading 
systems.

For example, the DAISY-to-EPUB scripts could behave in one of several ways:

1. Split the text content document based on the table of contents, creating 
one file per top-level chapter (a rough sketch of this option follows the list).

2. Split the text content document based on size in KB. This would be more 
difficult, because it would probably require a Java extension step to XProc; 
however, in some cases it may produce better results than option 1.
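
A minimal sketch of what option 1 could look like, assuming the splitter works 
on the block-level children of a single XHTML body and treats every h1 as a 
chapter boundary (the ChapterChunker class and cutBefore helper are 
illustrative names, not existing Pipeline code):

```java
import org.w3c.dom.Element;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: split a flat list of block-level elements into one
// chunk per top-level chapter by starting a new chunk before every h1.
// Assumes a namespace-aware parse, so getLocalName() is non-null.
public class ChapterChunker {

    public static List<List<Element>> chunkByChapter(List<Element> blocks) {
        return cutBefore(blocks, "h1");
    }

    // Start a new chunk before every element whose local name matches 'tag'.
    static List<List<Element>> cutBefore(List<Element> blocks, String tag) {
        List<List<Element>> chunks = new ArrayList<>();
        List<Element> current = new ArrayList<>();
        for (Element e : blocks) {
            if (tag.equals(e.getLocalName()) && !current.isEmpty()) {
                chunks.add(current);
                current = new ArrayList<>();
            }
            current.add(e);
        }
        if (!current.isEmpty()) chunks.add(current);
        return chunks;
    }
}
```

Each chunk would then be serialized as its own XHTML file, with the table of 
contents and spine updated to match.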

Original issue reported on code.google.com by marisa.d...@gmail.com on 30 Apr 2013 at 11:45

GoogleCodeExporter commented 9 years ago
Good idea.

At NLB we do something similar when we produce PEF from DTBook. Since the 
resulting PEF files can span thousands of pages, we split the book into 
booklets so that it's easier to send through the postal service. I haven't 
actually played around with it myself, but the algorithm is something like 
this (a rough Java sketch follows the list):

1. if the book is too big, split it into booklets on h1
2. if the booklet is still too big, split on h2
3. if the booklet is still too big, split on h3
4. if the booklet is still too big, split on h4
5. if the booklet is still too big, split on h5
6. if the booklet is still too big, split on h6
7. if the booklet is still too big, split on paragraphs
8. if the booklet is still too big, split on words
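
A rough Java sketch of that cascade (my reading of the steps above, not NLB's 
actual code). It assumes the same imports and the cutBefore(blocks, tag) helper 
from the sketch in the issue description, measures size as the character count 
of the text content, and leaves out step 8 (splitting inside words) for brevity:

```java
// Recursively refine any part that is still too big, moving from h1 down
// to h6 (steps 1-6) and then to paragraphs (step 7).
static List<List<Element>> split(List<Element> blocks, int level, int maxChars) {
    int size = blocks.stream().mapToInt(e -> e.getTextContent().length()).sum();
    if (size <= maxChars || level > 7) {
        return List.of(blocks);                          // small enough, or no finer unit to cut on
    }
    String cutTag = (level <= 6) ? "h" + level : "p";    // h1..h6, then paragraphs
    List<List<Element>> result = new ArrayList<>();
    for (List<Element> part : cutBefore(blocks, cutTag)) {
        result.addAll(split(part, level + 1, maxChars)); // refine oversized parts further
    }
    return result;
}
```

Called as split(bodyBlocks, 1, maxChars), so step 1 (splitting on h1) is tried 
first.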

For books with page breaks we could split the book into pages, or at least try 
to split the book at the closest page break.
For the KB-based approach we could always use XPath to count characters instead 
(see the sketch below).
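
For illustration, the character count could come from a single XPath 1.0 
expression; a hedged sketch using the standard javax.xml.xpath API (file name 
handling and the actual threshold are left out):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Rough size check: count the characters of a content document's text,
// as a stand-in for a KB-based threshold.
public class CharCount {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(args[0]); // e.g. a chunked XHTML file
        Double chars = (Double) XPathFactory.newInstance().newXPath()
                .evaluate("string-length(normalize-space(/))", doc, XPathConstants.NUMBER);
        System.out.println("characters: " + chars.intValue());
    }
}
```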

An issue when splitting a content document might be the CSS files. They would 
probably still work in most cases, but if, for instance, a ":first-child" CSS 
selector is used to apply some special styling to the first chapter, then after 
splitting that styling would suddenly apply to all chapters (each chapter now 
being the first child of its own document).

A common utility step for updating SMIL references to the content files would 
be useful when splitting DAISY 2.02, DAISY 3 and EPUB 3 (see the sketch below).
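
Such a step might boil down to something like the following: after chunking, 
rewrite every SMIL text reference so it points at the file that now contains 
the target fragment. The idToFile map (fragment id to new file name) is 
assumed to be produced by the splitter, and SmilRelinker is an illustrative 
name, not an existing utility:

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.util.Map;

// Illustrative only: remap src="oldfile.xhtml#id" to "<newfile>#id" in SMIL
// <text> elements, using a map from fragment id to the new chunk file.
public class SmilRelinker {

    public static void relink(Document smil, Map<String, String> idToFile) {
        NodeList texts = smil.getElementsByTagNameNS("*", "text");
        for (int i = 0; i < texts.getLength(); i++) {
            Element text = (Element) texts.item(i);
            String src = text.getAttribute("src");
            int hash = src.indexOf('#');
            if (hash < 0) continue;                 // no fragment, nothing to remap
            String fragment = src.substring(hash + 1);
            String newFile = idToFile.get(fragment);
            if (newFile != null) {
                text.setAttribute("src", newFile + "#" + fragment);
            }
        }
    }
}
```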

Original comment by josteinaj@gmail.com on 1 May 2013 at 9:43

GoogleCodeExporter commented 9 years ago
FWIW, DAISY 3 to EPUB 3 should already create one content document for each 
top-level "section" element in the resulting HTML. Such elements are created 
for the front/body/back matter and also for each level1 in the body matter.

I intend to do this for HTML to EPUB 3 too, but only if the input is a single 
HTML document. If the input is a sequence of HTML documents, I would assume it 
is already chunked at the right size.

For now, the chunking heuristic is quite naïve. I agree that checking the 
file size would be better, but as noted above it is significantly more complex.

Original comment by rdeltour@gmail.com on 7 May 2013 at 12:08

GoogleCodeExporter commented 9 years ago
Romain,

I hope all is well. I was playing with the latest command-line client for 
Pipeline 2 to see the level1 file splitting. I took this sample DTBook file:

http://www.daisy.org/sample-content#t4

I then reordered the level2 navigation to be level1, so there would be multiple 
level1s. You can download that file from here:

https://dl.dropboxusercontent.com/u/39156804/you.zip

I did not get multiple XHTML files when I ran it through the dtbook-to-epub3 
converter.

Any idea why?

Gerardo

Original comment by gerar...@benespace.org on 28 Jun 2013 at 10:28

GoogleCodeExporter commented 9 years ago
It turned out that the zedai-to-epub3 internal code was never updated to use 
the revamped HTML chunking facility from html-utils, resulting in poor 
chunking in dtbook-to-epub3 and zedai-to-epub3.

This is fixed in pipeline-scripts PR #21:
https://github.com/daisy-consortium/pipeline-scripts/pull/21

Original comment by rdeltour@gmail.com on 17 Jul 2013 at 9:41

GoogleCodeExporter commented 9 years ago
I'm closing this one. I created a new issue for the split-by-size functionality:
https://code.google.com/p/daisy-pipeline/issues/detail?id=351

Original comment by rdeltour@gmail.com on 17 Jul 2013 at 9:45