brobertson / Cariboo

Linux tools to convert Perseus TEI P4 documents into epub files.

Bug Report:

A full list of which documents are affected by which bugs can be found here: https://docs.google.com/spreadsheet/ccc?key=0Aokb5XTFASILdGkwaTF4aUtQbGhKMkllenlSSVcwNnc

NO CONTENT:

Some documents come out with only a cover and flyleaf and none of the text itself. So far this seems to be caused by a lack of div tags within the xml document. If we can add a step to tei2epub that wraps the text in a generic div tag whenever none is present, this should solve the issue.
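The wrapping step described above could be sketched as a pre-processing pass before tei2epub runs. This is only an illustration in Python, not code from the project: the function name `ensure_div` is hypothetical, and it assumes the input parses as plain XML with a `body` element and no unresolved entities. It also treats the numbered TEI P4 divisions (`div1`, `div2`, ...) as divs, which may or may not match what tei2epub looks for.

```python
import xml.etree.ElementTree as ET


def ensure_div(tei_xml: str) -> str:
    """Wrap the contents of <body> in one generic <div> if the body has no div.

    Hypothetical helper: returns the input unchanged when a div already
    exists or when no <body> element is found.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//body")
    if body is None:
        return tei_xml  # nothing to wrap

    # Accept both unnumbered <div> and numbered divisions such as <div1>.
    has_div = any(
        el.tag == "div" or (el.tag.startswith("div") and el.tag[3:].isdigit())
        for el in body.iter()
    )
    if has_div:
        return tei_xml

    # Move the body's leading text and all of its children into the wrapper.
    wrapper = ET.Element("div")
    wrapper.text = body.text
    body.text = None
    for child in list(body):
        body.remove(child)
        wrapper.append(child)
    body.append(wrapper)
    return ET.tostring(root, encoding="unicode")
```

For example, `ensure_div("<TEI.2><text><body><p>Hello</p></body></text></TEI.2>")` returns the same document with the paragraph nested inside a new div, while a document that already contains a div is returned as-is.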

REPEATED CHAPTERS:

Once again due to div tags, chapters are being repeated throughout the epub document. The issue lies in how the document is split into individual html files. tei2epub appears to create an html file for every div, and each file includes everything inside that div. When divs are nested inside one another, the nested text is therefore written out more than once. For example, if a text looks like this:

<div>
  <div> Hello! My name is George! </div>
  <div> I work in a shoe factory. </div>
  <div> I have three children.
    <div> Their names are Sarah, John and Mark. </div>
  </div>
</div>

The final document would appear:

Hello! My name is George! I work in a shoe factory. I have three children. Their names are Sarah, John and Mark. I work in a shoe factory. I have three children. Their names are Sarah, John and Mark. Their names are Sarah, John and Mark.

I have the following code, which should fix the issue:

<xsl:if test="not(ancestor::div)"> <!-- create the html file for this div here --> </xsl:if>

If we can find exactly where the html files are being created, we should be able to slip this in and fix the issue.
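To see what the `not(ancestor::div)` test would change, here is a small Python sketch that mimics both behaviours on the George example. It is not tei2epub's code: the two function names are hypothetical, and "one file per div, containing all of that div's text" is my assumption about how the files are currently produced.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div>
  <div> Hello! My name is George! </div>
  <div> I work in a shoe factory. </div>
  <div> I have three children.
    <div> Their names are Sarah, John and Mark. </div>
  </div>
</div>"""


def flat_text(el):
    """All text inside an element, with whitespace collapsed."""
    return " ".join("".join(el.itertext()).split())


def files_for_every_div(root):
    """Assumed current behaviour: one output file per div, nested or not."""
    return [flat_text(d) for d in root.iter("div")]


def files_for_top_level_divs(root):
    """Same idea as not(ancestor::div): skip any div that sits inside a div."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}

    def has_div_ancestor(el):
        p = parent.get(el)
        while p is not None:
            if p.tag == "div":
                return True
            p = parent.get(p)
        return False

    return [flat_text(d) for d in root.iter("div") if not has_div_ancestor(d)]
```

On this sample the first function yields five files with overlapping text, and the second yields a single file containing each sentence once. Note that when a text has one outer div around everything, the test removes the repetition but also puts the whole text into one file, so the depth at which we split may still need some thought.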

PAGEBREAKS:

In many documents only a few lines or words appear on each page when viewed. I'm not entirely sure why this is happening, but it is likely due either to a paragraphing issue in the xml itself or to another issue with generic div tags.

NO TOC:

This is not so much a bug as a formatting choice. We have written into tei2epub that certain dramas should not have any TOC. This was originally due to a titling issue in which chapters were created every few lines with names such as "episode". While we may wish to change this later, it has been documented so that we can distinguish the documents affected by this code from those that aren't.

NO AUTHOR:

This may simply be because the author of a work is not yet known and is therefore unlisted. It has been documented in case it is instead caused by an issue with the program.

EXCESS BOLD:

In these documents the first few pages are entirely bold, as if tei2epub were reading them as a heading. It is possible that the text is in fact coded as a heading and that this is intended. Since I don't know which cases are intended and which are not, all have been recorded in the linked chart.

AUTHOR RUNS OFF:

Occasionally the author's name is so long that it does not fit on the cover image. To fix this we need to define a maximum width for the author's name and shrink the text until it fits. These edits belong in the included bash script (tei_p4_to_epub_and_mobi.sh), which creates the cover image.
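The resize logic could look roughly like the following. This is a Python sketch of the idea only, not a patch for the bash script: the function name, the default sizes, and the `char_width_ratio` estimate (average glyph width as a fraction of the point size) are all assumptions, and a real fix would preferably measure the rendered text with whatever tool the script uses to draw the cover.

```python
def fit_pointsize(text, max_width_px, start_pt=48, min_pt=12, char_width_ratio=0.6):
    """Return the largest point size between min_pt and start_pt that fits.

    The width is only estimated as: characters * point size * char_width_ratio.
    If even min_pt is too wide, min_pt is returned and the text may still
    need to be wrapped or truncated.
    """
    for pt in range(start_pt, min_pt - 1, -1):
        estimated_width = len(text) * pt * char_width_ratio
        if estimated_width <= max_width_px:
            return pt
    return min_pt
```

A short name keeps the starting size, while a 50-character name on a 610 pixel wide cover drops to 20 points under these assumed numbers.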

TITLE RUNS OFF:

As with the author's name, the title is occasionally too long to fit on the cover image. The same code that fixes this bug for the author's name should be applied to the title as well.

TEXT IN LINE-NUMBERS:

Very rarely, large sections of the text itself are given the css for line-numbers, which changes their colour and orientation. These cases seem to be one-offs and may be attributable to malformed xml.

ODD TOC TITLES:

In some of the Tables of Contents, the listed titles are slightly longer than usual and look peculiar. Someone with knowledge of the texts may be able to recognize whether these are intended or a bug that needs to be addressed.

'MACHINE READABLE TEXT':

This text has been removed from the cover titles but still appears in the document title, so devices still list the book with this text included. It needs to be removed so that it does not appear in the official title for the epub.
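Stripping the phrase from a title string could be sketched as below. This is Python for illustration rather than the project's code: the function name is hypothetical, and the pattern assumes the phrase shows up in forms like "Iliad. Machine readable text" or "Odyssey (Machine readable text)", which should be checked against the actual titles.

```python
import re

# The phrase itself plus any punctuation or brackets directly around it.
_MACHINE_READABLE = re.compile(
    r"[\s.,:;\-\[(]*machine[\s-]*readable\s+text[\s.\])]*",
    re.IGNORECASE,
)


def clean_title(title: str) -> str:
    """Remove 'Machine readable text' and its surrounding punctuation."""
    return _MACHINE_READABLE.sub(" ", title).strip()
```

Titles that do not contain the phrase pass through unchanged, so the same function could be applied to every document.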