bramstein / typeset

TeX line breaking algorithm in JavaScript
BSD 2-Clause "Simplified" License

Flatland example has a line that wraps badly #22

Open chris-morgan opened 8 years ago

chris-morgan commented 8 years ago
[Image: a screenshot demonstrating the problem]

If you are unable to reproduce it I’m quite willing to assist further in great detail. I have a project I believe this will work very well for and I want to help iron out any issues like this so people that use my software don’t run into this sort of thing ever.

PhilterPaper commented 3 years ago

Playing with typeset on Windows 10/Firefox, I noticed the same bad break. Also notice that the continuation line "ed traveller..." is extremely tight. As the previous (short) line is already fairly loose, I think you could probably fit the entire "absent-minded" on that line, which would save a line and help the extremely tight following line loosen up. It might look even better to split "inconsiderate" and put "in-" on the loose line before. Then there would be plenty of room to move "minded" up a line, and the following line (without "ed") would be loosened. To get the desired indentation and space around the figure, you might have to extend the list of short line lengths. Other than that one thing, it looks pretty decent.

I wonder if this example used Knuth-Liang hyphenation? I don't think it's supposed to word split at the last 2 letters (3 is the minimum).

My purpose in looking at typeset is that I have just taken over the Perl port of it from Simon Cozens (see PhilterPaper/text-knuthplass, and Text::KnuthPlass on CPAN). In some examples I added, I see some poor break behavior, such as three lines in a row with split (hyphenated) words, including the penultimate line. My understanding of Knuth-Plass is that there should be large penalties for hyphenating two or more lines in a row, as well as for the next-to-last line. I need to go through and see if there were any bug fixes or enhancements missed by Simon leading up to the 2011-03-17 release of his port, as well as work that Bram has done since. Further discussion or suggestions are welcome here or on PhilterPaper/text-knuthplass issues.
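For anyone comparing ports, here is a minimal sketch of the per-line demerits as I read them in the Knuth-Plass paper, including the extra charge when two consecutive breaks are both flagged (hyphen) breaks. The function name, the argument order, and the `opts.line` / `opts.flagged` fields are assumptions made for illustration; they are not taken from typeset or Text::KnuthPlass.

```javascript
// Sketch only: names and option fields are hypothetical.
// ratio       - adjustment ratio of the candidate line
// penalty     - penalty value at the candidate break (0 for a plain glue break)
// flaggedHere - true if this break is a flagged (hyphen) break
// flaggedPrev - true if the previous break was a flagged break
function lineDemerits(ratio, penalty, flaggedHere, flaggedPrev, opts) {
    var badness = 100 * Math.pow(Math.abs(ratio), 3);
    var base = Math.pow(opts.line + badness, 2);
    var demerits;

    if (penalty >= 0) {
        demerits = base + penalty * penalty;
    } else if (penalty !== -Infinity) {
        demerits = base - penalty * penalty;
    } else {
        demerits = base; // forced break: the penalty itself is not counted
    }

    // Two hyphenated lines in a row cost extra.
    if (flaggedHere && flaggedPrev) {
        demerits += opts.flagged;
    }
    return demerits;
}
```

With a line already stretched to ratio 1, `(line + badness)^2` is 12100, so an extra 100 for a second consecutive hyphen is small by comparison, while 3000 is a noticeable share. That may be part of why the size of this value matters for runs of hyphenated lines.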

The same bad line wrap shows up in the frobnitzem/typeset fork.

PhilterPaper commented 3 years ago

A slight possibility is that insufficient demerits are applied at the hard hyphen break (see #27). It might be interesting to revisit this issue once the demerits value is increased from 100 to something approaching 3000 (better yet, configurable).

PhilterPaper commented 3 years ago

I changed every '100' to '3000' (except percentage calculations) in flatland/index.html and src/*.js, and it appeared to have no effect (at least, it didn't clear up this problem). Either I missed something or this wasn't the cause.

PhilterPaper commented 2 years ago

I took another look at this, after changing the hyphenation penalty back to 100 (from 3000). It appears that the hyphenation may have been getting messed up by compound words such as absent-minded, so I added code to lib/hypher.js (src/hypher.js for frobnitzem/typeset) to handle hyphenated words like that:

    var characters,
        characterPoints = [],
        words = [],          // <=== new
        compound = [],       // <=== new
        originalCharacters,
        // ...
        result = [''];       // <=== existing code

    // Handle compound words made up of simple words separated by hyphens.
    // Do similar code for other compound-word joiners, or just treat any
    // non-letter sequence as a joiner. Don't forget accented letters and SHY,
    // as well as non-breaking spaces etc.
    // NOTE: i and j are assumed to be declared with the other locals above;
    // if they are not, add them to the var list so they don't leak as globals.
    if (word.indexOf('-') !== -1) {
        words = word.split('-'); // list of simple words in compound word
        compound = [];
        for (i = 0; i < words.length; i++) {
            // '-' at beginning of word, or '--' in word? 0-length words[i]
            if (words[i].length === 0 && i < words.length - 1) {
                compound.push('-');  // just an empty word with existing hyphen
            } else {
                j = this.hyphenate(words[i]); // fragments of one simple word
                compound.push(...j);
                if (i < words.length - 1) { // don't add hyphen to last simple word
                    compound[compound.length - 1] += '-';
                }
            }
        }
        return compound;
    } // --- end of new code

    if (this.exceptions.hasOwnProperty(word)) {   // <=== existing code

Yeah, it's an ugly hack, but I detest JavaScript because the diagnostics and error messages are literally non-existent. If someone would like to improve it, be my guest. It appears that the hyphenateText() function above it is not called at all, which is a shame, as it appears to contain code intended to properly handle compound words! Also, take heed of the note about just the ASCII hyphen (U+002D) being supported -- a more general case would handle /, :, etc. -- maybe any punctuation sequence as a joiner. Also watch out for accented letters and non-breaking punctuation, and handle &shy; correctly.
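As a rough illustration of the "any punctuation sequence as a joiner" idea, here is a standalone sketch. `hyphenateCompound` and its `hyphenateSimple` callback are hypothetical names; the callback stands in for whatever hyphenates one simple word (for example the existing hyphenate() method). It treats any run of Unicode letters and combining marks as a simple word and glues every other run onto the fragment before it. It does not try to keep non-breaking joiners (such as U+00A0 or U+2011) attached to the following fragment, so that case still needs separate handling.

```javascript
// Sketch only: hyphenateCompound and hyphenateSimple are illustrative names.
// hyphenateSimple(word) must return an array of fragments for a simple word.
function hyphenateCompound(word, hyphenateSimple) {
    // Alternating runs of letters (plus combining marks) and everything else.
    var runs = word.match(/[\p{L}\p{M}]+|[^\p{L}\p{M}]+/gu) || [];
    var fragments = [];

    runs.forEach(function (run) {
        if (/^[\p{L}\p{M}]/u.test(run)) {
            // A simple word: hyphenate it on its own.
            fragments = fragments.concat(hyphenateSimple(run));
        } else if (fragments.length > 0) {
            // A joiner: keep it at the end of the preceding fragment.
            fragments[fragments.length - 1] += run;
        } else {
            // Leading punctuation has nothing to attach to yet.
            fragments.push(run);
        }
    });
    return fragments;
}
```

Called with a callback that returns `['ab', 'sent']` for "absent" and `['mind', 'ed']` for "minded", the word "absent-minded" comes back as `['ab', 'sent-', 'mind', 'ed']`, which matches the shape the hack above produces for the plain hyphen case.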

It looks like the hyphenation code is really doing a bad job. I saw absent-minded broken up as ab.sen.t-.mind.ed, magazines as ma.ga.zi.nes, and pentagonal as pen.tag.on.al. There are probably more that I overlooked. I don't know if Bram isn't quite using the Knuth-Liang algorithm, or if there's a problem with the patterns, but you should be careful if using this code for serious work (where bad word-breaking would make you look bad).

Finally, the code works (more or less) with the minimum suffix (minimum tail of the word) set at 2 letters. As English prefers a minimum of 3, I tried that (lib/en-us.js, src/pattern.js in frobnitzem/typeset), but then this one paragraph (narrowed for Figure 2) showed only three lines and skipped 10 blank ones, resuming (no text lost) after the figure. The frobnitzem/typeset Flatland example behaves the same way -- both push 13 short line lengths onto the Line Lengths list, but for some reason nothing shows up for 10 lines!
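To make it easier to check the library's output against the textbook behavior, here is a small self-contained sketch of Liang-style pattern hyphenation with minimum prefix and suffix lengths. It is not the code from hypher.js. The function names are made up, the pattern list in the usage note is a toy list and not the en-us set, and exceptions are not handled.

```javascript
// Sketch only: parsePattern and liangHyphenate are illustrative names.
// Turns "hy3ph" into { letters: "hyph", points: [0, 0, 3, 0, 0] }, where
// points[k] is the value in front of letters[k] and the last entry is the
// value after the final letter.
function parsePattern(pattern) {
    var letters = '', points = [0];
    for (var i = 0; i < pattern.length; i++) {
        var c = pattern.charAt(i);
        if (c >= '0' && c <= '9') {
            points[points.length - 1] = Number(c);
        } else {
            letters += c;
            points.push(0);
        }
    }
    return { letters: letters, points: points };
}

function liangHyphenate(word, patterns, leftMin, rightMin) {
    var padded = '.' + word.toLowerCase() + '.';
    // scores[i] is the highest pattern value seen in front of padded[i].
    var scores = new Array(padded.length + 1).fill(0);

    patterns.map(parsePattern).forEach(function (pat) {
        var at = padded.indexOf(pat.letters);
        while (at !== -1) {
            for (var k = 0; k < pat.points.length; k++) {
                scores[at + k] = Math.max(scores[at + k], pat.points[k]);
            }
            at = padded.indexOf(pat.letters, at + 1);
        }
    });

    // An odd score allows a break; leftMin/rightMin trim both ends.
    var pieces = [], start = 0;
    for (var j = leftMin; j <= word.length - rightMin; j++) {
        if (scores[j + 1] % 2 === 1) {
            pieces.push(word.slice(start, j));
            start = j;
        }
    }
    pieces.push(word.slice(start));
    return pieces;
}
```

With the toy list `['hy3ph', 'he2n', 'hena4', 'hen5at', '1na', 'n2at', '1tio', '2io', 'o2n']`, "hyphenation" splits as hy, phen, ation. With the single toy pattern `'d1ed'`, "minded" splits as mind, ed when the minimum suffix is 2 and stays whole when it is 3, which is the effect the suffix setting is meant to have. This says nothing about why raising the setting in the library produced the blank lines.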

Unfortunately, none of this work managed to fix the original problem. At this point I'm going to take a break, and may look at it later. Much later.

PhilterPaper commented 2 years ago

This thing was gnawing at me, as I want to check my Perl code against this package, so I took another look at it. I have concluded that the Knuth-Plass code appears to be working correctly, but something in the resulting HTML and CSS isn't behaving exactly as expected when the browser gets it. Here is the offending paragraph, as the "output" array before it's smooshed into an HTML string and replaces the old paragraph text:

0: "<span style=\"margin-left: 20px;\"></span>"   <== paragraph indentation 20px
1: "<span style=\"word-spacing: -1px;\">Square&nbsp;</span>"
2: "<span style=\"word-spacing: 0px;\">and&nbsp;triangular&nbsp;houses&nbsp;are&nbsp;not </span>"
3: "<span style=\"word-spacing: 0px;\">allowed,&nbsp;and&nbsp;for&nbsp;this&nbsp;reason.&nbsp;The&nbsp;angles </span>"
4: "<span style=\"word-spacing: 2px;\">of&nbsp;a&nbsp;Square&nbsp;</span>"
5: "<span style=\"word-spacing: 1px;\">(and&nbsp;still&nbsp;more&nbsp;those&nbsp;of&nbsp;an </span>"
6: "<span style=\"word-spacing: 2px;\">equilateral&nbsp;</span>"
7: "<span style=\"word-spacing: 1px;\">Triangle,)&nbsp;being&nbsp;much&nbsp;more </span>"
8: "<span style=\"word-spacing: 1px;\">pointed&nbsp;than&nbsp;</span>"
9: "<span style=\"word-spacing: 2px;\">those&nbsp;of&nbsp;a&nbsp;Pentagon,&nbsp;and </span>"
10: "<span style=\"word-spacing: 1px;\">the&nbsp;</span>"
11: "<span style=\"word-spacing: 2px;\">lines&nbsp;of&nbsp;inanimate&nbsp;objects&nbsp;(such&nbsp;as </span>"
12: "<span style=\"word-spacing: 4px;\">houses)&nbsp;being&nbsp;dimmer&nbsp;than&nbsp;the&nbsp;lines </span>"
13: "<span style=\"word-spacing: 5px;\">of&nbsp;Men&nbsp;</span>"
14: "<span style=\"word-spacing: 6px;\">and&nbsp;Women,&nbsp;it&nbsp;follows&nbsp;that </span>"
15: "<span style=\"word-spacing: 2px;\">there&nbsp;is&nbsp;</span>"
16: "<span style=\"word-spacing: 1px;\">no&nbsp;little&nbsp;danger&nbsp;lest&nbsp;the&nbsp;points </span>"
17: "<span style=\"word-spacing: 4px;\">of&nbsp;a&nbsp;square&nbsp;</span>"
18: "<span style=\"word-spacing: 5px;\">or&nbsp;triangular&nbsp;house&nbsp;</span>"
19: "<span style=\"word-spacing: 4px;\">resi&shy;dence&nbsp;might&nbsp;do&nbsp;</span>"
20: "<span style=\"word-spacing: 5px;\">serious&nbsp;injury&nbsp;to&nbsp;an </span>"
21: "<span style=\"word-spacing: 3px;\">inconsiderate&nbsp;</span>"
22: "<span style=\"word-spacing: 2px;\">or&nbsp;perhaps&nbsp;</span>"
23: "<span style=\"word-spacing: -2px;\">absent-mind&shy;ed&nbsp;</span>"
24: "<span style=\"word-spacing: -1px;\">traveller&nbsp;suddenly&nbsp;therefore,&nbsp;running </span>"
25: "<span style=\"word-spacing: 2px;\">against&nbsp;them:&nbsp;and&nbsp;as&nbsp;early&nbsp;</span>"
26: "<span style=\"word-spacing: 1px;\">as&nbsp;the&nbsp;eleventh&nbsp;century&nbsp;of&nbsp;our&nbsp;era,&nbsp;triangular&nbsp;houses </span>"
27: "<span style=\"word-spacing: 0px;\">were&nbsp;universally&nbsp;forbidden&nbsp;</span>"
28: "<span style=\"word-spacing: 1px;\">by&nbsp;Law,&nbsp;the&nbsp;only&nbsp;exceptions&nbsp;being&nbsp;fortifications,&nbsp;</span>"
29: "<span style=\"word-spacing: 2px;\">pow&shy;der-magazines,&nbsp;</span>"
30: "<span style=\"word-spacing: 1px;\">barracks,&nbsp;and&nbsp;other&nbsp;state&nbsp;buildings,&nbsp;which&nbsp;it&nbsp;is&nbsp;not&nbsp;desirable&nbsp;that </span>"
31: "<span style=\"word-spacing: 0px;\">the&nbsp;general&nbsp;public&nbsp;should&nbsp;approach&nbsp;without&nbsp;circumspection.</span>"

It's doing a bunch of stuff which seems rather inefficient. I don't think word-spacing requires an integer number of px, and I don't see why spaces are replaced with nbsp's. I'm also not sure what controls the line lengths. I have tried the following hand-coded replacement, using KP's line splitting, with good results:

<span style="margin-left: 20px;"></span>
<span style="max-width: 263px; word-spacing: -0.2px;">Square and triangular houses are not</span><br/>
<span style="max-width: 263px; word-spacing: 0px;">allowed, and for this reason. The angles</span><br/>
<span style="max-width: 263px; word-spacing: 1.3px;">of a Square (and still more those of an</span><br/>
<span style="max-width: 263px; word-spacing: 1.5px;">equilateral Triangle,) being much more</span><br/>
<span style="max-width: 263px; word-spacing: 2.1px;">pointed than those of a Pentagon, and</span><br/>
<span style="max-width: 263px; word-spacing: 1.8px;">the lines of inanimate objects (such as</span><br/>
<span style="max-width: 263px; word-spacing: 4.4px;">houses) being dimmer than the lines </span><br/>
<span style="max-width: 263px; word-spacing: 5.2px;">of Men and Women, it follows that</span><br/>
<span style="max-width: 263px; word-spacing: 2.2px;">there is no little danger lest the points</span><br/>
<span style="max-width: 263px; word-spacing: 5.0px;">of a square or triangular house resi-</span><br/>
<span style="max-width: 263px; word-spacing: 4.9px;">dence might do serious injury to an</span><br/>
<span style="max-width: 263px; word-spacing: 2.8px;">inconsiderate or perhaps absent-mind-</span><br/>
<span style="max-width: 263px; word-spacing: -0.1px;">ed traveller suddenly therefore, running</span><br/>
<span style="max-width: 534px; word-spacing: 2.0px;">against them: and as early as the eleventh century of our era, triangular houses</span><br/>
<span style="max-width: 534px; word-spacing: 0.6px;">were universally forbidden by Law, the only exceptions being fortifications, pow-</span><br/>
<span style="max-width: 534px; word-spacing: 1.5px;">der-magazines, barracks, and other state buildings, which it is not desirable that</span><br/>
<span style="max-width: 534px; word-spacing: 0px;">the general public should approach without circumspection.</span><br/>

Note that the line length is specified per line, so the browser has nothing to do with it. In both cases, I think the <p> and </p> tags are preserved, and the stuff in-between rebuilt. There may be some additional HTML and/or CSS in the existing code, to control the different line lengths.
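For experimenting with this one-span-per-line layout, here is a minimal sketch that builds such markup from a list of already-broken lines. All names (`escapeHtml`, `wordSpacingFor`, `renderLines`, and the `width`, `naturalWidth`, `hyphenated`, `last` fields) are hypothetical, and `naturalWidth` is assumed to be a measured width of the text with normal spaces. CSS `word-spacing` takes a length, so fractional and negative values are fine. One caveat, as far as I can tell: `max-width` does not apply to non-replaced inline elements such as `<span>`, so in this markup it is the `<br/>` that actually ends each line.

```javascript
// Sketch only: none of these helpers exist in typeset.
function escapeHtml(s) {
    return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Extra space per inter-word gap, rounded to one decimal place.
function wordSpacingFor(targetWidth, naturalWidth, gapCount) {
    if (gapCount === 0) { return 0; }
    return Math.round(((targetWidth - naturalWidth) / gapCount) * 10) / 10;
}

// lines: [{ text, width, naturalWidth, hyphenated, last }]
function renderLines(lines) {
    return lines.map(function (line) {
        var gaps = line.text.split(' ').length - 1;
        // The final line of a paragraph is left at its natural spacing.
        var spacing = line.last ? 0 :
            wordSpacingFor(line.width, line.naturalWidth, gaps);
        return '<span style="max-width: ' + line.width + 'px; word-spacing: ' +
            spacing.toFixed(1) + 'px;">' + escapeHtml(line.text) +
            (line.hyphenated ? '-' : '') + '</span><br/>';
    }).join('\n');
}
```

For example, a line `{ text: 'of a square or triangular house resi', width: 263, naturalWidth: 233, hyphenated: true }` (233 is a made-up measurement) has six gaps, so it gets `word-spacing: 5.0px` and a visible trailing hyphen.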