brechtm / rinohtype

The Python document processor
http://www.mos6581.org/rinohtype
GNU Affero General Public License v3.0

The first word on a line is never hyphenated #416

Closed jwhitham closed 11 months ago

jwhitham commented 1 year ago

Is there an existing issue for this?

PDF produced by rinohtype

target.pdf

On page 5 there are two long lines. The first line consists of a single long word ("/Example/Demo/Example...") which overflows the right-hand side of the page.

The second line consists of a short word and the same long word (i.e. "word" then "/Example/Demo/Example..."). In this case, the second word is split onto two lines with a hyphen.

Expected behavior: both of these long words ought to be split onto multiple lines with hyphens.

Actual behavior: a long word is not hyphenated when it is the first word on a line.

I think that this is the same issue reported in https://github.com/brechtm/rinohtype/issues/188 . Adding zero-width spaces to the text will avoid the problem (though it introduces another problem, see https://github.com/brechtm/rinohtype/issues/415 ). However, as the long word can be hyphenated, it would be better if it could just be hyphenated - regardless of whether it is the first word, second word, or any other word.
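For readers who want to try the zero-width-space workaround mentioned above, here is a minimal sketch. It is not part of rinohtype: the helper name `add_break_opportunities` and the `min_length` threshold are made up for illustration, and how the patched text gets into the document source depends on your toolchain. It inserts U+200B after each slash that sits between two non-slash characters, so a leading slash and the `//` in a URL are left alone.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE

def add_break_opportunities(text, min_length=20):
    """Insert a zero-width space after interior slashes of long words.

    Hypothetical helper for illustration only; not a rinohtype API.
    Words shorter than min_length are returned unchanged.
    """
    def patch(match):
        word = match.group(0)
        if len(word) < min_length:
            return word
        # Only slashes with a non-slash character on both sides get a
        # break opportunity.
        return re.sub(r"(?<=[^/])/(?=[^/])", "/" + ZWSP, word)

    # Treat every run of non-whitespace characters as one "word".
    return re.sub(r"\S+", patch, text)

patched = add_break_opportunities("word /Example/Demo/Example/Demo", min_length=10)
print(patched.replace(ZWSP, "|"))  # show the inserted break points as "|"
```

As noted above, the zero-width spaces have a side effect of their own (issue 415), so this only sidesteps the hyphenation problem.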

The problem also occurs if all or part of a long word becomes the first word in a line as a result of earlier overflows. The last line on page 5 has this problem: notice that the word spills into the right margin. It ought to be hyphenated again, splitting over three lines, but it is not.
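To make the expected behavior concrete, here is a toy line filler. It is only a sketch under strong assumptions: it is not rinohtype's algorithm, it measures width in characters rather than glyph widths, and it splits a word at an arbitrary character instead of at real hyphenation points. The function name `wrap` is hypothetical. What it shows is the rule asked for here: a word that is too long for a full line gets split with a hyphen wherever it starts, including at the beginning of a line, and the remainder is split again if needed.

```python
def wrap(text, width):
    """Greedy line filling over fixed-width characters (toy model).

    A word longer than a full line is split with a trailing hyphen,
    whether or not it is the first word on the line.
    """
    if width < 2:
        raise ValueError("width must fit one character plus a hyphen")
    lines = []
    current = ""
    for word in text.split():
        while word:
            prefix = current + " " if current else ""
            room = width - len(prefix)
            if len(word) <= room:
                # The (rest of the) word fits on the current line.
                current = prefix + word
                word = ""
            elif len(word) > width and room >= 2:
                # Too long for any line: fill the remaining room,
                # add a hyphen, and carry the remainder over.
                lines.append(prefix + word[:room - 1] + "-")
                word = word[room - 1:]
                current = ""
            else:
                # The word fits on a fresh line: start one and retry.
                lines.append(current)
                current = ""
    if current:
        lines.append(current)
    return lines

for line in wrap("/Example/Demo/Example/Demo", 10):
    print(line)
```

With a width of 10, the 26-character word is spread over three lines, none of which exceeds the width, even though it is the first word on every one of them.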

The problem can happen to short words too. When a very long word is in the second column of a table, the first column may be "squeezed", becoming so narrow that even a relatively short word needs to be hyphenated. However, if that word is the first word on the line, it can't be hyphenated, so it overflows into the second column.

I think https://github.com/brechtm/rinohtype/blob/b7be22f68c78fe3e21de39949d51c7b474d1ac1a/src/rinoh/paragraph.py#L1104 may be where the first word on a line is treated differently. Is there any way that such a word could be hyphenated?

Source files

no-hyphenation-for-first-word.zip

The bug can be reproduced by running "demo.bat".

Versions

c:\doctools\.venv\lib\site-packages\rinoh\resource.py:44: UserWarning: The stylesheet 'sphinx' is also provided by:
* rinohtype
Using the one from 'rinohtype'
  warn("The {} '{}' is also provided by:\n".format(cls.resource_type,
rinohtype 0.5.4 (2022-06-17)
Sphinx 7.0.1
Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)]
Windows-10-10.0.19041-SP0

brechtm commented 1 year ago

Thanks for the detailed bug reports, @jwhitham.

Unfortunately, nowadays I'm unlikely to spend much free time on rinohtype. Since you are using rinohtype in a commercial setting, and assuming it is providing value to your company, there are some options for getting these issues fixed in a timely manner:

Both options would help to keep the project sustainable.

jwhitham commented 1 year ago

Thanks. I'm grateful for your support. I have created a simple pull request for issue 415.

For this issue, I did write a possible fix, but I'm not happy with the code quality. For now I'd rather deal with this problem using the workaround of inserting zero-width spaces, which seems to work fairly well for the documents I have done so far.

I will ask about the possibility of sponsoring your project within the company.

brechtm commented 11 months ago

The current master branch now automatically splits "words" at slashes and also fixes hyphenation of the first word on a line. See for example hyphenation.pdf.

It would be beneficial to handle splitting separately for paths, URLs and regular text, but that requires semantic information.