Kozea / WeasyPrint

The awesome document factory
https://weasyprint.org
BSD 3-Clause "New" or "Revised" License
7.09k stars, 674 forks

Text repeated after some line breaks #2016

Closed MortalWombat-repo closed 9 months ago

MortalWombat-repo commented 9 months ago

Hello.

I must thank everyone involved, as what was supposed to be a simple project turned out to be a nightmare. I tried most HTML-to-PDF libraries, and so far this one seems to be the best maintained and the one that stays closest to conventional styling, i.e. what I would get by using CTRL + P on a webpage.

The file I'm using is parsed Wikipedia HTML obtained with the wikipedia package. I named it wiki_page.html. I used a local HTML file because I originally used pypandoc to convert HTML to EPUB, but ran into problems with pdflatex and later with wkhtmltopdf from pdfkit. I decided I wouldn't use pypandoc for PDF, as I am building a command-line script and I can't expect users to download even MiKTeX, let alone anything else.

I am attaching images of the repeated text and of the cut-off table, the HTML file, and the PDF (Grand Rapids, Michigan.pdf). I suspect the HTML file has syntax errors, but I am not that well acquainted with the language.

Please let me know if I need to improve this report, and thank you for reading this issue.

https://www.mediafire.com/file/i935dpvn8t3lhlc/wiki_page.html/file

liZe commented 9 months ago

Hi, and thanks for the report!

The problem about tables is not a technical bug in WeasyPrint: browsers don’t resize tables automatically. The table’s content doesn’t fit in the page width, and so the table is larger than the page. In a browser you can have a scrolling bar, but in a PDF you can’t. The usual solution for this is to use smaller font sizes and paddings for tables, for printed media. You can also put tables on pages that have a different size (e.g. landscape).

The bug about repeating text is a real bug we can track in this issue.

MortalWombat-repo commented 9 months ago

> Hi, and thanks for the report!
>
> The problem about tables is not a technical bug in WeasyPrint: browsers don’t resize tables automatically. The table’s content doesn’t fit in the page width, and so the table is larger than the page. In a browser you can have a scrolling bar, but in a PDF you can’t. The usual solution for this is to use smaller font sizes and paddings for tables, for printed media. You can also put tables on pages that have a different size (e.g. landscape).
>
> The bug about repeating text is a real bug we can track in this issue.

Hi, thank you for replying. :)

I have gone through many different libraries and this one is by far the easiest. Right now I am running a solution with Beautiful Soup and Tidy, but it is very messy and not nearly as good.

Can you suggest some options for me to try to fit the page content and not have repeated text? I have gone through various configurations of A4, Letter, margin sizes and width: 100%. I think I cannot change A4, as its size is fixed. How would I use the flags to get the highest div with text and, for tables specifically, the highest div with a table?

Is there a way for me to detect in WeasyPrint when content is too big and then configure it to landscape, but just for the page that goes over? I don't think I could do it in Beautiful Soup or with Flask and Jinja, as I struggle with those as it is.

I know the best solution in my case is to use a headless instance, but I already have so many dependencies that I could not ask someone to go through configuring that too.

I've gone through this repo, and it seems there is no flexbox support? I guess that would be hard to implement without WebKit.

It is also worth mentioning that the render was done from the pip wikipedia package, using the html method on the wikipedia object. When I write it to a file I get a lot of CSS errors: "{ expected". I also get those when I use Beautiful Soup, which is why I tried to clean it up with Tidy. I guess that comes from Wikipedia's side, as it gets a lot of edits that aren't stylistically similar, but HTML renders even with syntax errors. Could that be the reason, as I found out that wkhtmltopdf was very opinionated and refused to work without absolute paths everywhere or with syntax errors?

I have been reading this repo for a few days, and I saw that many people had issues with td not being broken and similar syntax errors.

This repo has been like the light at the end of the tunnel, as I have been struggling for a whole week just to get PDF to work. Interestingly, EPUB in pypandoc doesn't complain, but pdflatex doesn't work, as wkhtmltopdf is deprecated.

Sorry if I was a bit verbose; I got my question rejected on Stack Overflow as it wasn't descriptive enough.

MortalWombat-repo commented 9 months ago

> Hi, and thanks for the report!
>
> The problem about tables is not a technical bug in WeasyPrint: browsers don’t resize tables automatically. The table’s content doesn’t fit in the page width, and so the table is larger than the page. In a browser you can have a scrolling bar, but in a PDF you can’t. The usual solution for this is to use smaller font sizes and paddings for tables, for printed media. You can also put tables on pages that have a different size (e.g. landscape).
>
> The bug about repeating text is a real bug we can track in this issue.

Sorry I read this again. I already have so much documentation and implementations in my head from this past week that it is hard to focus.

Do you have any thoughts about the repeating text? Could it be because I get a lot of CSS errors of "} expected" when I request the site?

When I validate it, it doesn't seem like anything should be repeating the text: https://validator.w3.org/nu/?doc=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FGrand_Rapids%2C_Michigan

MortalWombat-repo commented 9 months ago

I hit a problem, as I couldn't test this again. Every time I tried to reproduce it I got a page with edit hyperlinks from the HTML, and then I figured out that I had used the get from URL in that example.

When I get it through the URL, the result is a beautiful text rendering, although images aren't respected and are broken in the middle.

But I noticed that if the page breaks on the header, the header will repeat along with any text.

Can I somehow solve this or is this a Pango thing?

I don't understand why it renders an HTML file differently compared to a URL. Are different libraries being used?

From URL: [image]
From file: [image] [image]

Don't worry about the characters being misrepresented, I fixed it in Beautiful Soup:

content = r.content.decode('utf-8', 'ignore')
content = content.encode("ascii", "ignore")
content = content.decode()

This is a hackish solution and it took me a while to find out.

liZe commented 9 months ago

> I have gone through many different libraries and this one is by far the easiest.

Good to know!

> Can you suggest some options for me to try to fit the page content

There’s no magical solution. Using smaller font sizes and reducing paddings for tables is often enough.

> and not have repeated text?

Repeated text is a bug in WeasyPrint and will be fixed really soon: the fix is already written, I’ll just add some tests to avoid any regressions, and commit everything.

> Is there a way for me to detect in WeasyPrint when content is too big and then configure it to landscape, but just for the page that goes over?

There’s no way to detect the size in CSS, but you can for example use the number of columns to set landscape pages, with something like:

@page big-table {
  size: A4 landscape;
}
table:has(td:nth-child(6)) {
  page: big-table;
}

(Not tested for real, but you get the idea.)
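
One way to apply the rule above from Python is sketched below. `wide_table_css` and `render_pdf` are hypothetical helper names, the threshold of six columns is only the value from the snippet above, and the sketch assumes that the installed WeasyPrint version accepts the `:has()` selector. Like the CSS above, it has not been tested on real Wikipedia pages.

```python
def wide_table_css(min_columns=6, page_size="A4"):
    """Build CSS that sends tables with at least `min_columns` cells in a row to a landscape page."""
    return (
        f"@page big-table {{ size: {page_size} landscape; }}\n"
        f"table:has(td:nth-child({min_columns})) {{ page: big-table; }}\n"
    )


def render_pdf(source_url, target, min_columns=6):
    """Render `source_url` to `target` with the extra stylesheet (needs WeasyPrint installed)."""
    # Imported lazily so that wide_table_css() can be used and tested without WeasyPrint.
    from weasyprint import CSS, HTML

    extra = CSS(string=wide_table_css(min_columns))
    HTML(url=source_url).write_pdf(target, stylesheets=[extra])
```

Keeping the threshold as a parameter makes it easy to experiment, since the number of columns is only a rough proxy for the real width of a table.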

There are other possible hacks, but it goes further than what we should talk about in a bug report!

> I've gone through this repo, and it seems there is no flexbox support?

Flexbox is supported (but its support is far from perfect) in flex.py.

> Could that be the reason, as I found out that wkhtmltopdf was very opinionated and refused to work without absolute paths everywhere or with syntax errors?

It’s common to find non-valid HTML in Wikipedia, but browsers are often designed to do anything they can to render something, even with these errors.

> This repo has been like the light at the end of the tunnel, as I have been struggling for a whole week just to get PDF to work. Interestingly, EPUB in pypandoc doesn't complain, but pdflatex doesn't work, as wkhtmltopdf is deprecated.

I hope that you’ll finally get the rendering you want!!

> This is a hackish solution and it took me a while to find out.

Your HTML files are probably encoded with UTF-8 and you can configure WeasyPrint to use it as well with the encoding parameter. You don’t have the problem using HTTP because HTTP gives WeasyPrint the encoding to use in its headers.
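
To illustrate the encoding point, here is a minimal sketch. The sample text and the helper name `render_local_file` are made up for the example; `encoding` is the parameter mentioned above, and the sketch assumes the local file really is UTF-8.

```python
sample = "caf\u00e9"                  # "café", a stand-in for non-ASCII text in a Wikipedia page
raw = sample.encode("utf-8")          # the bytes as stored in a UTF-8 HTML file

as_utf8 = raw.decode("utf-8")         # correct decoding keeps the accent
as_latin1 = raw.decode("latin-1")     # a wrong guess turns the accent into two stray characters
# The workaround quoted above avoids garbled output only by dropping the character.
ascii_only = raw.decode("utf-8", "ignore").encode("ascii", "ignore").decode()


def render_local_file(path, target):
    """Render a local UTF-8 HTML file, telling WeasyPrint the encoding explicitly."""
    # Imported lazily so the demonstration above runs without WeasyPrint installed.
    from weasyprint import HTML

    HTML(filename=path, encoding="utf-8").write_pdf(target)
```

With the encoding given explicitly, the ASCII round trip should no longer be needed, so accented characters can stay in the output.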

MortalWombat-repo commented 9 months ago

Thank you very much, you really answered everything I need. Very glad that this repo has enthusiastic maintainers.

I don't want to sound too bold, but I got the impression that PDF is, if not evil, then downright menacing and cruel. :) It makes sense, I guess: you have a blank canvas, you do the best you can based on guesses, and then you compress it.

I gave Prince XML a whirl in the meantime and was surprised that even it rendered the page 1:1 based on the HTML and not the styling.

I am very confused but intrigued. I spent a week on something that should have taken a couple of hours at most, since I get annoyed if I don't understand something.

Why is the output so different from a URL and from an HTML file? It doesn't seem like you are using a headless instance. It seems all of your work is coming from a Flask custom URL fetcher. Sorry for such quick observations, but the parsing from URL is excellent and I can't figure out why.

I would love to be able to plug in a file and get a similar result.

And may I also ask for some clarification on "smaller font sizes and reducing paddings for tables"? I hope the following works; feel free to suggest modifications. I would then use that on all of the pages. I don't care about correct placement anymore, just that it doesn't break anymore, so that I can put it away.

from weasyprint import CSS, HTML

# This is the version that raises the TypeError reported below.
with open('styles.css', 'w') as f:
    f.write('@page {size: A4 ;} *{font-size: 90%;padding: 5%;}')
css = CSS(filename='styles.css')
HTML(f'{wiki_page.url}').write_pdf('output.pdf', stylesheets=[css])

This gives the error TypeError: can't multiply sequence by non-int of type 'float'. I guess I can't use * to get every element, or am I using it wrong?

I look forward to your commit. Thank you again.

MortalWombat-repo commented 9 months ago

t.py:21 in percentage                                                                            │
│                                                                                                  │
│    18 │   │   return value.value                                                                 │
│    19 │   else:                                                                                  │
│    20 │   │   assert value.unit == '%'                                                           │
│ ❱  21 │   │   return refer_to * value.value / 100                                                │
│    22                                                                                            │
│    23                                                                                            │
│    24 def resolve_one_percentage(box, property_name, refer_to,                                   │
│                                                                                                  │
│ ╭───────────────── locals ──────────────────╮                                                    │
│ │ refer_to = 'auto'                         │                                                    │
│ │    value = Dimension(value=5.0, unit='%') │                                                    │
│ ╰───────────────────────────────────────────╯                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: can't multiply sequence by non-int of type 'float'

liZe commented 9 months ago

> Why is the output so different from a URL and from an HTML file?

The stylesheets are probably different for some reason. Or some resources are broken because paths don’t work with files (and only work behind an HTTP server). Or… It’s difficult to know what’s going on, you have to carefully check the differences in HTML and in CSS to know what’s going on.
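
One of the possible causes mentioned above, paths that only work behind an HTTP server, can be illustrated with the standard library. This is only a sketch under assumptions: the local file path and the two links are made-up examples of what a saved Wikipedia page might contain, and `render_saved_page` is a hypothetical helper that passes WeasyPrint's `base_url` argument so that relative links are resolved against the original site instead of the local file.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Grand_Rapids,_Michigan"
FILE_URL = "file:///home/user/wiki_page.html"   # hypothetical location of the local copy

# Hypothetical links as they might appear in the saved HTML.
stylesheet = "/w/load.php"
image = "//upload.wikimedia.org/example.png"

# Resolved against the original page, both links point to HTTP servers.
http_stylesheet = urljoin(PAGE_URL, stylesheet)
http_image = urljoin(PAGE_URL, image)

# Resolved against the local file, they become file: URLs that no longer reach those servers.
file_stylesheet = urljoin(FILE_URL, stylesheet)
file_image = urljoin(FILE_URL, image)


def render_saved_page(path, target, base_url=PAGE_URL):
    """Render a saved page while resolving its relative links against `base_url`."""
    # Imported lazily so the URL demonstration above runs without WeasyPrint installed.
    from weasyprint import HTML

    HTML(filename=path, base_url=base_url, encoding="utf-8").write_pdf(target)
```

If the two renderings still differ with a matching base URL, the difference is more likely in the HTML or CSS itself, which has to be compared by hand.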

> And may I also ask for some clarification on "smaller font sizes and reducing paddings for tables"?

Something like

table { font-size: 0.8em }
th, td { padding: 0.1em }

But again, it won’t work in all cases (and it’s common in Wikipedia to find very large tables that don’t even fit in a browser window.)

> This gives the error TypeError: can't multiply sequence by non-int of type 'float'.

It may be a problem caused by repeating text (I had an equivalent bug before the fix.) If you still have a crash after the fix, please open a new issue.

> I guess I can't use * to get every element, or am I using it wrong?

It’s technically OK, but adding a padding on everything is a bit strange, and changing the font size on html or body is enough (as it’s inherited.)
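
Put together, the advice above could look like the following sketch. The values are only starting points, `PRINT_CSS` and `render_with_compact_tables` are names invented for the example, and very wide tables may still overflow the page.

```python
PRINT_CSS = """
@page { size: A4; }
html { font-size: 90%; }        /* inherited by every element, so no universal selector is needed */
table { font-size: 0.8em; }     /* relative to the inherited size */
th, td { padding: 0.1em; }      /* padding only on table cells */
"""


def render_with_compact_tables(url, target):
    """Render `url` to `target` with the compact table stylesheet (needs WeasyPrint installed)."""
    # Imported lazily so the stylesheet string can be inspected without WeasyPrint.
    from weasyprint import CSS, HTML

    HTML(url=url).write_pdf(target, stylesheets=[CSS(string=PRINT_CSS)])
```

Passing the stylesheet as a string also avoids writing a temporary styles.css file.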

liZe commented 9 months ago

The bug should now be fixed on the main branch, feedback is welcome!

MortalWombat-repo commented 9 months ago

At first I did not see a difference, and then I realized I had forgotten that I needed to clone main. Writing this for those who have the same problem. Paying it forward and backwards. Cloning the main branch:

git clone https://github.com/Kozea/WeasyPrint.git
cd WeasyPrint
pip install .

After that, no matter what I do, there is no repeated text. Outstanding work! To be safe I used

css = CSS(string='@page {size: A4 ;} html, body {font-size: 90%;} table { font-size: 0.8em } th, td { padding: 0.1em }')

For the html, body font size I tried 100%, 50% and 90%. For the cell padding I started from 0.1 and also tried 0.01 and 1px, to get as close as possible to 1px or to the smallest padding relative to the table.

Some tables are indeed too big to do without landscape. There is a minor issue of a line across elements, but I'm sure that is not an issue since this project aims to render as it sees, not through WebKit, and other users will not have as many tables and paddings as in this particular case, which tries to print articles where p and spans are intersected.

image

Really looking forward to seeing further development of flexbox. I managed to finally understand a bit of Prince documentation (although I have problems with orphans and widows) and I am really intrigued by their solution. They have no dependencies and manage to make things work only through Rust and Mercury.

I originally thought that they ran an ML algorithm and had their engineers write low-level catch-alls, but, procrastinating as I do when I hit a wall, I later found out that Mercury is a Prolog alternative engineered at an Australian university, from where the Prince developers started their business.

I am writing this to point out (although I'm sure someone with your experience very well came to the same conclusion long ago) that flexbox may be a very difficult topic to write by yourself.

Maybe languages like Prolog are the most sane alternative where the code is compiled based on constraints that are given. I'm sure you would have great success if some logic programming maintainers came onboard.

Thank you so much for your help and patience.

liZe commented 9 months ago

> There is a minor issue of a line across elements, but I'm sure that is not an issue since this project aims to render as it sees, not through WebKit, and other users will not have as many tables and paddings as in this particular case, which tries to print articles where p and spans are intersected.

Even if it’s not based on WebKit, it’s based on the same specifications! Maybe there’s a bug.

> I managed to finally understand a bit of Prince documentation (although I have problems with orphans and widows) and I am really intrigued by their solution. They have no dependencies and manage to make things work only through Rust and Mercury.

Prince is a great piece of software. With Håkon in the team, they know CSS quite well. 😄

> I am writing this to point out (although I'm sure someone with your experience very well came to the same conclusion long ago) that flexbox may be a very difficult topic to write by yourself.

That’s actually not that difficult, because the specification is recent and really well written. Some parts about pagination were missing when the first drafts were written, but it’s now pretty complete. Just as for grid, we think that it wouldn’t take more than a week or two, full time, to get something solid.

The limiting factor is dedicated time (i.e. sponsors), not technical difficulties … at least for this feature.

Compared to flex, good old tables are a real nightmare, because they’ve been implemented in browsers before CSS was even born!

> Maybe languages like Prolog are the most sane alternative where the code is compiled based on constraints that are given. I'm sure you would have great success if some logic programming maintainers came onboard.

Well, 1 million downloads per month is already quite a great success for us! 😄

But yes, logic (pun intended) programming is theoretically interesting for web rendering. I suppose that browsers don’t use it mainly because it’s harder to find Prolog/Mercury developers than C++ developers!

> Thank you so much for your help and patience.

Have fun with WeasyPrint 💜

liZe commented 9 months ago

> Maybe there’s a bug.

There’s a bug: #2019

MortalWombat-repo commented 9 months ago

> Maybe there’s a bug.
>
> There’s a bug: #2019

Thank you, looking forward to trying this when it's done.

Right now I'm calling my render good enough and will patiently wait for SO to downvote and close my topic (again) because it was not descriptive enough.

That is because I can't provide a code cell to run if I can't install a dependency on the platform (in this case Prince), unless they expect me to dockerize the thing and host it for their own viewing pleasure.

As ruthless a force of nature as PDF is, I am a bit of a masochist too, as I couldn't help myself from saying: hey, maybe that h3 heading should not be at the bottom of the page (on Wikipedia, h2 is the title), and maybe I should have no more than 10 lines of widows and orphans. I tried to put it behind me, but Prince's documentation drove the point home along the lines of "this and that is not pretty, this is even worse, etc.", and now it really irks me.

When that inevitably fails I hope that this is an easy fix and I can use your solution in the future. So far I didn't test that many pages, but I feel like I had no widows or orphans and headings were broken inside. I don't know if that was by chance or you too implemented a similar solution, one that can hopefully be tweaked too?

Right now I am blatantly crossing the line and I understand if you don't reply because you went above and beyond and closed the issue a few messages ago. I just didn't want to send unsolicited PMs.

As far as I figure your solution is mostly for helping people cut costs for their company or starting a business and is a great way (as in people would expect it to cost money great) to generate receipts and invoices. So I really appreciate you helping me out. This is what FOSS is all about.

liZe commented 9 months ago

> As I couldn't help myself from saying: hey, maybe that h3 heading should not be at the bottom of the page (on Wikipedia, h2 is the title),

That’s something you can do with break-after: avoid.
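
As a small illustration of the property named above, the sketch below builds such a rule for a chosen set of heading levels. `heading_break_css` is a hypothetical helper, and the choice of h2 and h3 is only an assumption based on the Wikipedia headings discussed here.

```python
def heading_break_css(levels=(2, 3)):
    """Build a rule asking the renderer to avoid a page break right after the given heading levels."""
    selector = ", ".join(f"h{level}" for level in levels)
    return f"{selector} {{ break-after: avoid; }}"


# Rule for the default levels, ready to be passed to WeasyPrint as CSS(string=rule).
rule = heading_break_css()
```

The resulting string can be added to the `stylesheets` list of `write_pdf` in the same way as the stylesheets used earlier in this thread.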

> I don't know if that was by chance or you too implemented a similar solution, one that can hopefully be tweaked too?

WeasyPrint supports a lot of paged media features, including of course widows and orphans (whose default values are 2 in WeasyPrint.) So, that’s not by chance! You can of course change these values with CSS.
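
Changing these values could look like the following sketch. `widows_orphans_css` is a made-up helper name and the value 3 is an arbitrary example; only the `widows` and `orphans` properties themselves come from the text above.

```python
def widows_orphans_css(orphans=3, widows=3, selector="p"):
    """Build a rule setting the minimum number of lines kept at the bottom and top of a page."""
    if orphans < 1 or widows < 1:
        raise ValueError("orphans and widows must be positive integers")
    return f"{selector} {{ orphans: {orphans}; widows: {widows}; }}"


# Rule with the example values, ready to be passed to WeasyPrint as CSS(string=rule).
rule = widows_orphans_css()
```

Larger values keep more lines together, at the cost of more white space at the bottom of some pages.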

> When that inevitably fails I hope that this is an easy fix and I can use your solution in the future.

Widows, orphans and page breaks, but also footnotes, page margins, page counters, running elements, leaders, cloned box decorations… There are many, many more features related to pagination supported by WeasyPrint. A simple way to learn more about some of them is to read CourtBouillon’s blog, particularly "CSS tricks" entries.

> Right now I am blatantly crossing the line and I understand if you don't reply because you went above and beyond and closed the issue a few messages ago. I just didn't want to send unsolicited PMs.

There’s a Matrix channel for longer discussions! 😄

> As far as I figure your solution is mostly for helping people cut costs for their company or starting a business and is a great way (as in people would expect it to cost money great) to generate receipts and invoices. So I really appreciate you helping me out. This is what FOSS is all about.

Providing software (and sometimes advice 😁) for free is an important part of our vision of FOSS, but that’s only the tip of the iceberg: what we want to do with WeasyPrint goes beyond this. We build a powerful PDF generator based on open standards we want to defend and develop. We work hard to keep the code simple and maintainable, so that everyone can learn and contribute. We find great partners (💜) that we help to grow, and they help us back financially and morally to provide a better tool for them, and for everyone.

MortalWombat-repo commented 9 months ago

> That’s something you can do with break-after: avoid.

This is what I tried to do, and I later found out that many breaks are not even supported by a standard, and that I should focus on using the break inside. I am really struggling with that; I read as much as I could find, and at this point I'm basically trying different ways and seeing if anything sticks.

As you can see, I'm not that well versed in web technologies, and I am trying to find a community that could help me work out whether I should clean up my HTML or whether I am using the rule wrong.

I hope that is OK to ask in the Matrix channel?

> WeasyPrint supports a lot of paged media features, including of course widows and orphans (whose default values are 2 in WeasyPrint.) So, that’s not by chance! You can of course change these values with CSS.

> Providing software (and sometimes advice 😁) for free is an important part of our vision of FOSS, but that’s only the tip of the iceberg: what we want to do with WeasyPrint goes beyond this. We build a powerful PDF generator based on open standards we want to defend and develop. We work hard to keep the code simple and maintainable, so that everyone can learn and contribute. We find great partners (💜) that we help to grow, and they help us back financially and morally to provide a better tool for them, and for everyone.

Your passion is contagious; hopefully one day I too can find my special way of helping others.