FinalsClub / karmaworld

KarmaNotes.org v3.0
GNU Affero General Public License v3.0

BeautifulSoup makes pdf2html ugly. #290

Closed btbonval closed 10 years ago

btbonval commented 10 years ago

When BeautifulSoup does the same filtering that lxml used to do (replacing anchor tags), it breaks the HTML rendering.

soup.prettify() renders its internal tree back to HTML, but the result seems to differ enough from the input to completely mess up the appearance of pdf2html-rendered HTML.

It'd be fantastic to filter all HTML, no matter its source, through these filters.

Many pandas were sad about this.

btbonval commented 10 years ago

Here are the files before BeautifulSoup (14_motor1pdf.html) and after BeautifulSoup (bs14.html); the number is the file size in bytes:

-rw------- 1 bryan bryan 15832814 Jan 17 14:42 14_motor1pdf.html
-rw-r--r-- 1 bryan bryan 15848496 Jan 17 14:42 bs14.html

When BS tried to write the output of prettify(), there were some errors about ASCII encoding. This is likely because the embedded graphics put non-ASCII characters into the HTML. In that case BS returns a unicode string, which raises an error on implicit conversion to a str string. If prettify('utf-8') is specified instead, the output is a str string that is already properly encoded.
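The ASCII failure described above can be reproduced without BeautifulSoup at all. This is a minimal sketch, assuming Python 3 (where the text type is `str` and the encoded type is `bytes`, unlike the Python 2 `unicode`/`str` split in this thread). The sample string and the `try_encode` helper are made up for illustration.

```python
# A short text string with one non-ASCII character, standing in for
# the text that prettify() returns for the pdf2html document.
html_text = "<p>caf\u00e9</p>"


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# ASCII cannot represent U+00E9, so this conversion fails.
ascii_bytes = try_encode(html_text, "ascii")

# UTF-8 can represent every code point, so this succeeds.
utf8_bytes = try_encode(html_text, "utf-8")

print(ascii_bytes)
print(utf8_bytes)
```

Asking for the encoding explicitly, as prettify('utf-8') does, avoids the implicit ASCII conversion entirely.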

I suspect these encoding problems might be the cause of file size changes and ugliness.

Maybe there's a way to output BS trees without calling prettify?

btbonval commented 10 years ago

http://stackoverflow.com/questions/7219361/python-and-beautifulsoup-encoding-issues
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output

The pdf2html output is clearly marked as UTF-8 in its metadata. However, it is possible that non-UTF-8 bytes are mixed in. For that very specific case, there's a way to deal with the document: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#inconsistent-encodings
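For reference, the approach on that documentation page looks roughly like the following. It is a sketch assuming Python 3 and bs4, and it uses a made-up byte string that mixes UTF-8 with Windows-1252 rather than the actual pdf2html output. Whether the real file has this problem is unconfirmed.

```python
from bs4 import UnicodeDammit

# Build a byte string with two encodings mixed together:
# UTF-8 snowmen followed by Windows-1252 curly quotes.
snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"
mixed = snowmen.encode("utf-8") + quote.encode("windows-1252")

# Decoding the mixed bytes directly as UTF-8 fails on the Windows-1252 part.
try:
    mixed.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# detwingle() rewrites the embedded Windows-1252 bytes as UTF-8,
# so the whole byte string becomes valid UTF-8.
fixed = UnicodeDammit.detwingle(mixed)
text = fixed.decode("utf-8")
print(decodes_cleanly, text)
```

detwingle() only handles Windows-1252 embedded in UTF-8 (or the reverse), so it would need to run on the raw bytes before they are handed to BeautifulSoup.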

btbonval commented 10 years ago

Writing str(soup) instead produces a file that is much smaller than the soup.prettify() output and almost identical in size to the original (13 more bytes).

-rw------- 1 bryan bryan 15832814 Jan 17 14:42 14_motor1pdf.html
-rw-r--r-- 1 bryan bryan 15848496 Jan 17 14:42 bs14.html
-rw-r--r-- 1 bryan bryan 15832827 Jan 17 23:57 bs14str.html

It also looks correct by visual inspection. The ugliness is almost certainly caused by the nice spacing BeautifulSoup adds in prettify(), which somehow makes its way inside tags and disrupts them. Leaving that spacing out helps a great deal.
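The difference between the two output paths shows up even on a tiny document. This is a minimal sketch, assuming Python 3 with bs4 and the built-in html.parser backend. The sample markup is invented for illustration and is far simpler than real pdf2html output.

```python
from bs4 import BeautifulSoup

# Two adjacent inline elements with no whitespace between them, similar
# in spirit to how tightly positioned text fragments might be emitted.
html = '<div class="t"><span>Hel</span><span>lo</span></div>'
soup = BeautifulSoup(html, "html.parser")

# str(soup) serializes the tree without adding any whitespace.
compact = str(soup)

# prettify() puts tags and strings on their own indented lines, which
# introduces whitespace inside and between the elements.
pretty = soup.prettify()

# soup.encode() gives encoded bytes directly, still without pretty-printing.
compact_bytes = soup.encode("utf-8")

print(len(compact), len(pretty))
print(pretty)
```

In a browser, whitespace between inline elements can render as a visible gap, which is one plausible way the extra spacing disturbs the layout. Writing `compact_bytes` to a file opened in binary mode also sidesteps the ASCII conversion problem.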