freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

PDF text conversions are ugly #123

Open freelawbot opened 10 years ago

freelawbot commented 10 years ago

I just found this as an open source fork of the poppler project. It says it does its best to emulate the formatting found on resource.org from a PDF.

If this works well, we should switch to this instead of the pdf2text utility.


freelawbot commented 10 years ago

Issue #85 was marked as a duplicate of this issue.

The only relevant information there was a mention of Crocodoc and FlexPaper.


Original Comment By: Mike Lissner

freelawbot commented 10 years ago

Discovered and investigated Google's version of the same: https://docs.google.com/viewer. It seems to fail when third-party cookies are disabled.

Zoho apparently has a similar product as well, though with funky branding. FlexPaper still has the lead in my mind.


Original Comment By: Mike Lissner

freelawbot commented 10 years ago

I've been doing some investigation of how to show PDFs in a better way, since our current implementation is so very hard to read.

The solution I've found at the moment is to use FlexPaper and the SWFTools PDF-to-SWF converter (pdf2swf) to show the PDFs as a Flash "movie".

I'm not a huge fan of Flash, but neither am I a fan of PDF, and I think this is the best way to allow people to read the PDFs without downloading them to disk.

It might involve having FlexPaper's branding on the site, and it would involve converting EVERY PDF to a SWF file, and storing those in the DB as well. But, it seems to be the way others are doing this.
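A minimal sketch of what converting every PDF could look like, assuming the SWFTools converter is installed and invoked as `pdf2swf INPUT -o OUTPUT`. The function names and the directory layout are hypothetical, and nothing here touches the DB:

```python
import subprocess
from pathlib import Path


def swf_command(pdf_path):
    """Build the pdf2swf command that writes a .swf next to the given PDF."""
    pdf = Path(pdf_path)
    return ["pdf2swf", str(pdf), "-o", str(pdf.with_suffix(".swf"))]


def convert_all(pdf_dir, dry_run=True):
    """Build (and optionally run) one conversion command per PDF under pdf_dir."""
    commands = [swf_command(p) for p in sorted(Path(pdf_dir).rglob("*.pdf"))]
    if not dry_run:
        for command in commands:
            # Assumption: pdf2swf is on the PATH; check=True stops at the first failure.
            subprocess.run(command, check=True)
    return commands
```

With `dry_run=True` (the default) this only returns the commands, which makes it easy to see how many files would be converted before committing to it.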

I would love to hear any considerations or thoughts others have about this, from usability to FOSS, to anything else.


Original Comment By: Mike Lissner

freelawbot commented 10 years ago

Actually, the installation of libpoppler2 isn't necessary. To make this work, you just have to set LD_LIBRARY_PATH='/usr/lib'

And then it seems to work fine.
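If the converter ends up being called from Python, the same setting can be applied to just that one subprocess rather than the whole shell. This is only a sketch: the helper names are hypothetical, and the converter command is left as a parameter because its exact name and arguments aren't given here.

```python
import os
import subprocess


def env_with_library_path(lib_dir="/usr/lib", base_env=None):
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env


def run_converter(command, lib_dir="/usr/lib"):
    """Run a converter command (a list of arguments) with the adjusted environment."""
    return subprocess.run(command, env=env_with_library_path(lib_dir), check=True)
```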

I was able to get it to create a couple of HTML versions of some PDFs, though the quality is pretty poor. I am curious what other processing AltLaw and resource.org do to the PDFs, because this doesn't look as good as their HTML versions.

Sample is attached.


Original Comment By: Mike Lissner

freelawbot commented 10 years ago

Worked on this a bit to see how it would handle volume 545 of the Federal Reports. Used:

git clone http://github.com/stuartsierra/altlaw-pdf.git

Then needed the following dependency: libfontconfig1-dev

Then ran ./configure, make, and make install.

Which will get you halfway. The second half is to add this to sources.list: deb http://security.ubuntu.com/ubuntu hardy-security main

Then run apt-get install libpoppler2 (from 2007).

Then it seems to work.
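For anyone repeating this, the steps above can be sketched as a small script. Assumptions: the configure script sits at the top of the cloned repository, the apt-get steps are run as root, and the sources.list line has already been added by hand (followed by an apt-get update) before the last step. `BUILD_STEPS` and `run_steps` are names made up for this sketch.

```python
import subprocess

REPO_URL = "http://github.com/stuartsierra/altlaw-pdf.git"
SOURCE_DIR = "altlaw-pdf"  # directory name git clone derives from the URL

# (command, working directory) pairs, in the order described above.
BUILD_STEPS = [
    (["apt-get", "install", "libfontconfig1-dev"], None),
    (["git", "clone", REPO_URL], None),
    (["./configure"], SOURCE_DIR),  # assumption: configure is at the repo root
    (["make"], SOURCE_DIR),
    (["make", "install"], SOURCE_DIR),
    # Assumption: the hardy-security line is already in sources.list at this point.
    (["apt-get", "install", "libpoppler2"], None),
]


def run_steps(steps, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    seen = []
    for command, cwd in steps:
        prefix = "[%s] " % cwd if cwd else ""
        print(prefix + " ".join(command))
        if not dry_run:
            subprocess.run(command, cwd=cwd, check=True)
        seen.append(command)
    return seen
```

Calling `run_steps(BUILD_STEPS)` only prints the commands; pass `dry_run=False` to actually run them.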


Original Comment By: Mike Lissner

freelawbot commented 10 years ago

Link would help: http://github.com/stuartsierra/altlaw-pdf


Original Comment By: Mike Lissner