Bill detail: Preserve the formatting of legislative text

reginafcompton commented 6 years ago

The legislative text on the Bill detail does not preserve the tabs and underlines given in Legistar.

Example: https://laws.council.nyc.gov/legislation/int-1633-2017/ http://legistar.council.nyc.gov/LegislationDetail.aspx?ID=3066702&GUID=FF2098E2-9EC9-47AB-9131-B02FDEA71AA6

Why? We parse the "plain text" from the Legistar API, which does not contain such formatting. (See Councilmatic filter.)

Solutions? (1) Find a way to translate the rtf_text into HTML on the Councilmatic side. (2) Revert to scraping the full_text from the web, as we did in the past: https://github.com/opencivicdata/scrapers-us-municipal/blob/12a6309c3148b83729a1f2420b9ee1bd1bd633e2/nyc/bills.py#L127

I'd prioritize option number one, if it's possible.

reginafcompton commented 6 years ago

I explored options for converting the RTF string into HTML. I could not find a Python package that does the job. However, I did identify a couple options, which involve doing the conversion on the client-side (jumping off point):

(1) rtf.js, though I need to explore if we can use a string-version RTF as a starting point for calling displayRTFFile: https://github.com/tbluemel/rtf.js/blob/master/samples/rtf.html#L148

(2) a couple good-looking node packages: https://github.com/iarna/rtf-to-html and https://github.com/walling/node-unrtf. NOTE - we'll need to find a way to coordinate node_modules within a Django app. I do not believe we have other Django apps within our DataMade repertoire that do this (it might be worth learning about!), so that's a consideration.

(3) Write a parser ourselves, as this Stack Overflow user does for RTF to plain text: https://stackoverflow.com/questions/1337446/is-there-a-python-module-for-converting-rtf-to-plain-text

fgregg commented 6 years ago

I would suggest using unrtf, called via Python's subprocess: https://docs.python.org/2/library/subprocess.html

I would suggest doing this as part of the import process
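
A minimal sketch of the suggestion above: shell out to a command-line converter during import and capture its HTML. The function name `rtf_to_html` and its `command` parameter are hypothetical, not part of django-councilmatic. It assumes the converter accepts a file path as its last argument and writes HTML to stdout, which is how `unrtf --html` behaves.

```python
import subprocess
import tempfile


def rtf_to_html(rtf_text, command=("unrtf", "--html")):
    """Convert an RTF string to HTML by calling an external converter.

    Assumption: the converter takes the input path as its final argument
    and prints the converted document to stdout.
    """
    # Write the RTF string to a temporary file the converter can read.
    with tempfile.NamedTemporaryFile("w", suffix=".rtf") as rtf_file:
        rtf_file.write(rtf_text)
        rtf_file.flush()
        # check_output raises CalledProcessError on a non-zero exit status.
        output = subprocess.check_output(list(command) + [rtf_file.name])
    return output.decode("utf-8", errors="replace")
```

Keeping the command as a parameter makes it easy to swap in a different converter later without touching the import code.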

reginafcompton commented 6 years ago

thanks @fgregg - unrtf is pretty nifty (and easy to install and get started!). I also like the idea of doing the conversion on the import, rather than the view or client side.

reginafcompton commented 6 years ago

I have an initial solution for converting rtf to html in import_data: https://github.com/datamade/django-councilmatic/pull/165.

However, the resultant HTML is corrupt, i.e., it contains extraneous </div> tags that disrupt the formatting. (See screenshot below for attempt to capture formatting issue.) Question to address: is the corruption in the rtf or the result of the unrtf conversion?

It seems that we could build a custom filter (see below), but it can be difficult to predict when and where the extraneous tags occur. Another question: is there a Python parser that can facilitate the transformation of invalid HTML into valid HTML?

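import re

# Hypothetical function name; the snippet as posted showed only the body.
def clean_converted_html(text):
    # Collapse each run of repeated newlines or repeated spaces into a single space,
    # so the HTML becomes one searchable line.
    no_multiline = re.sub(r"([\n ])\1*", ' ', text)

    # Remove extra </div> tags and the Legistar "..Title" / "..Body" markers.
    return no_multiline.replace('</div> <br> </div>', '</div> <br>').replace('..Title', '').replace('..Body', '')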

With this said, the converted HTML does not seem to preserve the exact indentation given in the Legistar bill text, and preserving it is a requirement of this issue.
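
On the question above about a Python parser for invalid HTML: one possible approach, sketched here with only the standard library's `html.parser`, is to re-emit the markup while dropping any closing tag that has no matching open tag. The class and function names are hypothetical, and this handles only the "extraneous closing tag" case described in this comment. It is not a general HTML repair tool, and it drops doctype declarations.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they are not tracked as "open".
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class UnmatchedCloseTagStripper(HTMLParser):
    """Re-emit HTML, skipping closing tags with no open counterpart."""

    def __init__(self):
        # Keep entity and character references as written in the source.
        super().__init__(convert_charrefs=False)
        self.open_tags = []
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>; nothing to track.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Forget the most recent open occurrence of this tag.
            last = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[last]
            self.parts.append("</%s>" % tag)
        # Otherwise the closing tag is unmatched and is dropped.

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)


def strip_unmatched_close_tags(html):
    parser = UnmatchedCloseTagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Unlike the string replacement above, this does not depend on predicting the exact whitespace around the stray tags.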

fgregg commented 6 years ago

bummer.

Keeping it as html has significant seo and usability benefits, so let's keep pushing on this a little bit more.

I would suggest looking at pandoc or another rtf-html converter as well.

fgregg commented 6 years ago

The fallback is always to display the pdf.

reginafcompton commented 6 years ago

Right, but the legislation text report does not reside in the Legistar API. (For example, it would be here, right? https://webapi.legistar.com/v1/nyc/matters/50755/attachments?token=.....) We'd need to refashion the scraper to grab the PDF link from the web interface....or scrape the html itself, as we did in the past.

Also, I'd like to determine if the rtf is corrupt, before looking at another rtf-converter.

Bummer indeed.

reginafcompton commented 6 years ago

I checked some bills, and the RTF from Legistar seems fine: I managed to convert a couple RTF instances to valid HTML with an online tool.

However, I noticed that some of the scraped bills in the OCD API do not contain valid RTF text (but rather plain text or some variant of the plain_text)....this seems to come as a result of not having RTF in Legistar: https://ocd.datamade.us/ocd-bill/1ff42bd8-2919-4e2d-bb1d-adbe5d36689d/ https://ocd.datamade.us/ocd-bill/267a7123-0027-4ad3-b28f-c4b7ff866a1f/ https://ocd.datamade.us/ocd-bill/283ae440-3f91-4f84-8263-7c9ac7dcd4a2/

For converters...if we want to keep pushing on this:

Pandoc does not support RTF to HTML, given what I can see in their documentation...but maybe we can use Ted and Pandoc together? https://stackoverflow.com/questions/30448176/how-to-convert-rtf-to-markdown-on-the-unix-osx-command-line-similar-to-pandoc
This tool gets us close, but it does not correctly convert underlines, nor does it support special characters (e.g., we would lose symbols like "§"): https://github.com/lvu/rtf2html
LibreOffice, as we did with the Metro merger

With that said, I am happy to try out the Ted + Pandoc combo, but I am drawn towards scraping the PDF link from the web. That seems like a relatively more reliable option. (Plus, the PDFs look nice - no worries about missed formatting.) @fgregg - what do you think?

reginafcompton commented 6 years ago

We have a working solution for this in an open PR...also on the Councilmatic side. Currently testing it on the staging site.

reginafcompton commented 6 years ago

We are running the full conversion script on the production server (after determining that we needed: a new version of LibreOffice, and code suitable for Python 3.4.3).
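
For readers unfamiliar with LibreOffice as a converter, a rough sketch of how a script might call it in headless mode. This is an assumption about the general approach, not the code in the PR. The function names and the `soffice` binary name are illustrative; the `--headless`, `--convert-to`, and `--outdir` options are standard LibreOffice command-line flags.

```python
import os
import subprocess


def build_convert_command(rtf_path, out_dir, binary="soffice"):
    """Build the LibreOffice command that converts one RTF file to HTML."""
    return [binary, "--headless", "--convert-to", "html",
            "--outdir", out_dir, rtf_path]


def convert_rtf_file(rtf_path, out_dir, binary="soffice"):
    """Run the conversion and return the resulting HTML as a string."""
    subprocess.check_call(build_convert_command(rtf_path, out_dir, binary))
    # LibreOffice names the output after the input file, with a new extension.
    base = os.path.splitext(os.path.basename(rtf_path))[0]
    html_path = os.path.join(out_dir, base + ".html")
    with open(html_path, encoding="utf-8", errors="replace") as html_file:
        return html_file.read()
```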

Last steps

add a cronjob that runs after import_data
use flock! we should lock the same file for import_data and convert_rtf, in case the convert script takes longer than 15 minutes (which seems like a possibility, if we import a large quantity of bills)
merge this PR, and pin the requirements to master (as we did in the past)
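
To illustrate the locking step: the list above most likely refers to the `flock` command in the crontab, but the same idea can be sketched in Python with `fcntl.flock`, which uses the same kind of advisory lock. The helper name `exclusive_lock` and the lock path are hypothetical. This is POSIX-only.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path, blocking=True):
    """Hold an exclusive advisory lock on lock_path inside the with-block.

    With blocking=False, raises BlockingIOError if another holder has the lock.
    """
    lock_file = open(lock_path, "a")
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        fcntl.flock(lock_file, flags)
        yield
    finally:
        # Closing the file releases the lock.
        lock_file.close()
```

If both import_data and convert_rtf wrap their work in `exclusive_lock` on the same path (for example a hypothetical `/tmp/councilmatic.lock`), a long conversion run would make the next import wait, or skip when `blocking=False`, instead of overlapping with it.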

reginafcompton commented 6 years ago

We've been able to convert RTF to HTML with a new management script in django-councilmatic. I've added it to our NYC crontasks, as well.

We also have an open PR that will allow us to pull in PDFs of Bill text, too. Marking this issue as closed!

datamade / nyc-council-councilmatic

Bill detail: Preserve the formatting of legislative text #108