datamade / how-to

📚 Doing all sorts of things, the DataMade way

Named entity recognition with deep learning #44

Closed: jeancochrane closed this issue 3 years ago

jeancochrane commented 5 years ago

Background

Recent advances have made deep learning much easier and more cost-effective. These advances include:

These developments have implications for a number of common problems that our clients face, including OCR and entity resolution. To get a better sense of the current landscape and the possibilities it offers for our business, I want to focus on named entity recognition, a document analysis task that has proven challenging in the past.

Proposal

I propose a medium-size R&D project to evaluate the feasibility of deep learning for document analysis tasks for DataMade. I'll use named entity recognition on either Chicago Councilmatic data or WhoWasInCommand data as a starting point for evaluation.

As this project involves exploring a developing field, I'm proposing to do more reading and writing than I have for R&D projects in the past. I plan to start with a proposal for the specific task I want to accomplish with a particular set of documents. Then, I'll perform a field scan to identify possible solutions to the problem. Finally, I'll try as many solutions as I can, and produce a report evaluating the costs and benefits of each solution.

Deliverables

Deliverables for this R&D will include, in the following order:

  1. A research outline for named entity recognition on either Chicago Councilmatic data or WhoWasInCommand data, detailing what specific task will be accomplished
  2. A field scan
  3. A document comparing a few different approaches

Timeline

I think I can get this R&D done in four R&D days (two months). My anticipated timeline is:

Day 1: Draft research outline, request a review, and start field scan
Day 2: Finalize research outline, finish drafting field scan, and request review
Day 3: Begin evaluating solutions
Day 4: Finish evaluating solutions and request review on a report

There's a good possibility that I won't have enough expertise in some solutions to be able to evaluate them completely. When this happens, I'll prioritize opening up an issue and moving on to another solution.

jeancochrane commented 5 years ago

OCR is particularly interesting for us here, especially when it comes to tables and charts.

Zoning is an interesting case. Let's see if we can figure out which entities benefit from requesting the zoning change.

Ted Han and other NewsNerdery folks may have ideas about what to consider for the field scan.

jeancochrane commented 4 years ago

Here's my articulation of the task at hand (AKA my research outline). I'd like to extract the following entities out of ordinances tagged as Zoning Reclassification in Councilmatic:

Generally, I expect this to require the following subtasks (a rough sketch of how they might fit together follows the list):

  1. Download the PDFs for all relevant ordinances
  2. For each PDF:
    1. Run OCR (text extraction) on the PDF, including form and table extraction where applicable
    2. If forms or tables were detected, check them for the relevant entities
    3. If no entities were found in forms or tables, pass the text through to free-text entity recognition
      • Depending on the solution, this may require custom training
    4. Check whether the relevant entities were found
    5. Produce graphs for any entities found in the document
  3. Merge the graphs with entity resolution
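
To make the flow concrete, here's a very rough sketch of how I imagine these steps fitting together. Every helper function here is a hypothetical placeholder for whichever OCR and entity recognition tools come out of the field scan:

def build_entity_graph(ordinance_pdfs):
    """Extract a merged entity graph from a set of zoning reclassification PDFs."""
    graphs = []
    for pdf in ordinance_pdfs:
        # Step 2.1: OCR the document, including any forms and tables
        text, forms, tables = run_ocr(pdf)

        # Step 2.2: look for the relevant entities in structured elements first
        entities = find_entities_in_structures(forms, tables)

        # Step 2.3: fall back to free-text entity recognition
        if not entities:
            entities = recognize_entities(text)

        # Steps 2.4-2.5: build a per-document graph from whatever was found
        if entities:
            graphs.append(build_graph(entities))

    # Step 3: merge the per-document graphs with entity resolution
    return merge_graphs(graphs)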

Anything I'm missing here?

jeancochrane commented 4 years ago

AWS Textract

Turns out that the task I set out to accomplish (extract a relationship graph from Councilmatic zoning amendment PDFs) is entirely achievable with AWS Textract. This was a strong sign that the service would be useful, so I decided to take some time to evaluate it first before doing a more extensive field scan.

I'm not done with my analysis yet, but based on what I've done so far I would already recommend Textract for OCR, table extraction, and basic entity recognition tasks. What follows are my notes to date.

Background: Why Textract?

Textract is AWS's OCR service. It can take in images or PDFs via its API, either as byte streams or S3 objects, and extract the following types of data:

Free text extraction returns the raw text contained in the document, along with metadata like the bounding box of the detected text and the page it was found on. Here's a sample from the console demo, which lets you upload one document and visualize the results:

[Screenshot: console demo showing free text extraction from a printed document]

Textract is really good at extracting printed text, but it's not as good at handwritten text:

[Screenshot: console demo showing weaker extraction of handwritten text]

The table and form extraction services can pull out structured data from document pages:

[Screenshot: console demo showing table and form extraction]

Form extraction is the killer feature that makes Textract workable for the zoning analysis task. Even though Textract can't actually recognize entities, it works for this task because every zoning amendment ordinance must include an application in which each of the relevant entities (applicant, attorney, and property) is clearly labeled.

Pricing

The cost of free text extraction in Textract is pretty reasonable at $1.50/1,000 pages. Tables cost $15/1,000 pages and forms cost $50/1,000 pages, however, which is a lot more imposing. For context, there are about 55,000 pages of zoning amendments in Chicago Councilmatic.

To minimize costs, I was able to organize my pipeline to perform free text extraction on all pages of all documents, then look for the first page whose title indicates that it's an applicant form. From there, I can pull out just the application pages and feed only those to the pricier form extraction.
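
Sketched out with boto3, that filtering step looks roughly like this. The bucket name and the form-title heuristic are made up, and it assumes each ordinance has already been split into single-page images so the synchronous Textract calls can handle them:

import boto3

textract = boto3.client("textract")

def find_application_page(page_keys, bucket="councilmatic-zoning-pages"):
    """Run cheap free-text OCR on each page and return the first page whose
    text looks like the start of a zoning application form."""
    for key in page_keys:
        response = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        lines = [
            block["Text"]
            for block in response["Blocks"]
            if block["BlockType"] == "LINE"
        ]
        if any("APPLICATION" in line.upper() for line in lines[:5]):
            return key
    return None

def extract_application_form(key, bucket="councilmatic-zoning-pages"):
    """Run the much pricier form analysis on the single application page."""
    return textract.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )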

The pricing for this pipeline is something like:

55,000 pages * ($1.50/1,000 free-text pages) = $82.50
1,382 documents * (1 application/document) * ($50/1,000 form pages) = $69.10

Gotchas

Addendum: AWS Comprehend

Comprehend is AWS's text analysis service. Unlike Textract, Comprehend can perform actual entity recognition, along with classification tasks like language detection and clustering tasks like topic modelling and keyword extraction.

I haven't actually evaluated Comprehend in detail yet, but it seems intriguing. The way that Comprehend works for entity recognition is that it comes with pretrained models that can detect some basic entity types like Person and Organization. Here's a preview in the console:

[Screenshot: Comprehend console preview showing detected entities]
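
Calling the pretrained entity detection from boto3 is pretty simple; here's a minimal sketch (the sample sentence is made up):

import boto3

comprehend = boto3.client("comprehend")

response = comprehend.detect_entities(
    Text="The applicant, Acme Development LLC, is represented by Jane Doe.",
    LanguageCode="en",
)

# Each detected entity comes back with a type (PERSON, ORGANIZATION, LOCATION,
# etc.), a confidence score, and character offsets into the input text.
for entity in response["Entities"]:
    print(entity["Type"], entity["Text"], round(entity["Score"], 2))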

Up to 12 custom entities are also supported, but you have to bring your own training data. While you could also just train a model yourself on that data, the value propositions of Comprehend are:

  1. Comprehend trains on top of pretrained models, theoretically giving better performance for small training sets
  2. Comprehend performs some AutoML to try to detect good base models and hyperparameters

Neither of these propositions is demonstrated in any detail in the docs, so it remains to be seen how true they are.

jeancochrane commented 4 years ago

The data processing pipeline for https://github.com/datamade/councilmatic-zoning-analysis is all set and ready to go. On Friday I'm going to kick off the job and request review.

My high-level takeaway so far is that working with the Textract API is less intuitive than I had hoped based on the web UI. The worst part is the data model: all results are returned as "blocks" (basically an array of recognized elements), where each block can contain a nested tree of other blocks to which it is a parent, and one given element (such as the text "Zoning Amendment Application") can be duplicated as multiple block types -- word, line, or "key-value pair" (for form elements). The consequence of this data model is that the two simple operations I wanted to do as part of this research (merge together all the text of a document and search it, then extract a specific form element) result in some pretty complicated dictionary-parsing code.
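
To give a flavor of what that dictionary-parsing looks like, here's a rough sketch of pulling the value of a single labeled form field out of an analyze_document response (the label-matching logic is just illustrative):

def get_form_value(blocks, label):
    """Return the value text for the form field whose key text contains `label`."""
    block_map = {block["Id"]: block for block in blocks}

    for block in blocks:
        if (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        ):
            if label.lower() not in _child_text(block, block_map).lower():
                continue
            # The KEY block points at its VALUE block, which in turn points at
            # the WORD blocks that make up the actual value text
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    value_block = block_map[rel["Ids"][0]]
                    return _child_text(value_block, block_map)
    return None


def _child_text(block, block_map):
    """Join the text of a block's child WORD blocks."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            words.extend(
                block_map[child_id]["Text"]
                for child_id in rel["Ids"]
                if block_map[child_id]["BlockType"] == "WORD"
            )
    return " ".join(words)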

Because of these data modelling issues, I'd like to do a rapid evaluation of Tesseract as an appendix to this R&D effort. Tesseract doesn't support form-level OCR, but if it's easier to use for simple word-level OCR then it could be a better choice for the Forest Preserve project.

jeancochrane commented 4 years ago

I had a good meeting with Haru Coryne from ProPublica this morning to chat about the data we have. Haru thought the idea was interesting, in particular the collection of attorneys, but he doesn't immediately have any domain expertise in the topic, and the best he could offer was talking to a lawyer to see if any of the top firms stood out to them.

However, Haru did think that there were some other fields that might be more compelling for us to try to parse, including:

  1. Which projects are approved vs. denied
  2. Wards that the projects are in (we could geocode and cross-reference these, but this may also be a standard form element that we could just parse)
  3. Project details (on a separate sheet in the ordinance, not necessarily as standardized)
  4. Equity stakes from the EDS

I'm going to take a quick look this month and see if we can adjust the pipeline to grab any of these fields. In particular I'm going to focus on 4., which Haru seemed most excited about.

hancush commented 4 years ago

love the eds love 🐴🙃

jeancochrane commented 4 years ago

I haven't heard from Haru since I asked for some more clarification on which equity stakes field he's interested in. I think that we could potentially pull out some of these fields, particularly the equity stakes fields if they correspond to the fields that I'm thinking of, but it won't be simple and without buy-in from Haru I don't think it's worth it at this point.

I'd like to go ahead and write up a quick blog post explaining what we did here to wrap up this R&D. Would love to get takes from the rest of the R&D team about what might be interesting for the public to hear from us, and how much we can safely share.

jeancochrane commented 4 years ago

Let's reorient here to write some documentation based on our CCAO work.

jeancochrane commented 4 years ago

Made some good progress today on open source alternatives to AWS Comprehend for entity extraction. I got https://github.com/chicago-justice-project/article-tagging up and running and was able to successfully train the address extraction model. On the training and validation sets, performance was pretty good (AUC ~0.96), but in practice I wasn't super happy with the results of my testing. It worked OK on the example given in the documentation:

>>> article_text = ('The homicide occurred at the 1700 block of S. Halsted Ave.'
...   ' It happened just after midnight. Another person was killed at the'
...   ' intersection of 55th and Woodlawn, where a lone gunman')
>>> geo.extract_geostrings(article_text)
([['1700', 'block', 'of', 'S.', 'Halsted', 'Ave.', 'It'], ['intersection', 'of', '55th', 'and', 'Woodlawn,']], [array([0.72149652, 0.78146851, 0.79187167, 0.78661132, 0.74181545,
       0.67825341, 0.60275376]), array([0.55764413, 0.611085  , 0.78005159, 0.70029932, 0.67056262])])

But when I threw it another sample string it failed pretty badly:

>>> article_text = ('In Lincoln Park, Sterling Bay’s newly redeveloped medical research facility '
...    ' at 2430 N. Halsted St. is already at more than 50 percent capacity. Sterling Bay '
...    'purchased the property from Lurie Children’s Hospital in 2018, Goudie said.')
>>> geo.extract_geostrings(article_text)
([], [])

I want to try a transfer learning model using spaCy's entity extraction model (docs on that here) and see how that does. I'm also curious to dig into the training data and see how the addresses are formatted. It's possible that there just aren't enough examples of typical USPS-style addresses for this to work well.
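
As a point of comparison, running spaCy's pretrained English pipeline over that same Sterling Bay passage would look roughly like this (assuming the en_core_web_sm model has been downloaded; I haven't run this comparison yet):

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp(
    "In Lincoln Park, Sterling Bay's newly redeveloped medical research facility "
    "at 2430 N. Halsted St. is already at more than 50 percent capacity."
)

# The pretrained pipeline tags entities like ORG, PERSON, GPE, and FAC;
# fine-tuning for addresses would build on top of these labels.
for ent in doc.ents:
    print(ent.text, ent.label_)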

jeancochrane commented 4 years ago

I also found a nice roundup of papers on table extraction with deep learning: https://nanonets.com/blog/table-extraction-deep-learning/ This is the first step toward an open source alternative to Textract. One recent paper even has some code we could try: https://github.com/DevashishPrasad/CascadeTabNet

jeancochrane commented 3 years ago

Documentation we need for CCAO:

hancush commented 3 years ago

Revised deliverable: Wrap up key points (how to learn more)

jeancochrane commented 3 years ago

I had hoped to add some documentation here, but my time is running low. Instead I'll leave some pointers to further resources, which may be useful in case anyone picks up this thread in the future.

Background

My comment above is a good introduction to Textract. Textract is useful when:

If your tables have consistent bounds on the page, tabula is a simpler open source alternative that will let you define bounds for tables to extract. If you just need to extract free text and not structured data, pytesseract is another open source option.
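
For the free-text case, a minimal pytesseract call looks something like this (assuming the Tesseract binary is installed and the page has been exported as an image; the file name is made up):

import pytesseract
from PIL import Image

# Extract the raw text from a single scanned page
text = pytesseract.image_to_string(Image.open("ordinance-page-1.png"))
print(text)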

Data structures

The key challenge of working with Textract is that the data structure it returns, blocks, is really confusing. Basically, Textract returns a huge array of blocks, where each block can represent a page, line, or word in the text. The blocks can have hierarchical relationships with one another (a word is part of a line, which is part of a page), but the data is returned as a flat array, so it's up to you to reconstruct the hierarchies you need or filter the blocks for what you want (typically a line or a word).
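
For example, reconstructing a document's free text from a detect_document_text response boils down to filtering for LINE blocks and joining their text:

def document_text(response):
    """Merge a Textract response's free text by joining its LINE blocks."""
    return "\n".join(
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    )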

The Textract docs have some nice examples that show how to handle common operations and give a sense of what working with blocks looks like.

Examples