SolutionGuidance / psm

Welcome to the Medicare/Medicaid Provider Enrollment Screening Portal
http://projectpsm.org/

Full-text search #934

Open jasonaowen opened 6 years ago

jasonaowen commented 6 years ago

We need to add the ability to do full-text search in the PSM.

This comes from a few search-related requirements:

  • psm-FR-2.14 The PSM shall use consistent provider naming conventions to differentiate between first names, last names, and business or corporate names and to allow flexible searches based on the provider name.
  • psm-FR-7.13 The PSM shall support searching and pattern-matching based on all fields accepted as input (and based on all reasonable combinations of such fields).
  • psm-IU-2.5 The PSM shall provide full-text search capability.

The first is relevant to this insofar as we need to be able to search for names; the other two seem to be asking the same thing in different words.

Investigate the available full-text search solutions, consider their requirements, choose one, and integrate it.

jasonaowen commented 6 years ago

I think the biggest constraint we have as we investigate solutions is that we don't want to complicate deployment, which means we don't want any external servers or services.

Given that, the two solutions @HemKal and I have found so far are:

Hibernate Search, which is built on Apache Lucene, is provided by WildFly (along with Hibernate itself); if we ever change application servers, we could bundle it in our build instead. Lucene is more powerful than PostgreSQL's full-text search, although I'm not yet sure what exactly that means in practice. Hibernate Search integrates with Hibernate to update its index each time we commit modifications to an @Entity (including creation). It requires storing an index on disk; it's not yet clear to me where that should live, how big it would get, or what deployment challenges that might bring (such as whether the application server's OS user has permission to read and write those files). It's also not clear to me how it would handle searching across multiple entities; the examples I've seen so far all search within a single entity type. Finally, I don't yet understand how it scales across multi-node application servers (nor, to be fair, how important that might be for the PSM).

PostgreSQL has full-text search built in. We could either add a tsvector column to each table we want to search, or we could make a denormalized table that relates back to other tables. Per-table columns could be kept up to date with triggers; the preferred solution for denormalization seems to be a materialized view, which needs to be refreshed periodically.
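To make the denormalized option more concrete, here is a minimal sketch in plain Java of the underlying idea: one index whose entries point back at rows of different entity types. It is a toy in-memory inverted index, not PostgreSQL's tsvector machinery and not Hibernate Search; the class name, the "Type#id" reference format, and the entity names used with it are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy in-memory inverted index; illustrative only (hypothetical names throughout). */
class SearchIndexSketch {
    // token -> references of the form "EntityType#id", which play the role of
    // the foreign keys a denormalized search table would hold.
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Indexes every token of every field value under a reference to the source row. */
    void index(String entityType, long id, String... fieldValues) {
        String ref = entityType + "#" + id;
        for (String value : fieldValues) {
            for (String token : tokenize(value)) {
                postings.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(ref);
            }
        }
    }

    /** Returns the references that contain every token of the query (AND semantics). */
    Set<String> search(String query) {
        Set<String> result = null;
        for (String token : tokenize(query)) {
            Set<String> hits = postings.getOrDefault(token, Set.of());
            if (result == null) {
                result = new LinkedHashSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? Set.of() : result;
    }
}
```

Indexing a hypothetical "Enrollment" row and a hypothetical "Organization" row that both mention "Doe" and then searching for "doe" returns references to both, which is the cross-entity behavior in question for either solution. A real implementation would also need stemming, ranking, and a way to keep the index in sync with the tables, which is what triggers or a refreshed materialized view would provide on the PostgreSQL side.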

Both of these solutions have some challenges we'll need to figure out:

We will continue to research this.

slifty commented 6 years ago

I think I got some of your answers (originally posted by @kfogel)

(@kfogel adds: Thanks to Katherine Stewart of the Louisiana Department of Health and Darryl Hellams of Virginia Medicaid for taking the time to answer.)

jasonaowen commented 6 years ago

I think it makes sense for us to say, at least for a first pass, that full text search is for providers and service admins to search enrollments.

High-level things that I suggest we not include in search (at least for now):

I expect that last one, the contents of uploaded licenses, to be something that we do want to support someday; for the moment, however, it would mean parsing and indexing arbitrary files. If the provider uploaded a picture or scan of their license, do we need to do text recognition on that image? If they uploaded a PDF, can we usefully extract text from it? If the PDF is effectively an image, with no textual data, do we then need to do text recognition? This is a large enough problem that I suggest we defer it until we get the structured data we already have in a searchable format.
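If we do pick this up later, the questions above amount to a triage step in front of the indexer. The following is only a rough sketch of that branching, assuming we classify uploads by their leading bytes; the class, enum, and method names are hypothetical, and deciding whether a PDF actually contains a text layer would need a PDF library that this sketch does not include.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical triage of uploaded files before indexing; illustrative only. */
class UploadTriageSketch {
    enum Strategy {
        TEXT_RECOGNITION,     // image or scan: text recognition would be required
        TRY_TEXT_EXTRACTION,  // PDF: try extracting text, fall back to recognition if none
        SKIP                  // unrecognized format: do not index the contents
    }

    // Well-known file signatures (magic bytes).
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};

    /** True if data begins with the given prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Chooses a strategy from the first bytes of an uploaded file. */
    static Strategy classify(byte[] head) {
        if (startsWith(head, PDF)) {
            return Strategy.TRY_TEXT_EXTRACTION;
        }
        if (startsWith(head, PNG) || startsWith(head, JPEG)) {
            return Strategy.TEXT_RECOGNITION;
        }
        return Strategy.SKIP;
    }
}
```

Even in this simplified form, two of the three branches end in text recognition, which is part of why deferring the work seems reasonable.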

What specific @Entitys do we want to be searchable?

I reviewed the current list of Hibernate entities, and I think these are the ones we care about:

This may not be a complete list.

I suspect that we may need to do some of the data model improvements mentioned in #57 to effectively implement full text search; many of the relationships are application-side in a way that makes linking individual entities to the underlying concept difficult at best.