Improve search functionality (goal is to improve how it returns results for lv-dof)

codeforamerica / naics-api

Basic API to return NAICS codes and information

BSD 3-Clause "New" or "Revised" License


Improve search functionality (goal is to improve how it returns results for lv-dof) #20

Open louh opened 10 years ago

louh commented 10 years ago

Some things @rclosner and I discussed:

fuzzy/partial matches
search terms with other words between them
ranked relevance (terms matching title or illustrative examples should rank higher than in description)

cc @migurski

lovehandle commented 10 years ago

@louh @migurski

I think for our application it makes the most sense to insert NAICS API data into a DB. In the same vein, I think the API would probably draw some benefit from moving to a DB as well (as opposed to a flat JSON file). We could utilize an existing FTS program, and we'd probably gain some response time speed improvements while we're at it.

Maybe this deserves a new thread, but what are the thoughts on adding tags to classification codes? I think the additional metadata would help FTS queries retrieve relevant codes.

louh commented 10 years ago

What's an FTS?

How are tags generated? I think the "index entries" and "illustrative examples" were an attempt by the NAICS writers to provide text hooks for finding codes by similar titles.

migurski commented 10 years ago

FTS is full-text search.

I’d hold off on a DB for now; the dataset IIRC is very small, and if loaded at launch time could happily hang out in memory without the overhead of an external database service. I’m not even sure this is worth testing for now; the simplicity benefit of running from flat local files is tremendous. Building a simple full-text index is very easy, if it is even necessary here.
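As a rough illustration of the in-memory approach described above, here is a minimal sketch of an inverted index built once at startup. The sample records, the `title` field name, and the helper names (`tokenize`, `build_index`, `search`) are assumptions made for this example, not the API's actual schema or code.

```python
import re
from collections import defaultdict

# Illustrative sample records; the real data would be loaded from the flat JSON file.
RECORDS = [
    {"code": 722511, "title": "Full-Service Restaurants"},
    {"code": 722513, "title": "Limited-Service Restaurants"},
    {"code": 445110, "title": "Supermarkets and Other Grocery Stores"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each token to the set of record positions whose title contains it."""
    index = defaultdict(set)
    for pos, record in enumerate(records):
        for token in tokenize(record["title"]):
            index[token].add(pos)
    return index


def search(index, records, query):
    """Return records containing every query token, in any order or position."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return [records[i] for i in sorted(hits)]


# Built once at launch; lookups afterwards never touch disk or a database.
INDEX = build_index(RECORDS)
```

Because matching is per token, a query such as "full restaurants" still finds "Full-Service Restaurants" even though another word sits between the terms.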

lovehandle commented 10 years ago

That's a good point, actually. Using a DB for the API would probably be a little overkill. Mostly what I'd like to get is something a little fuzzier than exact match. I've been playing around with some of the FTS JS libraries that have this out of the box (e.g. lunr, fullproof, etc.), but they've all been a little wonky.
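One way to get "fuzzier than exact match" without a search library is to map a misspelled term onto the closest known token before searching. The sketch below uses Python's standard `difflib` for this; the vocabulary list, the `fuzzy_terms` name, and the 0.8 cutoff are assumptions chosen for illustration, and this is not how lunr or fullproof work internally.

```python
import difflib

# Hypothetical vocabulary; in practice this could be every token in the index.
VOCABULARY = ["restaurants", "bakeries", "florists", "breweries"]


def fuzzy_terms(term, vocabulary=VOCABULARY, cutoff=0.8):
    """Return up to three vocabulary words similar to `term`, best match first."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)
```

A query term would first be passed through `fuzzy_terms`, and the returned words would then be used in place of the original term. Lowering the cutoff catches more typos at the cost of more false matches.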

My initial thought would be that the addition of tags (maybe generated by Mechanical Turk?) would weight the searches in the right direction. Not sure how feasible that is, though. Open to ideas.

louh commented 10 years ago

I'd suggest weighting searches in this order: title, index entries, illustrative examples, description
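The ordering suggested above could be sketched as a per-field weight table. The field names (`title`, `index`, `examples`, `description`) and the numeric weights are assumptions for this example; only the relative order comes from the comment.

```python
import re

# Hypothetical weights: higher means a match in that field counts for more.
FIELD_WEIGHTS = {"title": 4, "index": 3, "examples": 2, "description": 1}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(record, query):
    """Sum the field weight for every distinct query term found in each field."""
    terms = set(tokenize(query))
    total = 0
    for field, weight in FIELD_WEIGHTS.items():
        value = record.get(field, "")
        # Some fields may hold a list of strings (assumed here); flatten them.
        text = " ".join(value) if isinstance(value, list) else value
        total += weight * len(terms & set(tokenize(text)))
    return total


def rank(records, query):
    """Return matching records ordered from highest to lowest score."""
    scored = [(score(r, query), r) for r in records]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored]
```

With these weights, a record whose title contains the term outranks one that only mentions it in the description, which is the behavior asked for at the top of the thread.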

As an aside, the DOF front end is now storing all search inputs in the background for later analysis. Maybe something useful could come of that.

lovehandle commented 10 years ago

@louh interesting. Should we persist search terms in the session? We could save them into the DB for later usage.

louh commented 10 years ago

@rclosner Yep, I'm trying to figure out how LocalStorage works now... if we want to save it, it'll be there for now on the user side.

lovehandle commented 10 years ago

@louh :+1:

migurski commented 10 years ago

LocalStorage or cookies are probably a better bet, since they will leave us with a read-only application for lower maintenance burden.

Fuzzy matches are an interesting opportunity for Git/JSON-driven search terms. They can be initially generated automatically, committed to the project, and then further edited via manual updates and such. Things like synonyms and related terms are so human.
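The Git/JSON-driven idea could look something like the sketch below: a hand-editable JSON file of related terms, committed to the project and used to expand a query before searching. The file contents, its shape, and the function names are all hypothetical; the JSON is inlined as a string only to keep the example self-contained.

```python
import json

# Stand-in for a synonyms file kept in the repository and edited by hand.
SYNONYMS_JSON = """
{
  "restaurant": ["diner", "eatery"],
  "lawyer": ["attorney"]
}
"""


def load_synonyms(raw):
    """Build a lookup in which every word in a group maps to the whole group."""
    lookup = {}
    for canonical, alternates in json.loads(raw).items():
        group = {canonical, *alternates}
        for word in group:
            lookup.setdefault(word, set()).update(group)
    return lookup


def expand_query(query, lookup):
    """Return the sorted set of query terms plus any related terms."""
    expanded = set()
    for term in query.lower().split():
        expanded |= lookup.get(term, {term})
    return sorted(expanded)


LOOKUP = load_synonyms(SYNONYMS_JSON)
```

An automatically generated first pass could populate the file, with later manual commits refining the groups, so the application itself stays read-only.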