abingham / decktape.io

Website for running decktape conversions.

Better candidate matching #14

Closed abingham closed 8 years ago

abingham commented 8 years ago

Right now the candidate matching is pretty dumb: it's just the Levenshtein distance from the target URL. This means that literally everything matches to some degree. We should figure out a better approach that takes into account what the user means. This will probably involve matching on domains or perhaps on path segments...something interesting.
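To make the domain/path-segment idea concrete, here's a rough sketch of one way it could work. This is not what decktape.io does; the function names (`url_parts`, `match_score`, `rank_candidates`) and the scoring rule are made up for illustration. The assumption is that a candidate on a different host should not match at all, and that candidates on the same host rank higher the more leading path segments they share with the target.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a URL into (host, path segments); tolerates a missing scheme."""
    if "//" not in url:
        url = "//" + url  # lets urlsplit treat the leading token as the host
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    segments = [s for s in parts.path.split("/") if s]
    return host, segments


def match_score(target, candidate):
    """Hypothetical score: 0 if hosts differ, else 1 + shared leading path segments."""
    t_host, t_segs = url_parts(target)
    c_host, c_segs = url_parts(candidate)
    if not t_host or t_host != c_host:
        return 0
    shared = 0
    for a, b in zip(t_segs, c_segs):
        if a != b:
            break
        shared += 1
    return 1 + shared


def rank_candidates(target, candidates):
    """Return candidates with a non-zero score, best match first."""
    scored = [(match_score(target, c), c) for c in candidates]
    # sorted() is stable, so equally scored candidates keep their input order.
    return [c for score, c in sorted(scored, key=lambda sc: -sc[0]) if score > 0]
```

Unlike plain edit distance, this drops unrelated hosts entirely instead of giving them a weak match. A fuzzy score could still be used to break ties among candidates on the same host.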

abingham commented 8 years ago

Here's one resource on matching algorithms.

abingham commented 8 years ago

Interesting, simple, Levenshtein-based approach.

abingham commented 8 years ago

stringscore seems pretty great.

abingham commented 8 years ago

Part of what we need to do for candidate matching is show only the most recent successful conversion for a given URL. I think we can do that more quickly in MongoDB than in Python, and I think this snippet does the trick:

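import pymongo

# Connect to the local MongoDB instance and select the file_ids collection.
client = pymongo.MongoClient('localhost', 27017)
db = client.decktape_io
coll = db.file_ids

# For each URL, keep only the metadata of its most recent completed conversion.
latest = coll.aggregate([
    {'$match': {'metadata.status': 'complete'}},   # successful conversions only
    {'$sort': {'metadata.timestamp': -1}},         # newest first
    {'$group': {'_id': '$metadata.url',            # one result per URL
                'md': {'$first': '$metadata'}}}    # first seen is the newest
])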

Put this behind a new API in the result-db and test the heck out of it.
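For the testing part, one option is a plain-Python reference that computes the same answer the pipeline is meant to produce, so tests can compare the two on small fixtures. This is only a sketch: `latest_complete_by_url` is a hypothetical helper, and it assumes documents shaped like the ones the snippet queries (a `metadata` dict with `status`, `url` and `timestamp`, where timestamps are comparable values).

```python
def latest_complete_by_url(docs):
    """Return {url: metadata} with the newest 'complete' metadata for each URL."""
    latest = {}
    for doc in docs:
        md = doc["metadata"]
        if md.get("status") != "complete":
            continue  # mirrors the $match stage
        current = latest.get(md["url"])
        # Keeping the largest timestamp mirrors $sort (descending) + $first.
        if current is None or md["timestamp"] > current["timestamp"]:
            latest[md["url"]] = md
    return latest
```

A test could insert a handful of documents, run the aggregation, and check that each `_id`/`md` pair agrees with this function's output for the same documents.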

abingham commented 8 years ago

Fixed in fe9210ced95b76e7705d33ff7f191cb24df0582e