Closed abingham closed 8 years ago
Here's one resource of algorithms.
stringscore seems pretty great.
Part of what we need to do for candidate matching is show only the most recent successful conversion for a given URL. I think we can do that more quickly in MongoDB than in Python, and I think this snippet does the trick:
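    import pymongo

    client = pymongo.MongoClient('localhost', 27017)
    db = client.decktape_io
    coll = db.file_ids

    # Newest completed conversion per URL: sort newest-first, then keep
    # the first metadata document seen in each URL group.
    coll.aggregate([
        {'$match': {'metadata.status': 'complete'}},
        {'$sort': {'metadata.timestamp': -1}},
        {'$group': {'_id': '$metadata.url',
                    'md': {'$first': '$metadata'}}}
    ])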
Put this behind a new API in the result-db and test the heck out of it.
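One way the API could look, as a rough sketch only: the function name `latest_successful_conversions` and the dict return shape are my assumptions, not what is in the result-db today. Taking the collection as an argument means tests can pass in a stand-in object instead of a live MongoDB.

```python
def latest_successful_conversions(coll):
    """Map each URL to the metadata of its newest completed conversion.

    `coll` is anything with a pymongo-style `aggregate(pipeline)` method.
    The function name and return shape are hypothetical.
    """
    pipeline = [
        # Only conversions that finished successfully.
        {'$match': {'metadata.status': 'complete'}},
        # Newest first, so $first below picks the latest per URL.
        {'$sort': {'metadata.timestamp': -1}},
        {'$group': {'_id': '$metadata.url',
                    'md': {'$first': '$metadata'}}},
    ]
    return {doc['_id']: doc['md'] for doc in coll.aggregate(pipeline)}
```

With a real collection this would be called as `latest_successful_conversions(db.file_ids)`.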
Fixed in fe9210ced95b76e7705d33ff7f191cb24df0582e
Right now the candidate matching is pretty dumb: it's just Levenshtein distance from the target URL. This means that literally everything matches to some degree. We should figure out a better approach that takes into account what the user means. This will probably involve matching on domains or perhaps on path segments...something interesting.
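To make the domain/path-segment idea concrete, here is a minimal sketch using only the standard library. The names `match_score` and `rank_candidates`, the 0.5/0.5 weighting, and the choice to ignore a leading `www.` are all assumptions for illustration, not a settled design. The point is that a candidate on a different host scores zero, so it no longer "matches to some degree".

```python
from urllib.parse import urlsplit


def _parts(url):
    """Return (host, path_segments) for a URL, tolerating a missing scheme."""
    if '://' not in url:
        url = '//' + url  # lets urlsplit treat the leading part as the host
    split = urlsplit(url)
    host = (split.hostname or '').lower()
    if host.startswith('www.'):
        host = host[4:]
    segments = [s for s in split.path.split('/') if s]
    return host, segments


def match_score(target, candidate):
    """Score in [0, 1]: 0 for a different host, higher for a longer shared path prefix."""
    t_host, t_segs = _parts(target)
    c_host, c_segs = _parts(candidate)
    if not t_host or t_host != c_host:
        return 0.0
    # Count leading path segments the two URLs have in common.
    shared = 0
    for a, b in zip(t_segs, c_segs):
        if a != b:
            break
        shared += 1
    longest = max(len(t_segs), len(c_segs))
    path_score = shared / longest if longest else 1.0
    # Hypothetical weighting: half for the host, half for the path.
    return 0.5 + 0.5 * path_score


def rank_candidates(target, candidates):
    """Return candidates with a non-zero score, best match first."""
    scored = [(match_score(target, c), c) for c in candidates]
    scored.sort(key=lambda pair: -pair[0])
    return [c for score, c in scored if score > 0]
```

Something like stringscore or Levenshtein distance could still be applied within a host to break ties between paths; this sketch only shows the filtering step.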