Wikidata / soweego

Link Wikidata items to large catalogs
https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
GNU General Public License v3.0

URL Tokens Feature and correct drop duplicate #330

Closed tupini07 closed 5 years ago

tupini07 commented 5 years ago

This PR introduces the URL Tokens feature, which measures the similarity between two lists of URL tokens while taking a list of URL stopwords into account (see https://github.com/Wikidata/soweego/issues/243#issuecomment-489113736).
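To illustrate the idea, here is a minimal sketch, not the code from this PR. It assumes that URLs are split on non-alphanumeric characters, that stopwords are filtered out before comparison, and that similarity is the Jaccard index of the two token sets. The names `URL_STOPWORDS`, `url_tokens`, and `url_tokens_similarity` are hypothetical, and the stopword list is a made-up sample.

```python
import re

# Hypothetical sample of stopwords; the actual list is discussed in the linked issue.
URL_STOPWORDS = {"http", "https", "www", "com", "org", "net", "html"}


def url_tokens(url):
    """Split a URL on non-alphanumeric characters and drop stopwords."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return {t for t in tokens if t and t not in URL_STOPWORDS}


def url_tokens_similarity(source_urls, target_urls):
    """Jaccard index between the token sets of two lists of URLs (an assumption)."""
    source = set().union(*(url_tokens(u) for u in source_urls))
    target = set().union(*(url_tokens(u) for u in target_urls))
    if not source or not target:
        return 0.0  # nothing to compare
    return len(source & target) / len(source | target)


similarity = url_tokens_similarity(
    ["https://www.example.com/artist/john-doe"],
    ["http://music.org/john-doe"],
)
```

In this example the two URLs share the tokens `john` and `doe` out of five distinct non-stopword tokens, so the score is 2/5.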

This PR also changes how duplicate rows are dropped from a pandas DataFrame. Previously, rows were dropped only if they had the same column values. Now, rows are dropped only if they share the same index.
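The difference between the two behaviors can be sketched as follows. This is an illustrative example with made-up data, and it assumes the index-based variant is done with `Index.duplicated`; the exact call used in this PR is not shown here.

```python
import pandas as pd

# Toy data: Q1 and Q2 have identical values, while Q3 appears twice in the index.
df = pd.DataFrame(
    {"name": ["Ann", "Ann", "Bob", "Bob"], "born": [1950, 1950, 1960, 1961]},
    index=["Q1", "Q2", "Q3", "Q3"],
)

# Value-based: drops Q2 because its column values repeat those of Q1,
# but keeps both Q3 rows since their values differ.
by_values = df.drop_duplicates()

# Index-based: keeps the first row for each index label,
# so Q2 survives and the second Q3 row is dropped.
by_index = df[~df.index.duplicated(keep="first")]
```

With value-based deduplication, distinct items that happen to share the same values collapse into one row, while repeated entries for the same index can remain; the index-based variant avoids both.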

closes #312