This PR introduces the URL Tokens feature, which measures the similarity between two lists of URL tokens while ignoring a list of URL stopwords (see https://github.com/Wikidata/soweego/issues/243#issuecomment-489113736).
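As a rough illustration of the idea (not the code in this PR), the sketch below compares two token lists after removing stopwords. The function name `url_tokens_similarity`, the stopword set, and the choice of Jaccard overlap as the similarity measure are all assumptions made for the example; the actual feature may tokenize and score differently.

```python
# Hypothetical stopword list, for illustration only.
URL_STOPWORDS = {"http", "https", "www", "com", "org"}


def url_tokens_similarity(tokens_a, tokens_b, stopwords=URL_STOPWORDS):
    """Jaccard overlap of two URL token lists, ignoring stopwords.

    Jaccard is an assumption here, not necessarily the measure used in the PR.
    """
    set_a = set(tokens_a) - stopwords
    set_b = set(tokens_b) - stopwords
    # With no informative tokens on either side there is nothing to compare.
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


score = url_tokens_similarity(
    ["https", "www", "example", "com", "john", "doe"],
    ["http", "other", "org", "john", "doe"],
)
```

Here the stopwords are dropped first, so only `example`, `other`, `john`, and `doe` take part in the comparison.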
This PR also changes how duplicate rows are dropped from a pandas DataFrame. Previously, rows were dropped only when they had the same column values. Now the only rows dropped are those that share the same index.
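The difference between the two behaviours can be seen on a toy DataFrame. This is a minimal sketch with made-up data and column names; it assumes the first occurrence of each index is the one kept, which may not match the exact call used in this PR.

```python
import pandas as pd

# Toy data: rows 1 and 2 share column values, rows 1 and 3 share an index.
df = pd.DataFrame(
    {"name": ["a", "a", "b"], "url": ["x", "x", "y"]},
    index=["id1", "id2", "id1"],
)

# Old behaviour: drop rows whose column values repeat, whatever their index.
by_values = df.drop_duplicates()

# New behaviour: drop rows whose index repeats, whatever their column values.
by_index = df[~df.index.duplicated(keep="first")]
```

With value-based deduplication the `id2` row disappears even though its index is unique, while index-based deduplication keeps it and drops the second `id1` row instead.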
closes #312