MattX / Milton

A searchable article database

Deduplicate URLs based on text #6

Open MattX opened 4 years ago

MattX commented 4 years ago

When a new article is submitted, check whether an article with very high text similarity already exists, and if so, don't index it.

This would be more robust than trying to deduplicate URLs manually, since the same article can be reachable under several different URLs.
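
A minimal sketch of what such a check could look like, assuming word-level shingling with Jaccard similarity as the similarity measure. Nothing here is taken from Milton's code: the function names (`shingles`, `jaccard`, `is_duplicate`), the shingle size `k = 5`, and the `0.9` threshold are all hypothetical choices for illustration.

```python
import re


def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles of text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(new_text: str, existing_texts, threshold: float = 0.9, k: int = 5) -> bool:
    """True if new_text is at least `threshold` similar to any existing text.

    The threshold is an assumed value and would need tuning on real articles.
    """
    new_shingles = shingles(new_text, k)
    return any(
        jaccard(new_shingles, shingles(text, k)) >= threshold
        for text in existing_texts
    )
```

On submission, the indexer would call `is_duplicate(article_text, stored_texts)` and skip indexing when it returns `True`. This version compares against every stored article, so it is linear in the size of the database; for a larger collection, a sketching technique such as MinHash or SimHash is a common way to avoid the full pairwise comparison.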