centre-for-humanities-computing / friths


Duplicates and language detection #6

Open jankounchained opened 11 months ago

jankounchained commented 11 months ago

We have duplicate documents, as well as documents in languages other than English.

We need to remove the non-English documents. Duplicates should probably only be flagged, not removed, because they could still be relevant for the research question.
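
A rough sketch of what this could look like, not tied to how the corpus is actually stored in this repo. The names `normalize`, `flag_duplicates`, `clean_corpus` and the `is_duplicate` field are hypothetical. Duplicates are assumed to mean exact matches after lowercasing and whitespace normalization, and the language detector is passed in as a function so any library can be plugged in (for example `detect` from the `langdetect` package, which returns codes like `"en"`).

```python
import hashlib
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def flag_duplicates(docs: list[str]) -> list[bool]:
    """Return one flag per document: True if an earlier document
    has the same normalized text. Nothing is removed here."""
    seen: set[str] = set()
    flags: list[bool] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        flags.append(digest in seen)
        seen.add(digest)
    return flags


def clean_corpus(
    docs: list[str],
    detect_language: Callable[[str], str],
) -> list[dict]:
    """Drop documents not detected as English, then flag (but keep) duplicates.

    `detect_language` is an assumption: any callable that maps a text
    to a language code such as "en".
    """
    english = [doc for doc in docs if detect_language(doc) == "en"]
    flags = flag_duplicates(english)
    return [
        {"text": doc, "is_duplicate": is_dup}
        for doc, is_dup in zip(english, flags)
    ]
```

Only the first occurrence of a text stays unflagged; later copies get `is_duplicate=True` but remain in the output, so they can still be included or excluded per analysis. Near-duplicates (small edits, different boilerplate) would not be caught by this exact-hash approach.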

jankounchained commented 11 months ago

goal of lang detect: