BU-Spark / ml-herbarium

Herbaria ML

ML-Herbarium: Feature - generate corpus.txt with all the taxon names #32

Closed angietseng closed 2 years ago

angietseng commented 2 years ago

I saw repeated names in the corpus file. Are these names duplicates in the dataset too, or was it the code duplicating them?

The duplicates come from the dataset: the same name appears more than once, each time with a different scientific name authorship. I will modify the code to remove the repeated names.
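For reference, here is a minimal sketch of one way the repeated names could be dropped. It is not the code from this repository. It assumes each record is a (name, authorship) pair, and the function name `unique_taxon_names` and the sample records are hypothetical, made up for illustration.

```python
from typing import Iterable, List, Tuple


def unique_taxon_names(records: Iterable[Tuple[str, str]]) -> List[str]:
    """Return taxon names in first-seen order, ignoring authorship.

    `records` is assumed to be (name, authorship) pairs; the real
    dataset layout may differ.
    """
    seen = set()
    unique = []
    for name, _authorship in records:
        # Collapse runs of whitespace so "Quercus  alba" matches "Quercus alba".
        key = " ".join(name.split())
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical sample: the same name listed with two authorship spellings.
records = [
    ("Quercus alba", "L."),
    ("Quercus alba", "Linnaeus"),
    ("Acer rubrum", "L."),
]
corpus_lines = unique_taxon_names(records)
```

Each entry of `corpus_lines` could then be written to corpus.txt, one name per line. Keeping first-seen order rather than sorting leaves the rest of the file in the same order as the dataset.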