Closed eubinecto closed 3 years ago
How should I approach this then?
Just replace the whitespace with an underscore when loading the idioms.
So the only changes you need to make are to the loaders in identify_idioms/loaders.
On a second look at the loaders, they are far more bloated than they should be. All they do is load some data. There is no need to implement this with the whole OOP fuss the way they are written now.
Let's keep it simple. Just rewrite them as load_smth()-like functions.
A good rule of thumb to follow: no Python code should be placed outside the library root. Let's keep things simple.
Suboptimal solution:
    def load_slide_idioms() -> List[str]:
        with open(SLIDE_TSV, 'r') as fh:
            slide_tsv = csv.reader(fh, delimiter="\t")
            # skip the header
            next(slide_tsv)
            return [
                row[0].replace(" ", "_")
                for row in slide_tsv
            ]
Just add row[0].replace(" ", "_") to every loader. It works, but it's not an elegant way of doing it.
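For the later refactor, one possible direction is to pull the replacement into a single shared helper so that each loader stays a thin function. This is only a sketch: normalise_idiom and load_first_column are hypothetical names that do not exist in the repo, and it assumes every source keeps the idiom in the first column of a delimited file. Note that it collapses runs of whitespace into one underscore, which is slightly different from a plain replace(" ", "_").

```python
import csv
import io
from typing import List, TextIO


def normalise_idiom(idiom: str) -> str:
    """Join the whitespace-separated parts of an idiom with underscores.

    Hypothetical helper; hyphenated idioms are left untouched.
    """
    return "_".join(idiom.split())


def load_first_column(fh: TextIO, delimiter: str = "\t",
                      skip_header: bool = True) -> List[str]:
    """Read the first column of a delimited file and normalise each idiom.

    Assumes the idiom sits in column 0, as in the SLIDE loader above.
    """
    reader = csv.reader(fh, delimiter=delimiter)
    if skip_header:
        # skip the header row, tolerating an empty file
        next(reader, None)
    # ignore blank rows so that row[0] is always safe to index
    return [normalise_idiom(row[0]) for row in reader if row]


# Usage with an in-memory file standing in for a real TSV (made-up rows).
sample = io.StringIO("idiom\tsource\nbeat around the bush\tx\nrun-of-the-mill\ty\n")
idioms = load_first_column(sample)
print(idioms)
```

Each concrete loader would then only open its own file and pass the handle in, so the underscore rule lives in one place.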
As of right now though, I will call this a day and close the issue. Will refactor this later.
The problem
As of right now, the lemmatised strings of the idioms contain whitespace (except the hyphenated ones) like so:
This may cause problems when the tokens are serialised into a whitespace-delimited file (e.g. KeyedVectors).
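To make the failure mode concrete, here is a simplified illustration of a whitespace-delimited "token v1 v2 ..." line format. It is not the actual KeyedVectors reader or writer; format_vector_line and parse_vector_line are hypothetical stand-ins, and the vectors are made-up values.

```python
from typing import List, Tuple


def format_vector_line(token: str, vector: List[float]) -> str:
    """Serialise a token and its vector as one space-delimited line."""
    return " ".join([token, *(str(v) for v in vector)])


def parse_vector_line(line: str) -> Tuple[str, List[float]]:
    """Parse a line by treating the first field as the token."""
    token, *values = line.rstrip("\n").split(" ")
    return token, [float(v) for v in values]


# An underscore-joined idiom survives the round trip as a single token.
good = format_vector_line("beat_around_the_bush", [0.1, 0.2])
print(parse_vector_line(good))

# An idiom containing spaces is split into several fields, so the parser
# takes "beat" as the token and then fails on float("around").
bad = format_vector_line("beat around the bush", [0.1, 0.2])
try:
    parse_vector_line(bad)
except ValueError as err:
    print("could not parse:", err)
```

A more lenient parser might not raise at all and would silently store the wrong token instead, which is arguably worse.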
Objective
To-do's
scripts