eubinecto / idiomatch

An implementation of spaCy (3.0)'s Matcher, specifically designed for identifying English idioms.

The lemmas of idioms should not contain whitespaces #9

Closed eubinecto closed 3 years ago

eubinecto commented 3 years ago

The problem

As of right now, the lemmatised strings of idioms contain whitespace (except for the hyphenated ones), like so:

tokenisation: ['You', 'are', 'down to earth', '.']
lemmatisation: ['you', 'be', 'down-to-earth', '.']
filtering: ['down-to-earth']
-----------
tokenisation: ['Have', 'you', 'found your feet', 'on', 'the', 'new', 'job', '?']
lemmatisation: ['have', 'you', "find one's feet", 'on', 'the', 'new', 'job', '?']
filtering: ["find one's feet"]

This may cause problems when the tokens are serialised into a whitespace-delimited file (e.g. KeyedVectors).
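To make the failure mode concrete, here is a minimal sketch (not idiomatch code) of how a multi-word lemma breaks a word2vec-style text format, where each line is expected to be a single token followed by its vector:

# a multi-word lemma serialised as-is into a whitespace-delimited line
line = "find one's feet 0.12 0.34 0.56"

token, *vector = line.split(" ")
print(token)   # 'find': only the first word is read back as the token
print(vector)  # ["one's", 'feet', '0.12', '0.34', '0.56']: the rest leaks into the vector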

Objective

All whitespace in the lemmas must be replaced with an underscore, e.g.:

before: "find one's feet"
after: "find_one's_feet"

To-do's

eubinecto commented 3 years ago

How should I approach this then?

Just replace the whitespace with an underscore when loading the idioms.

So the only changes needed are to the loaders in identify_idioms/loaders.

Simplifying the loaders

On a second look at the loaders, they are far more bloated than they need to be. All they do is load some data; there is no need for the whole OOP ceremony they are currently written with.

Let's keep it simple and just rewrite them as load_smth()-style functions, as sketched below.
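As a rough before-and-after sketch (the class and its contents here are stand-ins for illustration, not the actual loader code):

from typing import List

# before: a whole class whose only job is to load a list of idioms
class SlideIdiomsLoader:
    def __init__(self, path: str):
        self.path = path

    def load(self) -> List[str]:
        with open(self.path) as fh:
            return fh.read().splitlines()

# after: the same behaviour as a plain load_smth()-style function
def load_slide_idioms(path: str) -> List[str]:
    with open(path) as fh:
        return fh.read().splitlines()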

Refactoring the scripts

A good rule of thumb to follow: no Python code should be placed outside the library root. Let's keep things simple.
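For instance, under a hypothetical layout (directory and file names for illustration only):

idiomatch/
    identify_idioms/    # the library root; all Python code goes in here
        loaders.py
        ...
    data/               # non-code assets may live outside the library root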

Replacing the whitespace with an underscore

A suboptimal solution:

import csv
from typing import List

def load_slide_idioms() -> List[str]:
    # SLIDE_TSV: the path to the SLIDE idiom lexicon, defined elsewhere in the package
    with open(SLIDE_TSV, 'r') as fh:
        slide_tsv = csv.reader(fh, delimiter="\t")
        next(slide_tsv)  # skip the header
        return [
            row[0].replace(" ", "_")  # replace whitespace in each idiom with an underscore
            for row in slide_tsv
        ]

Just add row[0].replace(" ", "_") to every loader. It works, but it is not an elegant way of doing it, since the same replacement ends up duplicated across all the loaders.
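One way the duplication could be factored out later (a sketch only; underscored is a hypothetical helper, not something the library currently has) is a small decorator that applies the replacement to whatever a loader returns:

import csv
from functools import wraps
from typing import Callable, List

def underscored(loader: Callable[..., List[str]]) -> Callable[..., List[str]]:
    # wraps a loader so that whitespace in every idiom it returns
    # is replaced with an underscore, keeping the rule in one place
    @wraps(loader)
    def wrapper(*args, **kwargs) -> List[str]:
        return [idiom.replace(" ", "_") for idiom in loader(*args, **kwargs)]
    return wrapper

@underscored
def load_slide_idioms() -> List[str]:
    with open(SLIDE_TSV, 'r') as fh:  # SLIDE_TSV as in the snippet above
        slide_tsv = csv.reader(fh, delimiter="\t")
        next(slide_tsv)  # skip the header
        return [row[0] for row in slide_tsv]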

For now though, I will call it a day and close the issue. I will refactor this later.