Develop Python/R Address Stemming for Deduplication

maxachis commented 4 years ago

"Stemming" refers to reducing words to their lexical base, or word stem. For example, "likely", "likeable", and "likelihood" all have a word stem of "like". More information on how this works can be found at https://www.geeksforgeeks.org/introduction-to-stemming/, among other resources.

As far as the Food Access Map is concerned, stemming could be useful for deduplication. Many road types have abbreviated forms (e.g. "Street" as "St.", "Boulevard" as "Blvd", "Court" as "Ct."), and there are known duplicates in our dataset whose street addresses are identical except that one uses the abbreviated form of a word and the other does not (e.g. "5050 Liberty Ave" and "5050 Liberty Avenue" exist in distinct rows in our dataset).

We could use a method, preferably developed in R but possibly in Python, that takes addresses and returns their "stemmed" forms, with the addresses all abbreviated (or all unabbreviated) so that duplicates are easier to identify. The stemmed addresses could then either be added to the dataset as an additional column or replace an existing column (or be used in some other way entirely).
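
To make the idea concrete, here is a rough Python sketch of what such a method might look like. It is only an illustration under stated assumptions: the `STREET_ABBREVIATIONS` mapping and the `normalize_address` function are hypothetical names, the mapping is far from complete, and this is not the script that was eventually merged.

```python
import re

# Hypothetical, deliberately incomplete mapping of street-type words
# to a single abbreviated form. A real list would need many more entries.
STREET_ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "boulevard": "blvd",
    "court": "ct",
    "road": "rd",
    "drive": "dr",
}


def normalize_address(address: str) -> str:
    """Lowercase an address, drop punctuation, and abbreviate known street types."""
    # Remove anything that is not a word character or whitespace (e.g. the "." in "St.").
    cleaned = re.sub(r"[^\w\s]", "", address.lower())
    tokens = cleaned.split()
    # Replace each token that has a known abbreviation; leave the rest unchanged.
    # Caveat: this also abbreviates such words when they are part of a street
    # name, e.g. "Court Street" becomes "ct st".
    return " ".join(STREET_ABBREVIATIONS.get(token, token) for token in tokens)


addresses = ["5050 Liberty Ave", "5050 Liberty Avenue", "100 Main St."]
for original in addresses:
    print(original, "->", normalize_address(original))
```

With this sketch, "5050 Liberty Ave" and "5050 Liberty Avenue" both normalize to the same string, so an exact match on the normalized value would flag them as candidate duplicates. The normalized value could be stored in a new column so the original address is preserved.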

Whether by using an existing package or developing this ourselves, such stemming would ease the difficulties in our deduplication process.

A sub-issue of #16 - Dedupe food stores in merged dataset

maxachis commented 4 years ago

Oscar indicated his interest in working on this, but others can feel free to take a swing at this as well.

hellonewman commented 3 years ago

Catalina, adding you as you said you wanted more info on Oscar's PR.

maxachis commented 3 years ago

Got myself a little confused this morning looking at this and going "Why did I close this initially?"

The reason was that Oscar made a pull request for the script, which I accepted and added to the repository. With that being said, I'll close this again :D .

CodeForPittsburgh / food-access-map-data

Develop Python/R Address Stemming for Deduplication #48