AlexanderSenf / ucla_data_processing

An example for data processing
Apache License 2.0
Implement question 3 #3

Closed AlexanderSenf closed 3 years ago

AlexanderSenf commented 3 years ago

Add implementation for question 3.

If a product code is not recognized, a Levenshtein-based similarity score is calculated between the code from the sales file and each existing product code key. The first key with a score of 75 or higher is substituted for the unrecognized code. This allows for automatic correction of cases where at most 1 character is incorrect. (This obviously doesn't account for cases where more than one product code key has a score of >= 75: the first such key is used, even if a later key is a closer match.)

This approach is also not the most efficient one, since each unrecognized code is compared against the existing keys one by one, but it demonstrates the principle of dealing with errors in the input.
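
For illustration, the matching described above could be sketched roughly as follows. This is not the code from the repository: the function names (`levenshtein`, `similarity`, `resolve_code`), the sample product codes, and the way the score is normalized to a 0 to 100 scale (100 × (1 − distance / length of the longer string)) are all assumptions made for this example. The actual implementation may compute its score differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed scoring: 100 for identical strings, lower as edits increase."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def resolve_code(code, known_codes, threshold=75):
    """Return a known key for `code`, or None if nothing is close enough."""
    if code in known_codes:
        return code
    # Linear scan: the first key that reaches the threshold is used,
    # even if a later key would score higher.
    for key in known_codes:
        if similarity(key, code) >= threshold:
            return key
    return None


# Hypothetical product codes, for demonstration only.
products = {"AB12": "Widget", "CD34": "Gadget"}
corrected = resolve_code("AB13", products)  # one wrong character
unknown = resolve_code("ZZ99", products)    # nothing similar enough
```

With four-character codes and this scoring, a single wrong character gives a score of exactly 75, which is why the threshold tolerates one error. Returning the best-scoring key instead of the first one would handle the ambiguous case, at the cost of always scanning every key.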