KatrinaHoffert / EatSafe

An app for finding safe places to eat

Translation system for CSV parser #67

Closed KatrinaHoffert closed 9 years ago

KatrinaHoffert commented 9 years ago

Once we have parsed a CSV row and know the name, address, and city, but before writing the SQL, we should check a translation file to see whether an alternative name has been provided. If so, we use the translated name in the generated SQL.

This allows us to fix locations that have the addresses in their names.

The translation file can just be a JSON file like:

[
  {
    "match": {
      "name": "11 Seven - Springfield",
      "address": "123 Fake St",
      "city": "Saskatoon"
    },
    "replace": {
      "name": "11 Seven"
    }
  },
  ...
]

That is, it's an array of objects, each pairing a location to match (we only need the name, address, and city to ensure a unique match) with the values to replace (just the name here, but we could easily extend this to allow for translating the address, city, postal code, and RHA in the future).
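As a rough illustration of the lookup described above, here is a minimal Python sketch. It is not the project's actual parser code: the function names `build_index` and `translate`, and the representation of a parsed row as a dict, are assumptions made for the example. It indexes the translation entries by (name, address, city) and overlays any replacement fields onto a matching row.

```python
import json


def build_index(translations):
    """Map each (name, address, city) triple to its dict of replacement fields."""
    index = {}
    for entry in translations:
        match = entry["match"]
        key = (match["name"], match["address"], match["city"])
        index[key] = entry["replace"]
    return index


def translate(location, index):
    """Return a copy of the row with replacements applied, or unchanged if no match."""
    key = (location["name"], location["address"], location["city"])
    return {**location, **index.get(key, {})}


# Hypothetical translation file contents, using the entry from the issue.
translations = json.loads("""
[
  {
    "match": {
      "name": "11 Seven - Springfield",
      "address": "123 Fake St",
      "city": "Saskatoon"
    },
    "replace": {
      "name": "11 Seven"
    }
  }
]
""")

index = build_index(translations)
row = {"name": "11 Seven - Springfield", "address": "123 Fake St", "city": "Saskatoon"}
translated = translate(row, index)
```

Because the replacement is a plain dict overlay, adding keys such as an address or postal code to "replace" would need no code changes in this sketch.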

This is done at the CSV parser level, so as far as our application is concerned, the database is unmodified and just has cleaner data. The point is to correct the bad naming in the source data (which often duplicates the address in the name).

This approach ensures that when we need to rerun the autodownloader and CSV parser, we won't lose our corrections (as we would if we tried to directly modify our database).

KatrinaHoffert commented 9 years ago

Note: the translation system itself is pretty straightforward to implement. The more time-consuming part is creating the translations. We should start by trying to get a rough estimate of how many we'd need to do (it seems to be mostly chains that have these bad names).

It might be worthwhile to create a tool that lists all the locations by name (from our database), provides the means to change the location names, and generates the translation file from those changes. This isn't worth investing time into unless there are a lot of locations to rename.

We also might be able to do some clever text parsing to find most of the cases (since it seems most common to have the real name, a dash, and the address or neighborhood).
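One possible shape for that text parsing, sketched in Python under the assumption that bad names look like "real name, spaced dash, address or neighborhood". The names `suggest_name` and `candidate_translations` are hypothetical, and the output is only a list of candidates that someone would still need to review before it goes into the translation file.

```python
import json
import re

# A base name, a dash surrounded by whitespace, then a trailing suffix.
# Requiring whitespace around the dash avoids splitting names like "Co-op Cafe".
DASH_SUFFIX = re.compile(r"^(?P<base>.+?)\s+-\s+(?P<suffix>.+)$")


def suggest_name(name):
    """Return the part before the first spaced dash, or None if there is none."""
    m = DASH_SUFFIX.match(name)
    return m.group("base") if m else None


def candidate_translations(locations):
    """Build translation entries for every location whose name has a dash suffix."""
    entries = []
    for loc in locations:
        suggestion = suggest_name(loc["name"])
        if suggestion is not None:
            entries.append({
                "match": {
                    "name": loc["name"],
                    "address": loc["address"],
                    "city": loc["city"],
                },
                "replace": {"name": suggestion},
            })
    return entries


# Made-up sample rows for illustration only.
locations = [
    {"name": "11 Seven - Springfield", "address": "123 Fake St", "city": "Saskatoon"},
    {"name": "Co-op Cafe", "address": "1 Main St", "city": "Regina"},
]
candidates = candidate_translations(locations)
candidates_json = json.dumps(candidates, indent=2)
```

A heuristic like this will produce false positives (a legitimate name that contains a spaced dash), which is why the sketch emits suggestions rather than applying them directly.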

At any rate, we don't need perfection. We just need a system that can perform the translations, so that in the future, we can easily correct bad names when we find them (or when real world users point them out).

KatrinaHoffert commented 9 years ago

Removing from M3 for now because changes to requirements make it seem that there might not be time for this. Can be added back if time allows.

LujieDuan commented 9 years ago

The translation function has been implemented in the CSV parser. We still need to populate the translation file, which is "translation.txt" in the database folder.