djay / covidthailand

Thailand Covid testing and case data gathered and combined from various sources for others to download or view

combine sources of province names into local province_mapping.csv #20

Closed djay closed 3 years ago

pmdscully commented 3 years ago

Okay, I think this change moves the Python dicts into a CSV to make them easier to manage, which is a reasonable way forward.

However, there are code problems at the moment:

  1. the large number of entries makes this still quite unclear to me.
  2. the inconsistent column data is unclear to me, i.e. Thai/Eng/Abbr in the first column, English spelling variations in the second column.

Actual Problems with Province Translation / Mapping Task:

Base Solution via Real Names

In Thailand, there are 77 provinces. Plus we can include the Correctional Prisons reporting, so 78 entries. There are formal names for the Thai and English variations of those 77 here on wiki. That includes the 3-letter "Harmonized System" naming for tax, etc.

Problematic Abbreviations of Names

In some cases, RB has used abbreviations in the past. As far as I am aware, there's no new data to collect from RB these days. However, perhaps I am missing something.

Which solution?

I believe you are working towards putting these abbreviations into X -> Y,Z,A,B,C mappings, which would give a total of 78 entries.

Problematic Parsed Names

Spelling variations, as extracted from the parsed PDFs (i.e. spelling variations in Thai), have caused problems in the past.

Solution to Problematic Parsed Names

In these non-direct mapping cases, we can use similarity matching. As we only have strings (no additional contextual data to use), difflib seems the reasonable choice. I have written about using difflib with dict mappings here.
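
For illustration, here is a minimal sketch of that difflib approach. The mapping dict, the names in it, and the 0.8 cutoff are illustrative assumptions rather than the project's actual values:

    import difflib

    # Toy canonical mapping; the real data has far more entries.
    PROVINCE_MAPPING = {
        "Bangkok": "Bangkok",
        "Chiang Mai": "Chiang Mai",
        "Mae Hong Son": "Mae Hong Son",
    }

    def fuzzy_province(name, mapping=PROVINCE_MAPPING, cutoff=0.8):
        """Map a raw name to its canonical form, falling back to fuzzy matching."""
        if name in mapping:
            return mapping[name]
        # Best near-match above `cutoff`, ranked by difflib's similarity ratio.
        matches = difflib.get_close_matches(name, mapping.keys(), n=1, cutoff=cutoff)
        return mapping[matches[0]] if matches else None

    print(fuzzy_province("Chiangmai"))  # -> "Chiang Mai"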

djay commented 3 years ago

@pmdscully I'm not really clear on what you are saying. I incorporated your difflib code a long time ago: https://github.com/djay/covidthailand/blob/main/utils_thai.py#L485 https://github.com/djay/covidthailand/blob/main/utils_pandas.py#L96

I think I might be answering your points below.

I'm not sure it's important to separate out all the alternative names into abbreviations, eng, etc. They are all just alternatives. It can be done later, though.

As for picking a better canonical name... I may not have picked the best one. Happy to change it to another set. However, keep in mind that all the current API exports use the current one I have now, i.e. ProvinceEn in prov_mappings.csv.

djay commented 3 years ago

@pmdscully is the confusion just about the column names in prov_mapping? If it was "Alternative Province Name" -> "Canonical Province Name", would that help?

pmdscully commented 3 years ago

However, there are code problems at the moment:

  1. the large number of entries makes this still quite unclear to me.
  2. the inconsistent column data is unclear to me, i.e. Thai/Eng/Abbr in the first column, English spelling variations in the second column.

To restate those as expected vs. actual:

  1. Expected: The unique number of entries in the list of provinces should be 78. Actual: >78.
  2. Expected format: [CanonicalName,Othername,Othername,...]. Actual format: [Thai/Eng/Abbr, Othername].

I hope that is now clear.

pmdscully commented 3 years ago

Ideally, we'd just pull in the Wikipedia table and add 1 new record for prisons. Then add a new column for colloquial abbreviations (probably stored in a list format, to later eval() or .explode(); see the sketch further below).

As far as I can see, that's the initial data needed to do province translations (both ways).

So, the ideal format is either:

  1. CanonicalName_EN,ProvNameTH,..(others from wikipedia table as desired)..,,AbbrList
  2. CanonicalName_TH,ProvNameEN,..(others from wikipedia table as desired)..,,AbbrList

This approach seems the least likely to cause confusion and the least exposed to unseen bugs.
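
As a sketch of what that AbbrList handling could look like (the CanonicalName_EN and AbbrList column names are hypothetical here, and ast.literal_eval stands in as a safer eval()):

    import ast
    import pandas as pd

    # Hypothetical rows in the proposed format; AbbrList is stored as a list literal.
    df = pd.DataFrame({
        "CanonicalName_EN": ["Ang Thong", "Amnat Charoen"],
        "AbbrList": ["['ATG']", "['ACR', 'AmnatCharoen']"],
    })

    # Parse the stored list strings back into Python lists.
    df["AbbrList"] = df["AbbrList"].apply(ast.literal_eval)

    # One row per abbreviation, giving an Abbr -> CanonicalName_EN mapping.
    exploded = df.explode("AbbrList")
    abbr_to_canonical = dict(zip(exploded["AbbrList"], exploded["CanonicalName_EN"]))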

pmdscully commented 3 years ago

Here's an example of code that reads in the Wiki province table and then creates the two translation dictionaries:
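
(The code itself lives in the linked Colab and was not captured in this thread. As a stand-in, here is a minimal sketch of the idea; the page URL is real, but the table index and the "Name"/"Thai" column labels are assumptions about the Wikipedia layout.)

    import pandas as pd

    # Read the provinces table from Wikipedia (table index assumed).
    url = "https://en.wikipedia.org/wiki/Provinces_of_Thailand"
    provinces = pd.read_html(url)[0][["Name", "Thai"]]

    # The two translation dictionaries, one per direction.
    en_to_th = dict(zip(provinces["Name"], provinces["Thai"]))
    th_to_en = dict(zip(provinces["Thai"], provinces["Name"]))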

djay commented 3 years ago

2. Expected format: [CanonicalName,Othername,Othername,...]. Actual format: [Thai/Eng/Abbr, Othername].

I don't see what that gains. It's the same information, just in a more complicated format. Putting the data into columns doesn't make anything clearer, because some provinces have 20 misspellings and some have 2.

djay commented 3 years ago

@pmdscully you can change it to a CSV 78 lines long, but I'm not sure what it buys you. Keep in mind the canonical names come from this code:

    # Download (and cache) the health-district page HTML.
    url = "https://en.wikipedia.org/wiki/Healthcare_in_Thailand#Health_Districts"
    file, _ = next(web_files(url, dir="html", check=False))
    # Parse the first HTML table, then split each comma-separated province
    # list so every province gets its own row.
    areas = pd.read_html(file)[0]
    provinces = areas.assign(Provinces=areas['Provinces'].str.split(", ")).explode("Provinces")

not out of the CSV. So there are already checks to make sure it has the right provinces in it. The CSV is just for alternative names, not to keep all canonical info about provinces.

One thing the current format does is make diffs nicer as you add alternative spellings.

pmdscully commented 3 years ago
url = "https://en.wikipedia.org/wiki/Healthcare_in_Thailand#Health_Districts"

However, as far as I can see, that table does not give Thai names (and so no translation mappings), which does seem rather important...

pmdscully commented 3 years ago
  1. Expected format: [CanonicalName,Othername,Othername,...]. Actual format: [Thai/Eng/Abbr, Othername].

I don't see what that gains. It's the same information, just in a more complicated format. Putting the data into columns doesn't make anything clearer, because some provinces have 20 misspellings and some have 2.

Consistency (and perhaps added clarity?), which buys less opportunity for bugs.

djay commented 3 years ago

What bugs are you concerned about? There is a list of mappings from other names for provinces to a canonical English name.

pmdscully commented 3 years ago

Bugs, e.g.:

  1. Translation mapping errors :-)
  2. Duplications of entries / text (copy-paste and duplication bugs), including unexpected typos.

Non-duplication closes the opportunity for duplication-related errors...

Occam's razor stuff....?

pmdscully commented 3 years ago

Which lines of code are the TH-EN or EN-TH province mappings carried out on?

djay commented 3 years ago

There isn't a need to translate English into Thai. Just translate alternative spellings into a canonical name (for which I chose English, to make it easier for me).

djay commented 3 years ago

@pmdscully don't get me wrong though. Happy for a PR to turn this into a table of 78 lines with the alternate names exploded. But also happy for it to stay as it is for now.

pmdscully commented 3 years ago

Happy for a PR to turn this into a table of 78 lines with the alternate names exploded. But also happy for it to stay as it is for now.

Cool, let's (/let me) put together this 78-line alternative via that Colab. We can check it and compare before deciding whether to include it or not.

I'll add in:

Anything else?

pmdscully commented 3 years ago

I've updated the Colab, which now produces a dataframe / csv, with:

Changes:

Printing the new df has highlighted a few points:

@djay - Take a look and see if I am missing anything..

djay commented 3 years ago

@pmdscully looks good. I think maybe you don't need the [ ] in the alternative name column? Could you use a simpler separator that works with str.split then explode?

The countries have to be mapped to Unknown, unfortunately, because they appear in the scraped data.

So is your plan now to do a new PR that commits this file and code to load and explode the mapping so scraping works as it did before?

pmdscully commented 3 years ago

@djay - I've added a class to perform the lookups.

How to Perform Province Lookups with this DF / Dataset:

  1. Init an object to retain the mappings data: o = ProvinceNamesLookup()
  2. Call o.do_lookup() to create the mapping dicts and retain them (on first call) in the object. This will try:
    • (i) key mapping on raw names in: Eng, Thai, Alt names.
    • (ii) key mapping on cleaned names (as in (i)), cleaned via lower().replace(' ','').
    • (iii) key mapping via similarity. <- This needs to be implemented using the existing difflib code.

Usage Example:

province_lookup = ProvinceNamesLookup()
prov = "Bangkok"
province_lookup.do_lookup(prov, df.copy().reset_index())
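
(The class itself lives in the Colab rather than in this thread. Purely to illustrate the three steps above, here is a rough sketch; the Name/Thai/AltNames column labels are assumptions, and AltNames is assumed to hold one string per row, e.g. after an explode.)

    import difflib

    class ProvinceNamesLookup:
        """Rough sketch of the lookup steps described above."""

        def __init__(self):
            self.raw = None      # raw name -> canonical name
            self.cleaned = None  # cleaned name -> canonical name

        def _build(self, df):
            # (i) key on raw names from the English, Thai and alt-name columns.
            self.raw = {}
            for col in ("Name", "Thai", "AltNames"):
                self.raw.update(zip(df[col], df["Name"]))
            # (ii) key on cleaned names.
            self.cleaned = {str(k).lower().replace(" ", ""): v
                            for k, v in self.raw.items()}

        def do_lookup(self, name, df):
            if self.raw is None:  # build the mapping dicts on the first call only
                self._build(df)
            if name in self.raw:
                return self.raw[name]
            cleaned = name.lower().replace(" ", "")
            if cleaned in self.cleaned:
                return self.cleaned[cleaned]
            # (iii) similarity fallback via difflib, as used elsewhere in the repo.
            match = difflib.get_close_matches(cleaned, self.cleaned.keys(),
                                              n=1, cutoff=0.6)
            return self.cleaned[match[0]] if match else None
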
pmdscully commented 3 years ago

@pmdscully looks good. I think maybe you don't need the [ ] in the alternative name column? Could you use a simpler separator that works with str.split then explode?

From a Pandas and generic Python point of view, I'd leave it as is, because Pandas' internal storage and recall just works for "list-like objects". But if you prefer to use a custom format, please go ahead. (I've added you as an editor if you want to add this or play about with it.)

The countries have to be mapped to Unknown, unfortunately, because they appear in the scraped data.

Okay, I thought there must be a reason for this, so I made no changes. An option for later is simply to get a table of country names and add them into the Unknown's altnames.

So is your plan now to do a new PR that commits this file and code to load and explode the mapping so scraping works as it did before?

Sadly, I am still not sure about the integrations for this. I have been successful in encapsulating this functionality, i.e.:

  1. The CSV can be exported if preferred, rather than the whole parsing process.
  2. The object that performs the lookup is isolated, though it will need your (difflib) similarity function integrated.

Remaining Tasks:

  1. Include difflib similarity function.
  2. Unit test do_lookup() to ensure mappings are as expected. (Expected: Returns "Canonical Name", i.e. from Name column).
  3. Copy in the exported CSV.
  4. Copy in the class and object calling code.
  5. Integrate object calling code.
  6. Integration tests.

djay commented 3 years ago

@pmdscully it doesn't need a lookup function. All of that works well. It just needs to import your CSV and turn it into the same mapping df I use now. That is the least amount of changes.

djay commented 3 years ago

@pmdscully i.e. just replace the code in this function up until https://github.com/djay/covidthailand/blob/main/utils_thai.py#L215 and ensure the resulting df has the same column names and that you have used the same canonical names, and it will just work.
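
(A rough sketch of that conversion, under assumptions: the new CSV has one row per province with hypothetical ProvinceEn, HealthDistrictNumber and pipe-separated AltNames columns, and the target shape matches the df printed further down this thread.)

    import pandas as pd

    # Sketch only: the CSV layout and column names here are assumptions.
    wide = pd.read_csv("province_mapping.csv")  # one row per province (78 rows)

    # Split the separator-joined alternative names into one row per alt name.
    exploded = (wide.assign(AltNames=wide["AltNames"].str.split("|"))
                    .explode("AltNames"))

    # Index by alternative name, keeping the columns the existing code expects;
    # the existing df keeps Health District Number as a str.
    mapping = (exploded
               .rename(columns={"HealthDistrictNumber": "Health District Number"})
               [["AltNames", "ProvinceEn", "Health District Number"]]
               .set_index("AltNames"))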

pmdscully commented 3 years ago

@pmdscully i.e. just replace the code in this function up until https://github.com/djay/covidthailand/blob/main/utils_thai.py#L215 and ensure the resulting df has the same column names and that you have used the same canonical names, and it will just work.

Okay, cool.

In that case, I would think we can replace up until this line: https://github.com/djay/covidthailand/blob/25fe9c54f12ca1470edbe333fbb947334ff86f4a/utils_thai.py#L226

That line is adding populations per province; the Wikipedia provinces table already has a column for this.

@pmdscully it doesn't need a lookup function. All of that works well. It just needs to import your CSV and turn it into the same mapping df I use now. That is the least amount of changes.

Okay.

My Checks:

If you have those categories of mappings, then I think we're good to go.

Thai -> Eng:

Alt-Name -> Eng:

pmdscully commented 3 years ago

@djay - As you say, the column names will need changing.

Can you do print(df.columns) for that df and post it here, so we can rename and verify the new CSV columns?

djay commented 3 years ago

@pmdscully you are probably going to have to run it to submit a working PR anyway, but the table looks like:

                               Area of Thailand Health District Number                ProvinceEn
ACR            Lower Northeastern Region Area 2                     10             Amnat Charoen
ATG                 Upper Central Region Area 2                      4                 Ang Thong
AYA                 Upper Central Region Area 1                      4  Phra Nakhon Si Ayutthaya
Amnat Charoen  Lower Northeastern Region Area 2                     10             Amnat Charoen
AmnatCharoen   Lower Northeastern Region Area 2                     10             Amnat Charoen
...                                         ...                    ...                       ...
เวียงจันทร์                                                    Unknown                   Unknown
แพร่               Upper Northern Region Area 2                      1                     Phrae
แมฮ่อ่งสอน         Upper Northern Region Area 1                      1              Mae Hong Son
แม่ฮ่องสอน         Upper Northern Region Area 1                      1              Mae Hong Son
ไม่ระบุ                                                        Unknown                   Unknown

[540 rows x 3 columns]

Area of Thailand is unused, so ignore that. Health District Number is a str, also.

djay commented 3 years ago

@pmdscully are you able to create a new PR?

pmdscully commented 3 years ago

@djay - I propose to do the following (isolated) tasks, and then I think you can best finalise the integration and its decisions.

djay commented 3 years ago

@pmdscully not really sure I understand. Is there a reason you can't just create a PR that runs, which I can review? If I understand what you're proposing above, then I think you would end up checking in a similar CSV to what I already have? All you need to do is create a PR where you replace my CSV with your CSV, and then add a bit of code to convert it into a mapping from index -> ProvinceEn and health district number.

pmdscully commented 3 years ago

@pmdscully not really sure I understand. Is there a reason you can't just create a PR that runs, which I can review?

@djay Ahh, now I remember..... It's because it does not run on Windows (without about 2-3 filename compatibility changes). I recommended these changes a month ago or so, but you did not include them because there were concerns they would affect named caching files. These are things like removing *, ? and \\ (or //, I forget now) characters from all filenames.

So, the reason is that I cannot run it without multiple other changes that I expect will not get included; thus I expect I will repeat the fix each time I git clone, and yet not commit that change...

If Windows compatibility can be included (this might break other compatibilities, and I can't know, because we don't have unit tests), then I can run, trial, and make changes more easily, and then I can make a PR. Otherwise, I'm fishing in the dark.

Okay, so:

djay commented 3 years ago

@pmdscully sorry, I put in all the os.path changes but forgot about the filename cache changes. I can put those in, I think.

djay commented 3 years ago

@pmdscully fixed. The cache was only an issue if most of it had to be redownloaded. That's not the case.

pmdscully commented 3 years ago

Thanks @djay I'll take another git pull and another look.

pmdscully commented 3 years ago

@pmdscully fixed. The cache was only an issue if most of it had to be redownloaded. That's not the case.

@djay - So I ran the latest committed version; however, I hit this error again:

Exporting: risk_groups
Exporting: risk_groups_unmatched
========ArcGIS==========
Missing: json\query?f=json&where=1%3D1&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=*&resultOffset=0&resultRecordCount=1000&cacheHint=true
Download: json\query?f=json&where=1%3D1&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=*&resultOffset=0&resultRecordCount=1000&cacheHint=true
Traceback (most recent call last):
  File ".\covid_data.py", line 2004, in <module>
    scrape_and_combine()
  File ".\covid_data.py", line 1976, in scrape_and_combine
    hospital = get_hospital_resources()
  File ".\covid_data.py", line 1940, in get_hospital_resources
    file, content = next(web_files(every_district, dir="json", check=True))
  File "\covidthailand\utils_scraping.py", line 190, in web_files
    with open(file, "wb") as f:
OSError: [Errno 22] Invalid argument: 'json\\query?f=json&where=1%3D1&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=*&resultOffset=0&resultRecordCount=1000&cacheHint=true'

So I applied my fix for Windows Filename Compatibility.

Here's the Windows Filename Compatibility fix:

utils_scraping.py -> Line 184:

file = ''.join([f for f in file if f not in '?*:<>|'])  # Windows Filename Compatibility: strip characters that are invalid in Windows filenames.

I can successfully complete a full run when running locally on Windows.

Included it in the PR:

Link to PR: https://github.com/djay/covidthailand/pull/28