Closed djay closed 3 years ago
@pmdscully not really clear what you are saying. I incorporated your difflib code a long time ago. https://github.com/djay/covidthailand/blob/main/utils_thai.py#L485 https://github.com/djay/covidthailand/blob/main/utils_pandas.py#L96
I think I might be answering your points below.
I'm not sure it's important to separate out all the alternative names into abbreviations, English, etc. They are all just alternatives. It can be done later though.
As for picking a better canonical name... I may not have picked the best one. Happy to change it to another set. However, keep in mind that all the current API exports use the one I have now, i.e. ProvinceEn in prov_mappings.csv.
@pmdscully is the confusion just about the column names in prov_mapping? If it was "Alternative Province Name" -> "Canonical Province Name", would that help?
However, code problems at the moment:
- The large number of entries makes this still quite unclear to me.
- The inconsistent column data is unclear to me - i.e. Thai/Eng/Abbr in the first column, an English spelling variation in the second column.
- Expected format: [CanonicalName,Othername,Othername,...]. Actual format: [Thai/Eng/Abbr, Othername].
I hope that is now clear.
Ideally, we'd just pull in the Wikipedia table and add 1 new record for prisons, then add a new column for colloquial abbreviations (probably stored in a list format, to later eval() or .explode()).
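The eval()/.explode() idea above can be sketched as follows. The column names and sample rows are made up for illustration, and ast.literal_eval is used in place of eval() as the safer way to parse list literals:

```python
import ast
import pandas as pd

# Hypothetical table: one row per province, with abbreviations stored as a
# Python-list literal in the CSV (column names are made up for illustration).
df = pd.DataFrame({
    "ProvinceEn": ["Bangkok", "Chiang Mai"],
    "AbbrList": ["['BKK', 'กทม']", "['CNX']"],
})

# ast.literal_eval is the safer stand-in for eval() when parsing list literals.
df["AbbrList"] = df["AbbrList"].apply(ast.literal_eval)

# .explode() then gives one row per (province, abbreviation) pair.
long = df.explode("AbbrList")
```
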
As far as I can see, that's the initial data needed to do province translations (both ways). So, the ideal format is either:
CanonicalName_EN,ProvNameTH,..(others from wikipedia table as desired)..,,AbbrList
or
CanonicalName_TH,ProvNameEN,..(others from wikipedia table as desired)..,,AbbrList
This approach seems like the least cause of confusion, and the least exposure to unseen bugs.
Here's an example of code for reading in the Wiki province table and then creating the two translation dictionaries:
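The Colab code itself isn't reproduced in this thread; a minimal sketch of the idea, using a small stand-in frame so it is self-contained (in the real code the frame would come from pd.read_html on the Wikipedia provinces page, and the "NameEn"/"NameTh" column names here are assumptions, not the actual wiki headers):

```python
import pandas as pd

# Stand-in for the Wikipedia provinces table; the real code would load it with
# something like pd.read_html("https://en.wikipedia.org/wiki/Provinces_of_Thailand").
provinces = pd.DataFrame({
    "NameEn": ["Bangkok", "Chiang Mai", "Phrae"],
    "NameTh": ["กรุงเทพมหานคร", "เชียงใหม่", "แพร่"],
})

# The two translation dictionaries, one per direction.
en_to_th = dict(zip(provinces["NameEn"], provinces["NameTh"]))
th_to_en = dict(zip(provinces["NameTh"], provinces["NameEn"]))
```
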
2. Expected format: [CanonicalName,Othername,Othername,...]. Actual format: [Thai/Eng/Abbr, Othername].
I don't see what that gains. It's the same information, just in a more complicated format. Putting the data into columns doesn't make anything clearer because some provinces have 20 misspellings and some have 2.
@pmdscully you can change it to a csv 78 lines long, but I'm not sure what it buys you. Keep in mind the canonical names come from this code:
url = "https://en.wikipedia.org/wiki/Healthcare_in_Thailand#Health_Districts"
file, _ = next(web_files(url, dir="html", check=False))
areas = pd.read_html(file)[0]
provinces = areas.assign(Provinces=areas['Provinces'].str.split(", ")).explode("Provinces")
not out of the CSV. So there are already checks to make sure it has the right provinces in. The CSV is just for alternative names, not to keep all canonical info about provinces.
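The checks mentioned here aren't shown in the thread; a minimal sketch of what such a consistency check could look like, with stand-in data and assumed column names:

```python
import pandas as pd

# Stand-in for the canonical province list scraped from the
# Healthcare_in_Thailand page (the real code uses pd.read_html on that URL).
canonical = {"Amnat Charoen", "Ang Thong", "Phrae"}

# Stand-in for the alternatives CSV: alternative spelling -> canonical name.
mapping = pd.DataFrame({
    "Name": ["ACR", "ATG", "แพร่"],
    "ProvinceEn": ["Amnat Charoen", "Ang Thong", "Phrae"],
})

# Every canonical name used in the CSV must be a scraped canonical province.
unknown = set(mapping["ProvinceEn"]) - canonical
assert not unknown, f"CSV uses non-canonical names: {unknown}"
```
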
One thing the current format does is make diffs nicer as you add alternative spellings
url = "https://en.wikipedia.org/wiki/Healthcare_in_Thailand#Health_Districts"
However, as far as I can see, the table does not give Thai names (and so no translation mappings), which does seem rather important...
- Expected format: [CanonicalName,Othername,Othername,...]. Actual format: [Thai/Eng/Abbr, Othername].
I don't see what that gains. It's the same information, just in a more complicated format. Putting the data into columns doesn't make anything clearer because some provinces have 20 misspellings and some have 2.
Consistency (and perhaps clarity?), which buys less opportunity for bugs.
What bugs are you concerned about? There is a list of mappings from other names for provinces to a canonical english name.
Bugs such as duplication-related errors: non-duplication closes off the opportunity for them... Occam's razor stuff?
Which lines of code are the TH-EN or EN-TH province mappings carried out on?
There isn't a need to translate English into Thai, just to translate alternative spellings into a canonical name (for which I chose English, to make it easier for me).
@pmdscully don't get me wrong though. Happy for a PR to turn this into a table of 78 lines with explode the alternate names. But also happy for it to stay as it is for now.
Happy for a PR to turn this into a table of 78 lines with explode the alternate names. But also happy for it to stay as it is for now.
Cool, let's (/let me) put together this 78 line alternative via that Colab. We can check it and compare before deciding to include or not.
I'll add in:
Anything else?
I've updated the Colab, which now produces a dataframe / csv. Printing the new df has highlighted a few points: .lower() and replace(spaces) are applied to the original province names in Thai and Eng.
@djay - Take a look and see if I am missing anything..
@pmdscully looks good. I think maybe you don't need the [ ] in the alternative name column? could use a simpler separator that works with str.split then explode?
The countries have to be mapped to Unknown unfortunately, because they appear in the scraped data.
So is your plan now to do a new PR that commits this file and code to load and explode the mapping so scraping works as it did before?
@djay - I've added a class to perform the lookups. Create o = ProvinceNamesLookup() and call o.do_lookup() to create the mapping dicts, which are retained (on first call) in the object. The lookup will try lower().replace(' ','') and then the difflib code. For example:
province_lookup = ProvinceNamesLookup()
prov = "Bangkok"
province_lookup.do_lookup(prov, df.copy().reset_index())
@pmdscully looks good. I think maybe you don't need the [ ] in the alternative name column? could use a simpler separator that works with str.split then explode?
From a Pandas and Generic Python point of view, I'd leave as is - because Pandas internal storage and recall just works for "list-like objects". But, if you prefer to use a custom format, please go ahead. (I've added you as an editor if you want to add this or play about with it.)
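For comparison, the separator-based column djay suggests could look like the following sketch; the "|" separator and sample rows are arbitrary choices for illustration:

```python
import pandas as pd

# Alternative names stored with a plain separator instead of a list literal;
# "|" is an arbitrary choice that won't appear in province names.
df = pd.DataFrame({
    "ProvinceEn": ["Amnat Charoen", "Ang Thong"],
    "AltNames": ["ACR|AmnatCharoen", "ATG"],
})

# str.split produces real lists, then explode gives one row per alt name.
long = df.assign(AltNames=df["AltNames"].str.split("|")).explode("AltNames")
```
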
The countries have to be mapped to Unknown unfortunately, because they appear in the scraped data.
Okay, I thought there must be a reason for this, so I made no changes. An option for later is simply to get a table of country names and add it into Unknown's altnames.
So is your plan now to do a new PR that commits this file and code to load and explode the mapping so scraping works as it did before?
Sadly, I am still not sure about the integrations for this. I have been successful in encapsulating this functionality, i.e. the difflib similarity function is integrated, and do_lookup() is tested to ensure mappings are as expected (expected: it returns the canonical name, i.e. from the Name column).
@pmdscully it doesn't need a lookup function. All of that works well. It just needs to import your csv and turn it into the same mapping df I use now. That is the least amount of changes.
@pmdscully ie just replace the code in this function up until https://github.com/djay/covidthailand/blob/main/utils_thai.py#L215 and ensure the resulting df has the same column names and you have used the same canonical names and it will just work
@pmdscully ie just replace the code in this function up until https://github.com/djay/covidthailand/blob/main/utils_thai.py#L215 and ensure the resulting df has the same column names and you have used the same canonical names and it will just work
Okay, cool.
In that case, I would think we can replace up until this line https://github.com/djay/covidthailand/blob/25fe9c54f12ca1470edbe333fbb947334ff86f4a/utils_thai.py#L226
which is adding Populations per province. The Wikipedia provinces table already has a column for this.
@pmdscully it doesn't need a lookup function. All of that works well. It just needs to import your csv and turn it into the same mapping df I use now. That is the least amount of changes.
Okay.
If you have those categories of mappings, then I think we're good to go.
In the __get_alt_name_mappings() function I wrote, it converts the df['alt_names'] lists into your standard mapping_dict format (i.e. {Province->ProvinceEn,..}); feel free to reuse that conversion code if needed.
@djay - As you say, the column names will need changing. Can you do print(df.columns) for that df and post it here, so we can rename and verify the new csv columns?
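The __get_alt_name_mappings() function itself lives in the Colab and isn't shown in this thread; a guess at the shape of that conversion, with column names assumed from the discussion:

```python
import pandas as pd

# Assumed shape of the Colab dataframe: canonical name plus a list of
# alternative spellings per row (column names are guesses from the thread).
df = pd.DataFrame({
    "ProvinceEn": ["Phrae", "Mae Hong Son"],
    "alt_names": [["แพร่"], ["แมฮ่อ่งสอน", "แม่ฮ่องสอน"]],
})

# Flatten into {any known spelling -> canonical English name}; the canonical
# name maps to itself so direct hits also resolve.
mapping_dict = {}
for _, row in df.iterrows():
    mapping_dict[row["ProvinceEn"]] = row["ProvinceEn"]
    for alt in row["alt_names"]:
        mapping_dict[alt] = row["ProvinceEn"]
```
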
@pmdscully you are probably going to have to run it to submit a working PR anyway, but the table looks like:
Area of Thailand Health District Number ProvinceEn
ACR Lower Northeastern Region Area 2 10 Amnat Charoen
ATG Upper Central Region Area 2 4 Ang Thong
AYA Upper Central Region Area 1 4 Phra Nakhon Si Ayutthaya
Amnat Charoen Lower Northeastern Region Area 2 10 Amnat Charoen
AmnatCharoen Lower Northeastern Region Area 2 10 Amnat Charoen
... ... ... ...
เวียงจันทร์ Unknown Unknown
แพร่ Upper Northern Region Area 2 1 Phrae
แมฮ่อ่งสอน Upper Northern Region Area 1 1 Mae Hong Son
แม่ฮ่องสอน Upper Northern Region Area 1 1 Mae Hong Son
ไม่ระบุ Unknown Unknown
[540 rows x 3 columns]
Area of Thailand is unused, so ignore that. Health District Number is a str also.
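Given a table of that shape, the lookup side is a plain indexed map. This miniature version uses a few rows from the sample above (with Health District Number kept as str, as noted):

```python
import pandas as pd

# Miniature version of the mapping table: index = any known spelling.
mapping = pd.DataFrame(
    {
        "Health District Number": ["10", "10", "Unknown"],
        "ProvinceEn": ["Amnat Charoen", "Amnat Charoen", "Unknown"],
    },
    index=["ACR", "AmnatCharoen", "ไม่ระบุ"],
)

# Scraped names normalise with a single indexed lookup via Series.map.
scraped = pd.Series(["AmnatCharoen", "ไม่ระบุ"])
normalised = scraped.map(mapping["ProvinceEn"])
```
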
@pmdscully are you able to create a new PR?
@djay - I propose to do the following (isolated) tasks, and then I think you can best finalise the integration and its decisions:
- (isolated) CSV Health District column - cast to str.
- (isolated) Rename CSV columns to match the existing dataframe columns: [index=mappings, 'Health District Number', 'ProvinceEn'].
- (isolated) Write and test isolated conversion code from the new CSV format to the existing data columns.
- /dir/? pd.read_csv() to load in the new CSV and copy in the conversion code.
@pmdscully not really sure I understand. Is there a reason you can't just create a PR which runs that I can review? If I understand what you're proposing above, then I think you would end up checking in a similar CSV to what I already have? All you need to do is create a PR where you replace my CSV with your CSV and then add a bit of code to convert it into a mapping from index -> ProvinceEn and health district number.
@pmdscully not really sure i understand. Is there a reason you can't just create a PR which runs that I can review?
@djay Ahh, now I remember..... It's because it does not run on Windows (without about 2-3 filename compatibility changes). I recommended these changes a month ago or so, but you did not include them, because there were concerns they would affect named caching files. These are things like removing *, ? and \\ (or //, I forget now) characters from all filenames.
So, the reason is that I cannot run it without multiple other changes that I expect will not get included; thus I expect I will repeat the fix each time I git clone and yet not commit that change...
If Windows compatibility can be included (this might break other compatibilities, and I can't know because we don't have unit tests), then I can run, trial and make changes more easily, and then I can make a PR. Otherwise, I'm fishing in the dark.
Okay, so: *, ?, \\ and // are removed from filenames?
@pmdscully sorry, I put in all the os.path changes but forgot about the filename cache changes. I can put those in, I think.
@pmdscully fixed. Cache was only an issue if most of it had to be re-downloaded. That's not the case.
Thanks @djay - I'll take another git pull and another look.
@pmdscully fixed. Cache was only an issue if most of it had to be re-downloaded. That's not the case.
@djay - So I ran the latest committed version, however, I hit this error again:
Exporting: risk_groups
Exporting: risk_groups_unmatched
========ArcGIS==========
Missing: json\query?f=json&where=1%3D1&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=*&resultOffset=0&resultRecordCount=1000&cacheHint=true
Download: json\query?f=json&where=1%3D1&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=*&resultOffset=0&resultRecordCount=1000&cacheHint=true
Traceback (most recent call last):
File ".\covid_data.py", line 2004, in <module>
scrape_and_combine()
File ".\covid_data.py", line 1976, in scrape_and_combine
hospital = get_hospital_resources()
File ".\covid_data.py", line 1940, in get_hospital_resources
file, content = next(web_files(every_district, dir="json", check=True))
File "\covidthailand\utils_scraping.py", line 190, in web_files
with open(file, "wb") as f:
OSError: [Errno 22] Invalid argument: 'json\\query?f=json&where=1%3D1&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=*&resultOffset=0&resultRecordCount=1000&cacheHint=true'
So I applied my fix for Windows Filename Compatibility.
utils_scraping.py -> Line 184:
file = ''.join([f for f in file if f not in '?*:<>|']) # Windows Filename Compatibility.
I can successfully complete a full run when running locally on Windows.
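The one-line fix as a standalone helper, using the same character set as the snippet above (the characters Windows forbids in file names that show up in the URL-derived cache names):

```python
# Strip the characters Windows forbids in file names from URL-derived
# cache names (same character set as the one-line fix above).
def safe_filename(name: str) -> str:
    return "".join(ch for ch in name if ch not in '?*:<>|')

# Example cache name derived from a query URL.
cache_name = 'json-query?f=json&where=1%3D1&outFields=*'
safe = safe_filename(cache_name)
```
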
Regarding def get_provinces():
Okay, I think this change moves py-dicts into csv to make them easier to manage, which is a reasonable way forward. However, there are code problems at the moment.
Actual problems with the Province Translation / Mapping task:
Base Solution via Real Namings: In Thailand, there are 77 provinces, plus we can include the Correctional Prisons reporting, so 78 entries. There are formal namings for the Thai and English variations of those 77 here on wiki. That includes the 3-letter "Harmonized System" naming for tax, etc.
Problematic Abbreviations of Namings: In some cases, RB has used abbreviations in the past. As far as I am aware, there's no new data to collect from RB these days. However, perhaps I am missing something.
Which solution? I believe you are working towards putting these abbreviations into X->Y,Z,A,B,C mappings, which would give a total of 78 entries.
Problematic Parsed Namings: Spelling variations, as extracted from the parsed PDFs (i.e. spelling variations in Thai), have existed in the past and caused problems.
Solution to Problematic Parsed Namings: In these non-direct mapping cases, we can use similarity matching. As we only have strings (no additional contextual data to use), difflib seems the reasonable choice. I have written about using difflib with dict mappings here.
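A minimal sketch of that exact-then-fuzzy lookup using difflib.get_close_matches; the mapping contents and the 0.8 cutoff are illustrative choices, not values from the repo:

```python
import difflib

# Illustrative mapping: keys are every known spelling, values the
# canonical English name.
mapping = {
    "Mae Hong Son": "Mae Hong Son",
    "แม่ฮ่องสอน": "Mae Hong Son",
    "Phrae": "Phrae",
}

def fuzzy_lookup(name, mapping, cutoff=0.8):
    # Exact hit first, then fall back to difflib similarity matching.
    if name in mapping:
        return mapping[name]
    close = difflib.get_close_matches(name, mapping.keys(), n=1, cutoff=cutoff)
    return mapping[close[0]] if close else None

# A misspelt Thai name resolves to its canonical English form.
result = fuzzy_lookup("แมฮ่อ่งสอน", mapping)
```
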