andersen-lab / bjorn

GNU General Public License v3.0
20 stars · 4 forks

How is location name matching to ISO-3166 codes done? #8

Closed corneliusroemer closed 3 years ago

corneliusroemer commented 3 years ago

I'd like to help sanitize and normalize GISAID location names to ISO-3166 codes, e.g. to improve visualization of subdivisions on outbreak.info.

There's some hacky, ad-hoc name normalization going on in this repo, but I don't think that's the most sustainable way to fix the GISAID free-text location name problem.

I noticed you use datafunk, is that where most of the heavy lifting for geo-parsing happens?

Which software piece does the translation from 'Switzerland/Zürich' to 'CHE_CH-ZH'?

In /fasta2json.py you do things like mapping 'Zoerich' to 'Zürich', but I don't get the big picture. I've also found mistakes in going from normalised location names to ISO codes, so it'd be great to know where that happens.
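For reference, the kind of ad-hoc cleanup being described here boils down to a hand-maintained replacement table applied to free-text fields. A minimal sketch of that pattern, with illustrative entries that are not the repo's actual table:

```python
# Illustrative spelling fixes; the real table lives in the repo's
# fasta/json conversion script and is much longer.
DIVISION_FIXES = {
    "Zoerich": "Zürich",
    "Basel Stadt": "Basel-Stadt",
}

def normalize_division(raw: str) -> str:
    """Apply ad-hoc spelling fixes to a free-text division name."""
    name = raw.strip()
    return DIVISION_FIXES.get(name, name)

print(normalize_division("Zoerich "))  # "Zürich"
```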

flaneuse commented 3 years ago

Thanks for your interest in helping standardize the location names @corneliusroemer -- it's definitely a messy problem.

@AlaaALatif can you help figure out the best way to collaborate?

corneliusroemer commented 3 years ago

I found something here in https://github.com/cov-ert/datafunk/tree/master/datafunk

It looks like someone made an attempt at normalizing subdivisions but it never got very far: https://github.com/cov-ert/datafunk/blob/1aa97c78db207b43f3fe84881cfe43e728e1b8a6/datafunk/gisaid_json_2_metadata.py#L70

That function here also may be relevant, but not quite sure: https://github.com/cov-ert/datafunk/blob/1aa97c78db207b43f3fe84881cfe43e728e1b8a6/datafunk/travel_history.py#L35

I haven't yet figured out how country names are turned into ISO-codes. @AlaaALatif can you point me there? That'd be fab :)

AlaaALatif commented 3 years ago

Hi @corneliusroemer datafunk has nothing to do with location normalization.

ISO codes are added in merge_results.py, specifically in lines 144-148. We use custom-curated GADM files to do so for countries, divisions, and (U.S. only) counties.

gkarthik commented 3 years ago

@corneliusroemer As Al mentioned, the ISO codes are added here: https://github.com/andersen-lab/bjorn/blob/bjorn1_sitrep/src/merge_results.py#L144. The ISO codes themselves are obtained from GADM shapefiles available here.

However, the country, division, and location names are normalized here: https://github.com/andersen-lab/bjorn/blob/master/src/json2fasta.py#L103. Based on https://github.com/andersen-lab/bjorn/pull/7, you seem to have found this. So this is where you would want to add/edit any steps for normalizing names.

If you have a specific proposal for fixing location normalization beyond ad-hoc fixes, could you please post it here? We can iterate on it before you start writing code, to make sure we are on the same page.

corneliusroemer commented 3 years ago

Right, got it! Don't know how I missed that. That helps a lot.

Would you be able to share the relevant dictionary? Since that's what we need to normalise to, it's much easier if we know what we're aiming for: /home/al/data/geojsons/gadm_divisions.json etc.

Is that the only location matching that happens? The code you referenced turns for example 'Germany/Saxony-Anhalt' to 'DEU_DE-ST'.

Am I right in assuming that if we normalised all 26 cantons in Switzerland to something arbitrary, say A to Z, we would then see 26 lines in this graph instead of the 40-something? But the map would then not show anything, because for something to appear on the map, the GADM matching needs to be correct?

Thinking about it like that splits the problem in two: normalising the GISAID location mess to a unique canonical set, and then building a bijection from that unique set to the GADM set for display.

The first problem, making sense of the GISAID mess, seems to be the bigger one; the second, the bijection from a well-defined subdivision typology to GADM, is trivial in comparison.
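The two-step split described above could be sketched as two lookups: a normalization table (built by hand or by fuzzy matching), then a bijective map to GADM codes. The entries below are illustrative assumptions, not bjorn's actual tables:

```python
# Step 1: map GISAID free text onto one canonical name per subdivision.
# These variant spellings are illustrative examples only.
GISAID_TO_CANONICAL = {
    "Zoerich": "Zürich",
    "Zurich": "Zürich",
    "Saxony Anhalt": "Saxony-Anhalt",
}

# Step 2: bijection from canonical names to GADM/ISO-3166-2 style codes,
# in the 'DEU_DE-ST' format mentioned in this thread.
CANONICAL_TO_ISO = {
    "Zürich": "CHE_CH-ZH",
    "Saxony-Anhalt": "DEU_DE-ST",
}

def to_iso(raw: str):
    """Normalize a raw GISAID division name, then look up its code."""
    canonical = GISAID_TO_CANONICAL.get(raw, raw)
    return CANONICAL_TO_ISO.get(canonical)  # None if unmapped

print(to_iso("Saxony Anhalt"))  # DEU_DE-ST
```

With this split, only step 1 has to fight the free-text mess; step 2 is a plain dictionary that can be generated directly from the GADM files.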

So I'll start by thinking about how to tackle GISAID normalisation.

Do you have a GISAID country/subdivision dump somewhere? Some 100k samples to play with? I don't yet have access to the GISAID API, but could probably scrape the normal frontend to get the location data to play with.

For step 2, matching to GADM, I'd need the GADM division names, if you could publish that, that'd be great, too. We'll get there!

AlaaALatif commented 3 years ago

hi @corneliusroemer, I uploaded the aforementioned GADM files here: https://drive.google.com/file/d/1Lzo9RZE_FReF6Z8QKq01P5nQmMrp42Oz/view?usp=sharing

Please keep us posted on your progress. Good luck!

Cheers, Al

corneliusroemer commented 3 years ago

Thanks, that helps a lot! Now I just need the GISAID counterpart as training data, and off I go :)

Has no one solved this problem of normalising GISAID's freeform location data yet? With more than a million sequences on GISAID, this is not a worthless task!

gkarthik commented 3 years ago

Yes, this is definitely not straightforward. It depends on so many labs across the globe depositing sequences and the relevant metadata. A first step would be a simple Hamming distance, while accounting for non-ASCII characters, to find the closest matches.
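A minimal sketch of that idea, swapping Hamming distance for its unequal-length generalization (Levenshtein) so misspellings that add or drop letters still match, and folding diacritics with `unicodedata` so 'Zürich' and 'Zurich' compare equal. The canton list is just an illustration:

```python
import unicodedata

def ascii_fold(name: str) -> str:
    """Strip diacritics so e.g. 'Zürich' and 'Zurich' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (Hamming generalized to unequal lengths)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def closest_match(raw: str, canonical: list) -> str:
    """Return the canonical name nearest to the raw free-text name."""
    folded = ascii_fold(raw)
    return min(canonical, key=lambda c: edit_distance(folded, ascii_fold(c)))

cantons = ["Zürich", "Bern", "Geneva", "Basel-Stadt"]  # illustrative subset
print(closest_match("Zoerich", cantons))  # Zürich
```

In practice one would also want a distance threshold, so that genuinely unknown locations are flagged for manual review rather than force-matched.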

corneliusroemer commented 3 years ago

Applying bioinformatics to linguistics :')

Alternative: use robust search engines to identify the location, then reverse-geocode using shapefiles in GADM.

Don't you think search engines would perform well here?

Somehow we need to deal with the long tail of locations without reinventing the wheel.