fedenanni closed 3 years ago
OK, here is where I am. I have set up the deezy environment, downloaded the wikidata from your Zenodo deposit, cloned this repo, and checked out the bhocands_wiki_alignment branch.
Now I am trying to run python create_britTM_dataset.py but I get this error: No such file or directory: '/resources/wikigazetteer/wikiGaz_en_filtered.pkl'
I have wikiGaz_en_basic.pkl but not wikiGaz_en_filtered.pkl.
How do I make that? I can't seem to find instructions anywhere. Or do I just rename the basic.pkl?
Hi @kmcdono2! You're right, there's a gap there, I will update the code and instructions in a bit, thanks for checking!
Hi @fedenanni,
@mcollardanuy hey! It's a tiny PR and still a sketch, but could you check the function in align_bho_cands_to_wikidata.ipynb and tell me where it should go (or do we want to keep it in this independent script for now)? The other script is just a draft that will be expanded soon.
Thanks! I have moved the two notebooks to a new folder: toponym_resolution/bho_wikidata. I hope that's fine; this way we can add a new directory for other combinations of datasets. I thought of keeping the bho/ folder specifically for things BHO-related (e.g. processing), and likewise for the wikidata/ folder.
I have checked the notebook and it all looks good, but I haven't properly tested it yet: there's that other PR waiting for DeezyMatch regarding the distance metrics, so I will try to do that other review first because it will have a direct impact on this one.
@mcollardanuy good idea for the resolution/ folder, that's perfect. OK, let's do the other PR and come back to this one after.
Hi @fedenanni,
I will now review the notebooks align_bho_cands_to_wikidata.ipynb and stats_on_disambiguating_bho_cands.ipynb.
Could you have a look at the changes I made in the rest of the notebooks/scripts in this pull request, if you have time? They are small but important changes in the code, mostly about fixing paths and making sure all pieces of code are connected. Sorry that I reused this PR for this!
The readmes should be up-to-date now:
bho/README.md: describes obtaining and processing the BHO topographical dictionaries and preparing them for use with DeezyMatch.
wikidata/README.md: describes obtaining and processing Wikidata and preparing it for use with DeezyMatch (⚠️ do not run the wikidata extraction: it takes two days to process wikidata).
toponym_matching/README.md: contains instructions to obtain DeezyMatch candidates from scratch, given two processed datasets (e.g. bho and wikidata).
The main changes in the code are adding prepare_britwikigaz.ipynb (a new notebook that downloads an English wikigazetteer from Zenodo and creates a British Isles-filtered version of the Wikipedia gazetteer, which will be used to train a DeezyMatch model) and fixing paths or adding documentation.
Thanks!
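For readers following along, the British Isles filtering step mentioned above could look roughly like the sketch below. This is not the code in prepare_britwikigaz.ipynb: the entry layout (dicts with name, lat and lon keys), the bounding-box values, and the function names are all assumptions made for illustration, and a plain bounding box is a crude approximation that also catches a strip of the continental coast.

```python
# Hypothetical sketch: keep only gazetteer entries whose coordinates fall
# inside a rough bounding box around the British Isles.
# The box values and the entry layout are assumptions, not taken from the repo.

BRITISH_ISLES_BBOX = {
    "min_lat": 49.0,
    "max_lat": 61.0,
    "min_lon": -11.0,
    "max_lon": 2.0,
}


def in_bbox(lat, lon, bbox=BRITISH_ISLES_BBOX):
    """Return True if (lat, lon) lies inside the bounding box (inclusive)."""
    return (
        bbox["min_lat"] <= lat <= bbox["max_lat"]
        and bbox["min_lon"] <= lon <= bbox["max_lon"]
    )


def filter_gazetteer(entries, bbox=BRITISH_ISLES_BBOX):
    """Drop entries without coordinates or located outside the box."""
    kept = []
    for entry in entries:
        lat, lon = entry.get("lat"), entry.get("lon")
        if lat is None or lon is None:
            continue  # entries with missing coordinates cannot be placed
        if in_bbox(lat, lon, bbox):
            kept.append(entry)
    return kept


# Toy data standing in for the real gazetteer.
sample = [
    {"name": "London", "lat": 51.5074, "lon": -0.1278},
    {"name": "Dublin", "lat": 53.3498, "lon": -6.2603},
    {"name": "Paris", "lat": 48.8566, "lon": 2.3522},
    {"name": "Unknown", "lat": None, "lon": None},
]
filtered = filter_gazetteer(sample)
print([entry["name"] for entry in filtered])
```

A polygon-based test would be more precise than a box, at the cost of an extra dependency.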
@mcollardanuy ok - all good on my side!