Readme todo - GitHub Issues

josh-rhodes commented 2 years ago

[x] Add geometry parameters explanation for target geometry datasets
[x] Add target geometry output file explanation
[x] Expand user defined regex standardisation files explanation
[x] Add explanation of fuzzy string matching algorithm options

kmcdono2 commented 2 years ago

Hi @josh-rhodes - I know @mialondon has already made some useful comments about the readme, but here are a few more!

Quick note: some of the hyperlinks to section headers seem broken.

Just re-upping @mialondon's comment: clarifying what the problem is and how this code solves it would be really great. E.g. I-CeM data isn't geolocated, and we don't have historical street vector data, so this does X and Y to create a dataset that accomplishes Z.

Love both of the figures.

Seems that "Set parameters" and "folder structure and data" are not in TOC. Is that intentional?

Does there need to be a short explanation of the rationale for using 1851 parish boundaries for the whole period? And, if someone wanted to only work with a later census year, would it be useful to explain some of the limitations/challenges of using 1851 boundaries for, say, 1901 census data?

I don't think there is an explanation of the role that Sub-Registration Districts play in coordination with the Parish boundaries. That would be handy.

Under "Data Output" --> "Linked", it's unclear how the table "fields" relates to the sample output. Perhaps some explanation about how the fields in the sample output correspond? (same for next 2 output files).

"New geometry file" - I would explain why splitting/combining lines is useful, so that people can further digest why RSDs are useful and whether or not this kind of consideration is important for the way they want to use the data.

Finally, I would suggest adding a section to the README about how to contribute (if people have questions/feature requests/etc, what should they do - open a ticket? email you? etc).

For the citation: once you've picked a license and published this as a package, perhaps worth considering https://joss.theoj.org/ as a way to have a paper citation for the code?

josh-rhodes commented 2 years ago

Thanks so much for these comments @kmcdono2! Just working my way through these...

> Under "Data Output" --> "Linked", it's unclear how the table "fields" relates to the sample output. Perhaps some explanation about how the fields in the sample output correspond? (same for next 2 output files).

Hmm, ok, this needs to be made clearer then. What I'm trying to show below is the type of each field and then an actual sample of the data. So "target geometry address" is the field type, i.e. the field from the target geometry that contains the address string ('High Street'). For GB1900, that's final_text as you know, and for OS OpenRoads it's name1. Because the pipeline can take any geometry dataset and you just tell it which field the address is in, the outputs will look different for different target geometry datasets: the field names reflect the naming conventions of the target geometry.

Similarly, you can use different types of fuzzy string matching algorithms, and so the field that stores the score will be named according to the algorithm you choose.

Does that make sense? Shall I add in an explanation along those lines?

fields
target geometry address
census address unique id
target geometry address unique id
census address
tfidf weighting
fuzzy string comparison score
weighted fuzzy string comparison score

Sample output

final_text	unique_add_id	gb1900_1851	address_anonymised	tfidf_w	rapidfuzzy_wratio_s	rapidfuzzy_wratio_ws
SOUTH HILL PARK	BAGSHOT ROAD AND SOUTH HILL PARK_1452.0_1300001.0	5815d6182c66dc3849011ef2_1452.0_1300001	BAGSHOT ROAD AND SOUTH HILL PARK	0.0743637355789496	0.9	0.06692736202105463
BARTHOLOMEW STREET	BARTHOLOMEW STREET SHAWS COURT_1260.0_1200002.0	5848759c2c66dcdcda000168_1260.0_1200002	BARTHOLOMEW STREET SHAWS COURT	0.06117412360590356	0.9	0.05505671124531321
BARTHOLOMEW STREET	BARTHOLOMEW STREET STILLMANS COTTAGES_1260.0_1200002.0	5848759c2c66dcdcda000168_1260.0_1200002	BARTHOLOMEW STREET STILLMANS COTTAGES	0.05017107492325159	0.9	0.04515396743092643
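
As a rough sketch of how the field types above line up with the sample columns (this is not the pipeline's actual code; describe_columns and weighted_score are hypothetical helper names), the two lists can be paired by position. The three sample rows also look consistent with the weighted score being the tfidf weighting multiplied by the fuzzy string comparison score; that relationship is inferred from the sample values only, so treat it as an assumption.

```python
import math

# Field types from the "fields" list, in the same order as the sample columns.
FIELD_TYPES = [
    "target geometry address",
    "census address unique id",
    "target geometry address unique id",
    "census address",
    "tfidf weighting",
    "fuzzy string comparison score",
    "weighted fuzzy string comparison score",
]

# Header of the sample output (GB1900 target geometry, rapidfuzz WRatio scores).
SAMPLE_HEADER = [
    "final_text",
    "unique_add_id",
    "gb1900_1851",
    "address_anonymised",
    "tfidf_w",
    "rapidfuzzy_wratio_s",
    "rapidfuzzy_wratio_ws",
]

# (tfidf_w, rapidfuzzy_wratio_s, rapidfuzzy_wratio_ws) copied from the sample rows.
SAMPLE_SCORES = [
    (0.0743637355789496, 0.9, 0.06692736202105463),
    (0.06117412360590356, 0.9, 0.05505671124531321),
    (0.05017107492325159, 0.9, 0.04515396743092643),
]


def describe_columns(header):
    """Pair each output column name with its field type, by position (hypothetical helper)."""
    if len(header) != len(FIELD_TYPES):
        raise ValueError("header length does not match the number of field types")
    return dict(zip(header, FIELD_TYPES))


def weighted_score(tfidf_w, score):
    """Assumed relationship: weighted score = tfidf weighting * fuzzy score."""
    return tfidf_w * score


column_types = describe_columns(SAMPLE_HEADER)
for name, field_type in column_types.items():
    print(f"{name}: {field_type}")

# Check the assumed relationship against every sample row.
rows_consistent = all(
    math.isclose(weighted_score(w, s), ws, rel_tol=1e-12)
    for w, s, ws in SAMPLE_SCORES
)
print("weighted score matches tfidf_w * score:", rows_consistent)
```

With a different target geometry (e.g. OS OpenRoads) or a different algorithm, only the column names in the header would change; the field types stay the same.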

josh-rhodes commented 2 years ago

@kmcdono2

> Does there need to be a short explanation of the rationale for using 1851 parish boundaries for the whole period? And, if someone wanted to only work with a later census year, would it be useful to explain some of the limitations/challenges of using 1851 boundaries for, say, 1901 census data?

The parish boundaries are for 1851 but I-CeM uses that file to create the consistent parish boundaries for each census year. Here we dissolve the 1851 parishes to create the correct consistent parish boundaries for the given census year (later we create a union with the RSDs to break the big consistent parishes up a bit more).

I'll take a look at the readme again and clarify this!

https://github.com/Living-with-machines/historic-census-gb-geocoder/blob/15d4161d344aed681e8fdf04b681a271157b7db0/historic-census-gb-geocoder/ew_geom_preprocess.py#L75-L103
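
To illustrate the two steps described above (dissolve, then union with the RSDs), here is a toy sketch in plain Python. It is not the code linked above: areas are stood in for by sets of grid cells rather than real polygons, and every id, lookup and function name is made up for the example. A real implementation would use a GIS library for the geometry operations.

```python
from collections import defaultdict

# Toy "geometries": each area is a set of (x, y) grid cells. All ids are hypothetical.
PARISHES_1851 = {
    "P1": {(0, 0), (1, 0)},
    "P2": {(0, 1), (1, 1)},
    "P3": {(2, 0), (2, 1)},
}

# Hypothetical lookup for one census year: 1851 parish id -> consistent parish id.
CONSISTENT_LOOKUP = {"P1": "C1", "P2": "C1", "P3": "C2"}

# Toy registration sub-districts covering the same area.
RSDS = {
    "R1": {(0, 0), (0, 1)},
    "R2": {(1, 0), (1, 1), (2, 0), (2, 1)},
}


def dissolve(parishes, lookup):
    """Merge the 1851 parishes that share a consistent parish id."""
    dissolved = defaultdict(set)
    for parish_id, cells in parishes.items():
        dissolved[lookup[parish_id]] |= cells
    return dict(dissolved)


def split_by_rsd(consistent_parishes, rsds):
    """Keep the overlapping piece for each (consistent parish, RSD) pair.

    This stands in for the union step: because both layers cover the same
    area here, the non-empty overlaps are all the pieces there are.
    """
    pieces = {}
    for parish_id, parish_cells in consistent_parishes.items():
        for rsd_id, rsd_cells in rsds.items():
            overlap = parish_cells & rsd_cells
            if overlap:
                pieces[(parish_id, rsd_id)] = overlap
    return pieces


consistent = dissolve(PARISHES_1851, CONSISTENT_LOOKUP)
pieces = split_by_rsd(consistent, RSDS)

for key in sorted(pieces):
    print(key, sorted(pieces[key]))
```

In this toy case the large consistent parish C1 is broken into two smaller pieces by the RSD boundary, which is the effect the union step is meant to have.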

Living-with-machines / CensusGeocoder

Readme todo #16