Closed by josh-rhodes 1 week ago
Hi @josh-rhodes - I know @mialondon has already made some useful comments about the readme, but here are a few more!
Quick note: some of the hyperlinks to section headers seem broken.
Just re-upping @mialondon's comment: clarifying what the problem is and how this code solves it would be really great. E.g. I-CeM data isn't geolocated, and we don't have historical street vector data, so this does X and Y to create a dataset that accomplishes Z.
Love both of the figures.
Seems that "Set parameters" and "folder structure and data" are not in TOC. Is that intentional?
Does there need to be a short explanation of the rationale for using 1851 parish boundaries for the whole period? And, if someone wanted to only work with a later census year, would it be useful to explain some of the limitations/challenges of using 1851 boundaries for, say, 1901 census data?
I don't think there is an explanation of the role that Sub-Registration Districts play in coordination with the Parish boundaries. That would be handy.
Under "Data Output" --> "Linked", it's unclear how the table "fields" relates to the sample output. Perhaps some explanation about how the fields in the sample output correspond? (same for next 2 output files).
"New geometry file" - I would explain why splitting/combining lines is useful, so that people can further digest why RSDs are useful and whether or not this kind of consideration is important for the way they want to use the data.
Finally, I would suggest adding a section to the README about how to contribute (if people have questions/feature requests/etc, what should they do - open a ticket? email you? etc).
For the citation: once you've picked a license and published this as a package, perhaps worth considering https://joss.theoj.org/ as a way to have a paper citation for the code?
Thanks so much for these comments @kmcdono2! Just working my way through these...
> Under "Data Output" --> "Linked", it's unclear how the table "fields" relates to the sample output. Perhaps some explanation about how the fields in the sample output correspond? (same for next 2 output files).
Hmm, ok, this needs to be made clearer then. What I'm trying to show below is the type of field and then an actual sample of the data. So, `target geometry address` is the field type, e.g. it's the field from the target geometry that contains the address string ('High Street'). For GB1900, that's `final_text`, as you know, and for OS OpenRoads it's `name1`. Because the pipeline can take any geometry dataset and you just tell it which field the address is in, the outputs from different target geometry datasets will look different because they reflect the naming conventions of the target geometry.
Similarly, you can use different types of fuzzy string matching algorithms, and so the field that stores the score will be named according to the algorithm you choose.
Does that make sense? Shall I add in an explanation along those lines?
fields |
---|
target geometry address |
census address unique id |
target geometry address unique id |
census address |
tfidf weighting |
fuzzy string comparison score |
weighted fuzzy string comparison score |
Sample output
final_text | unique_add_id | gb1900_1851 | address_anonymised | tfidf_w | rapidfuzzy_wratio_s | rapidfuzzy_wratio_ws |
---|---|---|---|---|---|---|
SOUTH HILL PARK | BAGSHOT ROAD AND SOUTH HILL PARK_1452.0_1300001.0 | 5815d6182c66dc3849011ef2_1452.0_1300001 | BAGSHOT ROAD AND SOUTH HILL PARK | 0.0743637355789496 | 0.9 | 0.06692736202105463 |
BARTHOLOMEW STREET | BARTHOLOMEW STREET SHAWS COURT_1260.0_1200002.0 | 5848759c2c66dcdcda000168_1260.0_1200002 | BARTHOLOMEW STREET SHAWS COURT | 0.06117412360590356 | 0.9 | 0.05505671124531321 |
BARTHOLOMEW STREET | BARTHOLOMEW STREET STILLMANS COTTAGES_1260.0_1200002.0 | 5848759c2c66dcdcda000168_1260.0_1200002 | BARTHOLOMEW STREET STILLMANS COTTAGES | 0.05017107492325159 | 0.9 | 0.04515396743092643 |
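To make the correspondence concrete, here is a rough sketch (not the pipeline's actual code) of how the generic field types line up with the sample column names. The helper `output_columns` and the `<algorithm>_s` / `<algorithm>_ws` naming pattern are assumptions inferred from the sample above. The last lines check something that also appears to hold in the sample rows: the weighted score equals the tf-idf weight multiplied by the fuzzy score.

```python
import math


def output_columns(target_address_field, target_id_field, algorithm):
    """Hypothetical helper: map each generic field type to a column name.

    The naming pattern is inferred from the sample output, not taken
    from the pipeline itself.
    """
    return {
        "target geometry address": target_address_field,
        "census address unique id": "unique_add_id",
        "target geometry address unique id": target_id_field,
        "census address": "address_anonymised",
        "tfidf weighting": "tfidf_w",
        "fuzzy string comparison score": f"{algorithm}_s",
        "weighted fuzzy string comparison score": f"{algorithm}_ws",
    }


# Column names as they appear in the GB1900 sample output.
gb1900_columns = output_columns("final_text", "gb1900_1851", "rapidfuzzy_wratio")
for field_type, column in gb1900_columns.items():
    print(f"{field_type} -> {column}")

# First sample row: weighted score looks like tfidf weight x fuzzy score.
sample_row = {
    "tfidf_w": 0.0743637355789496,
    "rapidfuzzy_wratio_s": 0.9,
    "rapidfuzzy_wratio_ws": 0.06692736202105463,
}
weighted = sample_row["tfidf_w"] * sample_row["rapidfuzzy_wratio_s"]
print(math.isclose(weighted, sample_row["rapidfuzzy_wratio_ws"], rel_tol=1e-9))
```

With a different target geometry (say OS OpenRoads with `name1`) or a different matching algorithm, only the arguments would change, which is why the output headers differ between runs.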
@kmcdono2
> Does there need to be a short explanation of the rationale for using 1851 parish boundaries for the whole period? And, if someone wanted to only work with a later census year, would it be useful to explain some of the limitations/challenges of using 1851 boundaries for, say, 1901 census data?
The parish boundaries are for 1851 but I-CeM uses that file to create the consistent parish boundaries for each census year. Here we dissolve the 1851 parishes to create the correct consistent parish boundaries for the given census year (later we create a union with the RSDs to break the big consistent parishes up a bit more).
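As a toy illustration of the dissolve-then-union idea (again, not the pipeline's code), the sketch below stands in for real polygons with sets of grid cells. All ids, the lookup table, and the function names `dissolve` and `union_overlay` are hypothetical; a real workflow would use a GIS library on actual boundary files.

```python
from collections import defaultdict

# Toy 1851 parishes: each "geometry" is a set of grid cells, not a polygon.
parishes_1851 = {
    "parish_a": {(0, 0), (0, 1)},
    "parish_b": {(1, 0), (1, 1)},
    "parish_c": {(2, 0), (2, 1)},
}

# Hypothetical lookup for one census year: 1851 parish -> consistent parish.
consistent_lookup = {"parish_a": "cp_1", "parish_b": "cp_1", "parish_c": "cp_2"}

# Hypothetical RSDs, also as sets of grid cells.
rsds = {
    "rsd_north": {(0, 1), (1, 1), (2, 1)},
    "rsd_south": {(0, 0), (1, 0), (2, 0)},
}


def dissolve(units, lookup):
    """Merge all units that share the same id in the lookup."""
    merged = defaultdict(set)
    for unit_id, cells in units.items():
        merged[lookup[unit_id]] |= cells
    return dict(merged)


def union_overlay(left, right):
    """Split two layers against each other, keeping every resulting piece."""
    pieces = {}
    for l_id, l_cells in left.items():
        for r_id, r_cells in right.items():
            shared = l_cells & r_cells
            if shared:
                pieces[(l_id, r_id)] = shared
    all_left = set().union(*left.values())
    all_right = set().union(*right.values())
    # Keep the parts of each layer that the other layer does not cover.
    for l_id, l_cells in left.items():
        if l_cells - all_right:
            pieces[(l_id, None)] = l_cells - all_right
    for r_id, r_cells in right.items():
        if r_cells - all_left:
            pieces[(None, r_id)] = r_cells - all_left
    return pieces


consistent_parishes = dissolve(parishes_1851, consistent_lookup)
pieces = union_overlay(consistent_parishes, rsds)
for key, cells in sorted(pieces.items()):
    print(key, sorted(cells))
```

Here the large consistent parish `cp_1` ends up split into a north and a south piece by the RSDs, which is the "break the big consistent parishes up a bit more" step.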
I'll take a look at the readme again and clarify this!