Closed Captainkirkdawson closed 4 years ago
I would be happy for @Captainkirkdawson to produce a new spec for the parms data. I am limited in what I can contribute by my ignorance both of the source formats we are working from, and the target information processing environment they end up in. A 'statement of purpose' for the parms, summarizing their use both for data entry/validation and for searching/retrieval, would I think be a useful start. @geoffj-FUG's sterling work on the Ireland situation should I suggest be taken into account. Over the weekend I have been trying, without much success, to get higher-level records (counties and Registration Districts) out of the TNA Discovery API. I'm assuming that it would be useful to have a complete geographical hierarchy for each census year. If we had this, could we just indicate where each Piece fits into the hierarchy? As regards the serialization of the new parms, I would personally prefer XML (which is the format I work in all the time), so it would be useful to have at least the definition of an XML version which I can use as a target, even if the results are then converted to json.
@richardofsussex one of my questions for you was going to be whether you can obtain the Registration District, so you were already ahead of me in trying to extract it. Perhaps you could point me towards a document on the API, as that might (or might not) help me.
This page gives you access to the API Sandbox: https://discovery.nationalarchives.gov.uk/API/sandbox/index#/SearchRecords It includes fields for you to fill in, and shows you what the result will be. It also displays your search as a URL and as a call to curl, so you can 'take it away' and run it in a different environment. I have complete files for 1841-1891 which came from their TSV files, and which might be an easier source of Reg. Districts. I'm currently investigating the 'children' API command, which gives me a list of county records (level 4). The Reg. Districts are level 5 records.
Thanks Richard. A quick scan indicates that the cost of my getting to understand that API would far, far exceed the benefit. I will leave it in your very capable hands.
I think I can see a way to extract Reg. Districts via a two-stage XML extraction process. If this is definitely needed I will plough on with it. I'll do 1901 and 1911 first. Does it make sense to extract the TNA identifier for each unit, as well as its title? This would give us access to the full record, both for display, as a potential link for our users, and for programmatic access to the data it contains. The approach could be used at lower levels, e.g. parishes, and would remove one of my concerns which is the fact that the parish name, of itself, is not a unique identifier (i.e. there are several with the same name, sometimes in the same county!).
@richardofsussex Is the TNA identifier the C13335 in the following example? https://discovery.nationalarchives.gov.uk/browse/r/h/C13335 This number changes as one drills down to the actual piece, so it is a reference to a specific page. The following image has the number C3139040.
Now before I go further in specification do you get access to the titles of the parents?
i.e. the LONDON-MIDDLESEX and the Registration District 1. KENSINGTON? I know you extract the RG 10/2 Registration Sub-District 1B St Mary Paddington and the Civil Parish, Township or Place: Paddington. I presume you can also extract any notes that follow the Paddington
eg the (3) in
That may also be (part) followed by another comment
Last point: there is no benefit in moving away from CSV. The team is happy with that format, and there is enough new stuff going on that introducing a new one to them is not worthwhile.
Yes, the 'C' numbers are record identifiers within the TNA catalogue. Being an archival recording system, there is a hierarchical relationship between these records, and each has a specific 'level' recorded.
In my earlier work I was using the record search facility, which returns an XML response including a RecordSearchResultViewModels element for each record. This search works nicely for level 6 records, i.e. Registration Sub-Districts, but doesn't seem to be able to find higher-level records. These search results do not include links to their parent record.
More recently, I have been exploring the 'children of' option, starting from the top level, e.g. RG10, and hoping to work all the way down to Sub-Districts. This search returns InformationAssetIdentityViewModel elements, which do have a ParentId specified. However, since we are attacking them 'from above', we know that parent identity already.
I am currently stalled by a combination of minor irritations. I want to recursively call for child records using the XSLT document() function. By default the API delivers JSON, and I have bought a licence allowing me to use the XSLT 3.0 facility to load a JSON string, but for some reason my XML editor won't accept that I have this licence. I can't load XML because the URL syntax doesn't support that.
Happy to stick with CSV.
OK, I have finally succeeded in extracting what I think is all the data we need from the TNA Discovery catalogue, for 1901. I did this (in case you're interested) by applying an XSLT 3.0 transform to the top-level XML export, then recursively asking for the child records at each level until I got to Sub-District. I'm not suggesting that we actually load the information in this format; at this point in the proceedings I just want agreement that I have extracted everything we need. Although I had a learning experience working with JSON in XSLT, this did have the advantage that the individual parishes are all neatly surrounded by paragraph markup in this format - which is not the case with their XML download format. rg13-overview.zip
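The recursive "ask for the children at each level" approach described above can be sketched in Python as follows. This is only an illustration, not the XSLT that was actually used: the record shape, the identifiers and the `fetch_children` callable are all assumptions, and the sample dictionary stands in for real calls to the Discovery API so that the sketch runs offline. The level numbers follow the thread (4 = county, 5 = Registration District, 6 = Sub-District).

```python
from typing import Any, Callable, Dict, List

# Hypothetical record shape: {"id": "...", "title": "...", "level": 4}
Record = Dict[str, Any]


def walk(record_id: str,
         fetch_children: Callable[[str], List[Record]],
         max_level: int = 6) -> List[Record]:
    """Depth-first walk collecting every descendant down to max_level."""
    found: List[Record] = []
    for child in fetch_children(record_id):
        # We arrive "from above", so the parent id is already known.
        found.append({**child, "parent_id": record_id})
        if child["level"] < max_level:
            found.extend(walk(child["id"], fetch_children, max_level))
    return found


# Stand-in for the catalogue: parent id -> child records (all ids invented).
SAMPLE = {
    "SERIES": [{"id": "X1", "title": "County A", "level": 4}],
    "X1": [{"id": "X2", "title": "District 1", "level": 5}],
    "X2": [{"id": "X3", "title": "Sub-District 1A", "level": 6},
           {"id": "X4", "title": "Sub-District 1B", "level": 6}],
}


def fetch_from_sample(record_id: str) -> List[Record]:
    """Offline replacement for a real 'children of' request."""
    return SAMPLE.get(record_id, [])


rows = walk("SERIES", fetch_from_sample)
for row in rows:
    print(row["level"], row["id"], row["parent_id"], row["title"])
```

Passing the fetch function in as a parameter keeps the traversal separate from the network call, so the same walk could be pointed at the live API once the request details are settled.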
rg14-overview.zip Here is RG14 (1911) in the same format. Does anyone know what the codes (e.g. RD 1 RS 1 ED 1) mean?
Assuming I have captured all the information that is required, it should be a quick and easy job to convert these XML 'overview' documents into CSV PARMS files in @Captainkirkdawson's preferred format. You might like to think about having a single CSV load for each census year, rather than uploading the data county by county.
Rats: I have just realised that these results may not be complete (default limit on search results ...). So just comment on the format and content, please.
rg14-overview.zip rg13-overview.zip These should be complete. 1901 looks pretty good: there are some missing labels at the end of 1911.
Richard
ED1 is the Enumeration District. In 1911 there was a one to one relationship with the piece number.
EDs start at 1 for each Registration sub-District.
There are several Registration Sub-Districts to a Registration District. In 1911 the Registration Districts each had a number.
So it is part of the hierarchy County, Registration District, Registration sub-District, ED number.
Geoff
@richardofsussex After a 5-minute review, all I can say at this point is "Hats off". That is magnificent. More comments to follow, but it has all of what I was going to ask for and a little more besides.
@richardofsussex as Geoff indicated, the RD 1 RS 1 ED 10 in 1911 refers back to the Registration District, Sub-District and Enumeration District number (a number within the sub-district).
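As a rough illustration of how a 1911 code such as "RD 1 RS 1 ED 10" breaks down into the three numbers just described, here is a small Python sketch. The function name and the output keys are invented for the example, and the pattern assumes the three parts always appear together in that order, which may not hold for every entry (the naval and military pieces, for instance).

```python
import re
from typing import Dict, Optional

# RD = Registration District, RS = Registration Sub-District,
# ED = Enumeration District (numbered within the sub-district).
CODE_RE = re.compile(r"RD\s*(\d+)\s+RS\s*(\d+)\s+ED\s*(\d+)")


def parse_ed_code(text: str) -> Optional[Dict[str, int]]:
    """Return the three numbers from an 'RD x RS y ED z' code, or None."""
    match = CODE_RE.search(text)
    if match is None:
        return None
    rd, rs, ed = (int(group) for group in match.groups())
    return {"district": rd, "sub_district": rs, "enumeration_district": ed}


print(parse_ed_code("RD 1 RS 1 ED 10"))
print(parse_ed_code("Paddington"))
```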
@richardofsussex I am tending at this point to say just provide the yearly xml and we extract into the database from the XML directly rather than converting and reconverting. Which based on all of this could be usefully reshaped. How about trying 1861?
Will do, but I'll start by tidying up 1911 a bit. There is missing data at the end, and now I know what the codes mean I think I'll make them into an attribute rather than a 'parish'.
Richard
I have just opened your XML file in Dreamweaver (the limit of my expertise I am afraid) and the structure looks good to me. It is the complete hierarchy.
If I look beyond basic FreeCEN functionality to future use as Open Data the structure you have been able to extract will enable a wide range of targeted data to be obtained from the FreeCEN database.
Geoff
rg14-overview.zip rg14-overview-2.zip Here is 1911 in more complete form, with the naval and military entries (which had a different structure from normal parishes). The first file is the default result; the second moves the ED code up to an attribute.
To my delight, the 1861 data gives a viable result first time. (Apart from the ships!) The pattern of data is different, e.g.
`
rg9-overview-2.zip Updated RG9 overview for comments.
rg10-overview-3.zip RG10 with tidying-up of parishes and hamlets.
@richardofsussex I need to take stock of what you have provided to date. So hold fire on tweaking just a little for feedback. Will work 14/13/9/10 if this is OK or do you want a different order? Need to have breakfast first
No problem: I'm happy with that order. Currently working on 11.
rg12-overview-3.zip rg11-overview-3.zip 11 and 12.
@richardofsussex wrt rg14-overview-2: Excellent, but there is one common issue: the extraction of the code RD X RS Y ED Z is inconsistent. Look at lines 66350-66514 for examples. It appears that it only works if there is 1 parish; it does not if there are 2 or more. For the Royal Navy and Military, each piece ends with the District number, e.g. RD640 and 641. Would it be possible to place that in the District name and remove it as a parish?
@richardofsussex wrt rg13-overview Usable as is
@richardofsussex wrt rg9-overview-2 a) Can we make the hamlet a child of the parish please b) Can we extract the Islands from the Isles of the British and their parish children
PS the tnaid is an absolute godsend; just hope they never change it!!
@richardofsussex wrt rg10-overview-3 a) Really need the District name as well as its number eg b) Like rg9, hamlet to be a child of the parish c) For Royal Navy, the ships should be the district name
@Captainkirkdawson thank you very much for the comments. I'm packing up now for the day, but I'll finish with a not-quite-ready 1841 - the townships need separating into separate elements. Tomorrow is a Bank Holiday, so it will probably be next week before I get back to this. Still, lots of progress! 1841-overview-2.zip
@richardofsussex wrt rg11-overview-3 a) we are missing the county name b) as in rg10 we are missing the district name c) we are missing the subdistrict name d) when fixing c) please ensure that royal navy subdistrict name is Ships as noted for rg10
@richardofsussex wrt rg12-overview-3 a) as in rg10 and 11 we are missing the district name b) missing district number
@richardofsussex Thank you for the work to date. I have plenty to get on with here, and also an issue on REG that is urgent. I would like to compliment you on the work to date: we are extremely close to an excellent solution that will give us a firm base for the future and some new capabilities. Have a good weekend and avoid the virus.
1841-overview-3.zip Split the townships while the leeks were cooking!
@richardofsussex wrt 1841-overview-3 As you have correctly noted, there were no districts or subdistricts in 1841. They had Hundreds or Wapentakes, depending upon the county. Soke and Liberty were also possible units, but appear to have been associated with a hamlet or township. For our use we should use the term that is in the census, i.e. Hundred or Wapentake, as a child within a county. It will be useful to display this to the researcher. It will have a name associated with it and have parishes as children. The parish may have hamlets or townships as children. In some counties there is also ALLOTMENTS IN FENS; this should be treated as a child.
It looks as though a recording policy is emerging as we work through the data. You're encouraging me to retain the original naming of place types and to express the hierarchical relationships between them where known/knowable. Thus far I haven't always done this, tending more towards harmonization of their varying practice. Also (now I come to think about it) I have used different encoding approaches at different levels. Would it make sense to make our approach consistent, by putting all place names into a <place> element, with type and name attributes? Then when working with the data, you could choose either to notice or to ignore the place type.
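To make the idea concrete, here is a hedged Python sketch of nested place elements and of the "notice or ignore the type" choice. The sample XML is invented for illustration (names and type values are placeholders, not taken from any of the overview files); only the element name and its type and name attributes come from the suggestion above.

```python
import xml.etree.ElementTree as ET

# Invented sample: every level is a <place>, and the nesting carries the hierarchy.
SAMPLE = """
<place type="county" name="Exampleshire">
  <place type="wapentake" name="Example Wapentake">
    <place type="parish" name="Example Parish">
      <place type="township" name="Example Township"/>
    </place>
  </place>
</place>
"""


def paths(element, trail=()):
    """Yield (type, name-path) for every place, preserving the hierarchy."""
    here = trail + (element.get("name"),)
    yield element.get("type"), " / ".join(here)
    for child in element.findall("place"):
        yield from paths(child, here)


root = ET.fromstring(SAMPLE.strip())
all_places = list(paths(root))

# Ignoring the place type: every place, whatever it is called.
names_only = [path for _kind, path in all_places]

# Noticing the place type: only the parishes.
parishes = [path for kind, path in all_places if kind == "parish"]

print(names_only)
print(parishes)
```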
Well said Richard. I think that is a really good idea.
Thus instead of:
`
@richardofsussex I have to say that I am now 100% confused. At the start of this week I had been working on a new specification for parms CSV files that were to be populated by the API extracts. All censuses for England from 1851 to 1911 have essentially the same structure. 1841 is uniquely different. I was evolving toward Census/Country/District/Subdistrict/Parish/Hamlet/Notes fields, which would fit into a revised database structure. I never got to the point of writing that specification before receiving the revised extracts based on the parent-child extracts in XML. These were seen as likely to make the CSV specification irrelevant, since XML files can be directly converted into a Ruby hash and are based on the TNA structure.
At no stage in this have I used the term place. That was deliberate because unfortunately place means something different to everyone; even TNA avoids its use except in the occasional text. It is a term that causes miscommunication throughout FreeUKGEN.
The new proposal appears to have a flat structure with every element a place. The actual coding is beyond my level of understanding of xml with the trailing / and the place closure eg
@Captainkirkdawson no, the suggested structure is anything but flat. I apologize: I omitted the end-tags from my example and then GitHub lumped all the markup together. Hopefully the image above makes my intention clearer. This is only a suggestion, inspired in part by your suggestion that we should retain the distinction between hundreds, wapentakes, sokes, etc. for 1841. I'm happy to go the other way; towards a more uniform structure.
1851-overview-2.zip @Captainkirkdawson here is 1851 (the last one to be addressed). I have made the XML with specific tags for each level: District, Sub-District, Parish and Hamlet. They should all have a name attribute, and possibly piece (Districts only), code and note attributes. Is this a pattern which you would be happy for me to apply to all the later years? If we can agree on that format, then we can look at 1841 as a special case and decide how to deal with it.
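For anyone wanting to see how XML with one tag per level might be flattened for loading, here is a minimal Python sketch. The tag names (`County`, `District`, `SubDistrict`, `Parish`, `Hamlet`), the column names and the sample data are assumptions made for the example; the real overview files may spell them differently. The name, piece and note attributes follow the description above. Each parish gets its own row whether or not it contains hamlets.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample using hypothetical tag names.
SAMPLE = """
<County name="Exampleshire">
  <District name="Example District" piece="1">
    <SubDistrict name="Example Sub-District">
      <Parish name="Example Parish" note="part">
        <Hamlet name="Example Hamlet"/>
      </Parish>
      <Parish name="Other Parish"/>
    </SubDistrict>
  </District>
</County>
"""

FIELDS = ["county", "district", "piece", "sub_district", "parish", "hamlet", "note"]


def flatten(county):
    """Yield one row per parish, plus one extra row per hamlet within it."""
    for district in county.findall("District"):
        for sub in district.findall("SubDistrict"):
            for parish in sub.findall("Parish"):
                base = {
                    "county": county.get("name", ""),
                    "district": district.get("name", ""),
                    "piece": district.get("piece", ""),
                    "sub_district": sub.get("name", ""),
                    "parish": parish.get("name", ""),
                    "hamlet": "",
                    "note": parish.get("note", ""),
                }
                yield base  # emitted even when the parish has no hamlets
                for hamlet in parish.findall("Hamlet"):
                    yield {**base,
                           "hamlet": hamlet.get("name", ""),
                           "note": hamlet.get("note", "")}


root = ET.fromstring(SAMPLE.strip())
rows = list(flatten(root))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```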
rg14-overview-4.zip @Captainkirkdawson I've made the changes you requested to rg14. Also placed the parish name in a 'name' attribute, and split off comments in brackets into a 'note' attribute.
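The name/note split mentioned above could look something like this in Python. It is a simplified sketch rather than the transform actually used: the function name is invented, and it assumes that everything from the first opening bracket onwards is a note, keeping the brackets in the note text.

```python
from typing import Tuple


def split_name_note(label: str) -> Tuple[str, str]:
    """Split 'Paddington (3)' into the place name and its bracketed note."""
    name, bracket, rest = label.partition("(")
    if not bracket:
        # No bracket at all: the whole label is the name.
        return label.strip(), ""
    return name.strip(), (bracket + rest).strip()


print(split_name_note("Paddington (3)"))
print(split_name_note("Paddington (part) (3)"))
print(split_name_note("Paddington"))
```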
@richardofsussex wrt 1851-overview-2 A quick review shows a problem. There are no parishes for many sub districts. Just look at the first 2. I suspect that parish is not being extracted if there is no hamlet.
Good catch: I think I can fix this quickly.
1851-overview-2.zip Here we go ...
@richardofsussex New version of 1851-overview-2 looks to be fine. Like the notes extraction. Will not know more until I try to process it into the database. Will not be attempting that for a while, as I have too much on my plate. rg14-overview-4 has addressed the issues raised and looks OK, subject to the same caveat as above.
That's fine: no rush. So long as I know you are happy with the format I am producing, I can continue to tidy up the other years. So long as the XML is consistently structured, it wouldn't be a big job for me to tweak it, were there to be problems when you come to load it.
rg9-overview-4.zip Here is rg9, with the islands and the name/note analysis. I tried and failed in a search for the ships: "recorded elsewhere" ...
rg10-overview-6.zip Here is the updated rg10, with District names restored, hamlets within parishes and the Royal Navy districts recorded as "Ships".
rg10-overview-6.zip Ditto with the ships actually present! (They were recorded as hamlets, so were stripped out by a routine which expected there always to be a parish present.)
Currently Parms files are loaded into FC2 from FC1 after FC1 has been updated. They are incorporated into FC2 through the overall FC2 monthly update. We need to be able to a) load a Parms file directly into FC2, and b) do this independently of the overall FC2 update. This is an urgent requirement for loading the 1901 and 1911 Parms files and to allow for checking of CSV uploads of 1901 and 1911 records.