FreeUKGen / FreeCENMigration

Issue tracking for project migrating FreeCEN to FreeCEN2 genealogy record database and search engine architecture. Code developed here is based on that developed in MyopicVicar
https://www.freecen.org.uk
Apache License 2.0

Ability to load a parms file directly into RC2 based on TNA extracts #833

Closed Captainkirkdawson closed 4 years ago

Captainkirkdawson commented 4 years ago

Currently, parms files are loaded into FC2 from FC1 after FC1 has been updated. They are incorporated into FC2 through the overall FC2 monthly update. We need to be able to a) load a parms file directly into FC2, and b) do so independently of the overall FC2 update. This is an urgent requirement for loading the 1901 and 1911 parms files and to allow checking of CSV uploads of 1901 and 1911 records.

richardofsussex commented 4 years ago

I would be happy for @Captainkirkdawson to produce a new spec for the parms data. I am limited in what I can contribute by my ignorance both of the source formats we are working from, and the target information processing environment they end up in. A 'statement of purpose' for the parms, summarizing their use both for data entry/validation and for searching/retrieval, would I think be a useful start. @geoffj-FUG's sterling work on the Ireland situation should, I suggest, be taken into account. Over the weekend I have been trying, without much success, to get higher-level records (counties and Registration Districts) out of the TNA Discovery API. I'm assuming that it would be useful to have a complete geographical hierarchy for each census year. If we had this, could we just indicate where each Piece fits into the hierarchy? As regards the serialization of the new parms, I would personally prefer XML (which is the format I work in all the time), so it would be useful to have at least the definition of an XML version which I can use as a target, even if the results are then converted to JSON.

Captainkirkdawson commented 4 years ago

@richardofsussex One of my questions for you was going to be whether you can obtain the Registration District, so you were already ahead of me in trying to extract that. Perhaps you could point me towards a document on the API, as that might (or might not) help me.

richardofsussex commented 4 years ago

This page gives you access to the API Sandbox: https://discovery.nationalarchives.gov.uk/API/sandbox/index#/SearchRecords It includes fields for you to fill in, and shows you what the result will be. It also displays your search as a URL and as a call to curl, so you can 'take it away' and run it in a different environment. I have complete files for 1841-1891 which came from their TSV files, and which might be an easier source of Reg. Districts. I'm currently investigating the 'children' API command, which gives me a list of county records (level 4). The Reg. Districts are level 5 records.

Captainkirkdawson commented 4 years ago

Thanks, Richard. A quick scan indicates that the cost of my getting to understand that API would far exceed the benefit. I will leave it in your very capable hands.

richardofsussex commented 4 years ago

I think I can see a way to extract Reg. Districts via a two-stage XML extraction process. If this is definitely needed I will plough on with it. I'll do 1901 and 1911 first. Does it make sense to extract the TNA identifier for each unit, as well as its title? This would give us access to the full record, both for display, as a potential link for our users, and for programmatic access to the data it contains. The approach could be used at lower levels, e.g. parishes, and would remove one of my concerns which is the fact that the parish name, of itself, is not a unique identifier (i.e. there are several with the same name, sometimes in the same county!).

Captainkirkdawson commented 4 years ago

@richardofsussex Is the TNA identifier the C13335 in the following example? https://discovery.nationalarchives.gov.uk/browse/r/h/C13335 This number changes as one drills down to the actual piece, so it is a reference to a specific page. The following image has the number C3139040 for the piece.

Now, before I go further with the specification: do you get access to the titles of the parents?

[image: parents]

i.e. the LONDON-MIDDLESEX and the Registration District 1.KENSINGTON? I know you extract the RG 10/2 Registration Sub-District 1B St Mary Paddington and the Civil Parish, Township or Place: Paddington. I presume you can also extract any notes that follow the Paddington,

e.g. the (3) in

[image: comment]

That may also be (part), followed by another comment.

Last point: there is no benefit in moving away from CSV. The team is happy with that format, and there is enough new stuff going on that introducing a new one to them is not worthwhile.

richardofsussex commented 4 years ago

Yes, the 'C' numbers are record identifiers within the TNA catalogue. Being an archival recording system, there is a hierarchical relationship between these records, and each has a specific 'level' recorded.
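
A minimal sketch of pulling the 'C' number out of a Discovery browse URL like the one quoted above. It assumes that identifiers always take the form "C" followed by digits at the end of the URL path, which is only inferred from the two examples in this thread (C13335 and C3139040); the function name is hypothetical.

```python
import re
from typing import Optional

# Assumption: a TNA catalogue identifier is "C" plus digits at the end of the path.
_TNA_ID = re.compile(r"/(C\d+)/?$")


def tna_id_from_url(url: str) -> Optional[str]:
    """Return the trailing 'C' number of a Discovery URL, or None if there is none."""
    match = _TNA_ID.search(url)
    return match.group(1) if match else None


example = "https://discovery.nationalarchives.gov.uk/browse/r/h/C13335"
print(tna_id_from_url(example))
```

Storing the identifier alongside each title would let later processing key on it rather than on a name that may not be unique.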

In my earlier work I was using the record search facility, which returns an XML response including a RecordSearchResultViewModels element for each record. This search works nicely for level 6 records, i.e. Registration Sub-Districts, but doesn't seem to be able to find higher-level records. These search results do not include links to their parent record.

More recently, I have been exploring the 'children of' option, starting from the top level, e.g. RG10, and hoping to work all the way down to Sub-Districts. This search returns InformationAssetIdentityViewModel elements, which do have a ParentId specified. However, since we are attacking them 'from above', we know that parent identity already.

I am currently stalled by a combination of minor irritations. I want to recursively call for child records using the XSLT document() function. By default the API delivers JSON, and I have bought a licence allowing me to use the XSLT 3.0 facility to load a JSON string, but for some reason my XML editor won't accept that I have this licence. I can't load XML because the URL syntax doesn't support that.

Happy to stick with CSV.

richardofsussex commented 4 years ago

OK, I have finally succeeded in extracting what I think is all the data we need from the TNA Discovery catalogue, for 1901. I did this (in case you're interested) by applying an XSLT 3.0 transform to the top-level XML export, then recursively asking for the child records at each level until I got to Sub-District. I'm not suggesting that we actually load the information in this format; at this point in the proceedings I just want agreement that I have extracted everything we need. Although I had a learning experience working with JSON in XSLT, this did have the advantage that the individual parishes are all neatly surrounded by paragraph markup in this format - which is not the case with their XML download format. rg13-overview.zip
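
The "ask for the children at each level until Sub-District" approach can be sketched in Python rather than XSLT. This is not the transform used in the thread: the record shape (`id`, `title`, `level`), the `fetch_children` callable and the stub identifiers are all hypothetical, and the stub stands in for a real call to the 'children' API command. Only the level numbers (county 4, Registration District 5, Sub-District 6) and the titles come from the discussion above.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def walk(record: Record,
         fetch_children: Callable[[str], List[Record]],
         max_level: int = 6) -> Record:
    """Copy a record and attach its descendants, stopping at max_level."""
    node = dict(record)
    node["children"] = []
    if record["level"] < max_level:
        for child in fetch_children(record["id"]):
            node["children"].append(walk(child, fetch_children, max_level))
    return node


# Stand-in data: made-up identifiers; the level of the top record is a guess.
STUB: Dict[str, List[Record]] = {
    "series-1": [{"id": "county-1", "title": "LONDON-MIDDLESEX", "level": 4}],
    "county-1": [{"id": "district-1", "title": "1.KENSINGTON", "level": 5}],
    "district-1": [{"id": "subdistrict-1", "title": "1B St Mary Paddington", "level": 6}],
}


def stub_fetch(parent_id: str) -> List[Record]:
    """Hypothetical replacement for a 'children of' request."""
    return STUB.get(parent_id, [])


tree = walk({"id": "series-1", "title": "RG 10", "level": 3}, stub_fetch)
```

Because the parent is known at each call, the hierarchy is captured by nesting alone, without relying on a parent link in the response.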

richardofsussex commented 4 years ago

rg14-overview.zip Here is RG14 (1911) in the same format. Does anyone know what the codes (e.g. RD 1 RS 1 ED 1) mean?

richardofsussex commented 4 years ago

Assuming I have captured all the information that is required, it should be a quick and easy job to convert these XML 'overview' documents into CSV PARMS files in @Captainkirkdawson's preferred format. You might like to think about having a single CSV load for each census year, rather than uploading the data county by county.

richardofsussex commented 4 years ago

Rats: I have just realised that these results may not be complete (default limit on search results ...). So just comment on the format and content, please.

richardofsussex commented 4 years ago

rg14-overview.zip rg13-overview.zip These should be complete. 1901 looks pretty good: there are some missing labels at the end of 1911.

geoffj-FUG commented 4 years ago

Richard

ED1 is the Enumeration District. In 1911 there was a one to one relationship with the piece number.

EDs start at 1 for each Registration sub-District.

There are several Registration sub-Districts to a Registration District. In 1911 the Registration Districts each had a number.

So it is part of the hierarchy County, Registration District, Registration sub-District, ED number.

Geoff

Captainkirkdawson commented 4 years ago

@richardofsussex At this point, after a 5-minute review, all I can say is "Hats off". That is magnificent. More comments to follow, but thank you: it has all of what I was going to ask for and a little more besides.

Captainkirkdawson commented 4 years ago

@richardofsussex As Geoff indicated, the RD 1 RS 1 ED 10 in 1911 refers back to the Registration District, Sub-District and Enumeration District number (a number within the sub-district).
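
A small sketch of splitting such a code into its three numbers. The regular expression assumes the three parts always appear in the order RD, RS, ED with optional spaces before each number; the function and key names are illustrative only.

```python
import re
from typing import Dict, Optional

# Assumption: codes look like "RD 1 RS 1 ED 10", in that order.
_CODE = re.compile(r"RD\s*(\d+)\s+RS\s*(\d+)\s+ED\s*(\d+)")


def parse_code(text: str) -> Optional[Dict[str, int]]:
    """Extract district, sub-district and enumeration district numbers, if present."""
    match = _CODE.search(text)
    if match is None:
        return None
    rd, rs, ed = (int(group) for group in match.groups())
    return {"district": rd, "sub_district": rs, "enumeration_district": ed}


print(parse_code("RD 1 RS 1 ED 10"))
```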

Captainkirkdawson commented 4 years ago

@richardofsussex I am tending at this point to say just provide the yearly xml and we extract into the database from the XML directly rather than converting and reconverting. Which based on all of this could be usefully reshaped. How about trying 1861?

richardofsussex commented 4 years ago

Will do, but I'll start by tidying up 1911 a bit. There is missing data at the end, and now I know what the codes mean I think I'll make them into an attribute rather than a 'parish'.

geoffj-FUG commented 4 years ago

Richard

I have just opened your XML file in Dreamweaver (the limit of my expertise I am afraid) and the structure looks good to me. It is the complete hierarchy.

If I look beyond basic FreeCEN functionality to future use as Open Data the structure you have been able to extract will enable a wide range of targeted data to be obtained from the FreeCEN database.

Geoff

richardofsussex commented 4 years ago

rg14-overview.zip rg14-overview-2.zip Here is 1911 in more complete form, with the naval and military entries (which had a different structure from normal parishes). The first file is the default result; the second moves the ED code up to an attribute.

richardofsussex commented 4 years ago

To my delight, the 1861 data gives a viable result first time. (Apart from the ships!) The pattern of data is different, e.g.

`Registration Sub-District 1 St Mary Paddington. Parish: Paddington (9); Hamlet: Kensal Green (part) (Divided between RG 9/1, 14, 39 and 785).`

I'll write a tidying-up XSLT to put the sub-district names where they ought to be. I can take the unnecessary 'headings' out of the data and split up the parish entries. Tempted to name the elements after the heading they have been given, so the second entry (Kensal Green) would become a 'hamlet' element - then you'll know which are the civil parishes. Good plan? [rg9-overview.zip](https://github.com/FreeUKGen/FreeCENMigration/files/4592475/rg9-overview.zip)
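
The tidying-up described here can be sketched as follows, in Python rather than XSLT. It assumes every description follows the pattern of the single example quoted above (heading, full stop, then `Type: text` entries separated by semicolons); real data will contain variations that this does not handle, and the output keys are hypothetical.

```python
import re
from typing import Any, Dict, List

# Assumption: "Registration Sub-District <number> <name>. <entries>"
_HEADING = re.compile(r"^Registration Sub-District\s+(\S+)\s+(.*?)\.\s+(.*)$")


def parse_sub_district(text: str) -> Dict[str, Any]:
    """Split a sub-district description into its number, name and typed entries."""
    match = _HEADING.match(text.strip())
    if match is None:
        raise ValueError("unrecognised sub-district description")
    number, name, rest = match.groups()
    places: List[Dict[str, str]] = []
    for entry in rest.rstrip(".").split(";"):
        kind, sep, value = entry.partition(":")
        if not sep:  # no "Type:" prefix, keep the whole entry as text
            kind, value = "", entry
        places.append({"type": kind.strip().lower(), "text": value.strip()})
    return {"number": number, "name": name, "places": places}


sample = ("Registration Sub-District 1 St Mary Paddington. Parish: Paddington (9); "
          "Hamlet: Kensal Green (part) (Divided between RG 9/1, 14, 39 and 785).")
result = parse_sub_district(sample)
```

Keeping the heading ("parish", "hamlet") as a type on each entry is what makes it possible to tell the civil parishes from the rest.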

richardofsussex commented 4 years ago

rg9-overview-2.zip Updated RG9 overview for comments.

richardofsussex commented 4 years ago

rg10-overview-3.zip RG10 with tidying-up of parishes and hamlets.

Captainkirkdawson commented 4 years ago

@richardofsussex I need to take stock of what you have provided to date, so hold fire on tweaking for just a little, pending feedback. I will work through 14/13/9/10 if this is OK, or do you want a different order? Need to have breakfast first.

richardofsussex commented 4 years ago

No problem: I'm happy with that order. Currently working on 11.

richardofsussex commented 4 years ago

rg12-overview-3.zip rg11-overview-3.zip 11 and 12.

Captainkirkdawson commented 4 years ago

@richardofsussex wrt rg14-overview-2: Excellent, but there is one common issue: the extraction of the code RD X RS Y ED Z is inconsistent. Look at lines 66350-66514 for examples. It appears that it only works if there is 1 parish; it does not if there are 2 or more. The Royal Navy and Military: each piece ends with the District number, e.g. RD640 and 641. Would it be possible to place that in the District name and remove it as a parish?

Captainkirkdawson commented 4 years ago

@richardofsussex wrt rg13-overview Usable as is

Captainkirkdawson commented 4 years ago

@richardofsussex wrt rg9-overview-2 a) Can we make the hamlet a child of the parish please b) Can we extract the Islands from the Isles of the British islands and their parish children

PS the tnaid is an absolute godsend; just hope they never change it!!

Captainkirkdawson commented 4 years ago

@richardofsussex wrt rg10-overview-3 a) Really need the District name as well as its number, e.g. [image: district name] [image: name] b) Like rg9, hamlet to be a child of the parish c) For Royal Navy the ships; should be the district name [image: ships] [image: shipsa]

richardofsussex commented 4 years ago

@Captainkirkdawson thank you very much for the comments. I'm packing up now for the day, but I'll finish with a not-quite-ready 1841 - the townships need separating into separate elements. Tomorrow is a Bank Holiday, so it will probably be next week before I get back to this. Still, lots of progress! 1841-overview-2.zip

Captainkirkdawson commented 4 years ago

@richardofsussex wrt rg11-overview-3 a) we are missing the county name b) as in rg10 we are missing the district name c) we are missing the subdistrict name d) when fixing c) please ensure that royal navy subdistrict name is Ships as noted for rg10

Captainkirkdawson commented 4 years ago

@richardofsussex wrt rg12-overview-3 a) as in rg10 and 11 we are missing the district name b) missing district number

Captainkirkdawson commented 4 years ago

@richardofsussex Thank you for the work to date. I have plenty to get on with here, and also an issue on REG that is urgent. I'd like to compliment you on the work to date: we are extremely close to an excellent solution that will give us a firm base for the future and some new capabilities. Have a good weekend and avoid the virus.

richardofsussex commented 4 years ago

1841-overview-3.zip Split the townships while the leeks were cooking!

Captainkirkdawson commented 4 years ago

@richardofsussex wrt 1841-overview-3 As you have correctly noted, there were no districts or subdistricts in 1841. They had Hundred or Wapentake, depending upon the county. Soke and Liberty were also possible units but appear to have been associated with a hamlet or township. For our use we should use the term that is in the census, i.e. Hundred or Wapentake, as a child within a county. It will be useful to display this to the researcher. It will have a name associated with it and have parishes as children. The parish may have hamlets or townships as children. In some counties there is also ALLOTMENTS IN FENS; this should be treated as a child.

richardofsussex commented 4 years ago

It looks as though a recording policy is emerging as we work through the data. You're encouraging me to retain the original naming of place types and to express the hierarchical relationships between them where known/knowable. Thus far I haven't always done this, tending more towards harmonization of their varying practice. Also (now I come to think about it) I have used different encoding approaches at different levels. Would it make sense to make our approach consistent, by putting all place names into a <place> element, with type and name attributes? Then when working with the data, you could choose either to notice or to ignore the place type.

FreecenBren commented 4 years ago

Well said Richard. I think that is a really good idea.

richardofsussex commented 4 years ago

Thus instead of:

`Great Barford Colmworth (3) Eaton-Socon Wyboston Goldington` (element markup not preserved)

we would have:

(example markup not preserved)

(Note the township now correctly within the parish: this would be hard to do cleanly with our current encoding strategy.) The important thing is that the data should be in a form which is most useful for the various tasks it will have to perform, both for input/validation and for searching/browsing/retrieval.

Captainkirkdawson commented 4 years ago

@richardofsussex I have to say that I am now 100% confused. At the start of this week I had been working on a new specification for parms CSV files that were to be populated by the API extracts. All censuses for England from 1851 to 1911 have essentially the same structure; 1841 is uniquely different. I was evolving toward Census/Country/District/Subdistrict/Parish/Hamlet/Notes fields, which would fit into a revised database structure. I never got to the point of writing that specification, because the revised extracts arrived, based on the parent-child extracts in XML. These were seen as likely to make the CSV specification irrelevant, since XML files can be directly converted into a Ruby hash and are based on the TNA structure. At no stage in this have I used the term "place". That was deliberate because, unfortunately, place means something different to everyone; even TNA avoids its use except in the occasional text. It is a term that causes miscommunication throughout FreeUKGEN. The new proposal appears to have a flat structure with every element a place. The actual coding is beyond my level of understanding of XML, with the trailing / and the place closure, e.g. (example markup not preserved). Programmatic processing of such a file is currently conceptually beyond me. Perhaps a game of golf will clear my brain.

richardofsussex commented 4 years ago

[image] @Captainkirkdawson No, the suggested structure is anything but flat. I apologize: I omitted the end-tags from my example and then GitHub lumped all the markup together. Hopefully the image above makes my intention clearer. This is only a suggestion, inspired in part by your suggestion that we should retain the distinction between hundreds, wapentakes, sokes, etc. for 1841. I'm happy to go the other way, towards a more uniform structure.
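
Since the original markup was lost, here is a hedged illustration of what a nested `<place>` structure with `type` and `name` attributes might look like, and how a program could either notice or ignore the type. The `<places>` wrapper, the attribute values and the assumption that Wyboston is the township inside Eaton-Socon are guesses made for illustration; they are not taken from the lost example or the image.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Optional, Tuple

# Hypothetical markup: nesting, not element names, carries the hierarchy.
SAMPLE = """\
<places>
  <place type="parish" name="Great Barford"/>
  <place type="parish" name="Colmworth"/>
  <place type="parish" name="Eaton-Socon">
    <place type="township" name="Wyboston"/>
  </place>
  <place type="parish" name="Goldington"/>
</places>
"""


def iter_places(element: ET.Element,
                depth: int = 0) -> Iterator[Tuple[int, Optional[str], Optional[str]]]:
    """Yield (depth, type, name) for every nested <place>, parents before children."""
    for child in element.findall("place"):
        yield depth, child.get("type"), child.get("name")
        yield from iter_places(child, depth + 1)


rows = list(iter_places(ET.fromstring(SAMPLE)))
names_only = [name for _, _, name in rows]          # ignoring the place type
parishes = [name for _, kind, name in rows if kind == "parish"]  # noticing it
```

The same recursive walk works whatever the place types are, which is the point of putting the type in an attribute rather than in the element name.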

richardofsussex commented 4 years ago

1851-overview-2.zip @Captainkirkdawson here is 1851 (the last one to be addressed). I have made the XML with specific tags for each level: District, Sub-District, Parish and Hamlet. They should all have a name attribute, and possibly piece (Districts only), code and note attributes. Is this a pattern which you would be happy for me to apply to all the later years? If we can agree on that format, then we can look at 1841 as a special case and decide how to deal with it.
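
As a rough sketch of how a file in this shape could be flattened into CSV rows, assuming lowercase element names (`district`, `subdistrict`, `parish`, `hamlet`) under a `census` root: the actual tag spellings in the attached files are not shown in this thread, so these names, the sample content and the column list are all assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical sample in the described shape: one tag per level, name/note attributes.
SAMPLE = """\
<census>
  <district name="Kensington">
    <subdistrict name="St Mary Paddington">
      <parish name="Paddington" note="9">
        <hamlet name="Kensal Green" note="part"/>
      </parish>
    </subdistrict>
  </district>
</census>
"""

FIELDS = ["district", "subdistrict", "parish", "hamlet", "note"]


def flatten(root: ET.Element) -> List[Dict[str, str]]:
    """Emit one row per parish and one per hamlet, repeating the parent names."""
    rows: List[Dict[str, str]] = []
    for district in root.iter("district"):
        for sub in district.findall("subdistrict"):
            for parish in sub.findall("parish"):
                base = {"district": district.get("name", ""),
                        "subdistrict": sub.get("name", ""),
                        "parish": parish.get("name", "")}
                rows.append({**base, "hamlet": "", "note": parish.get("note", "")})
                for hamlet in parish.findall("hamlet"):
                    rows.append({**base, "hamlet": hamlet.get("name", ""),
                                 "note": hamlet.get("note", "")})
    return rows


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the rows with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten(ET.fromstring(SAMPLE))
print(to_csv(rows))
```

Note that in this sketch a sub-district with no parishes produces no rows at all, so a real loader would need to decide how to represent that case.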

richardofsussex commented 4 years ago

rg14-overview-4.zip @Captainkirkdawson I've made the changes you requested to rg14. Also placed the parish name in a 'name' attribute, and split off comments in brackets into a 'note' attribute.
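
One way the name/note split could work, shown as a sketch rather than the method actually used: every bracketed comment is lifted out of the text and the remainder becomes the name. It assumes brackets are never nested, and joining several comments with "; " is an arbitrary choice made here.

```python
import re
from typing import Tuple

# Assumption: notes are non-nested bracketed comments such as "(9)" or "(part)".
_BRACKETED = re.compile(r"\(([^()]*)\)")


def split_name_and_note(text: str) -> Tuple[str, str]:
    """Return (name, note), where note joins all bracketed comments in order."""
    notes = [note.strip() for note in _BRACKETED.findall(text)]
    name = " ".join(_BRACKETED.sub("", text).split())  # collapse leftover spaces
    return name, "; ".join(notes)


print(split_name_and_note("Paddington (9)"))
```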

Captainkirkdawson commented 4 years ago

@richardofsussex wrt 1851-overview-2 A quick review shows a problem. There are no parishes for many sub districts. Just look at the first 2. I suspect that parish is not being extracted if there is no hamlet.
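
A check of this kind can be automated. The sketch below lists every sub-district element that has no parish child; the element names (`subdistrict`, `parish`) and the sample document are assumptions, since the real tag names in the attached file are not shown here.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical sample with placeholder names: the first sub-district has no parish.
SAMPLE = """\
<census>
  <district name="District A">
    <subdistrict name="Sub-District 1"/>
    <subdistrict name="Sub-District 2">
      <parish name="Parish X"/>
    </subdistrict>
  </district>
</census>
"""


def subdistricts_without_parishes(root: ET.Element) -> List[str]:
    """Return the names of sub-districts that contain no parish element."""
    return [sub.get("name", "")
            for sub in root.iter("subdistrict")
            if sub.find("parish") is None]


missing = subdistricts_without_parishes(ET.fromstring(SAMPLE))
print(missing)
```

Running such a check on each new extract would flag this class of problem before the file is reviewed by eye.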

richardofsussex commented 4 years ago

Good catch: I think I can fix this quickly.

richardofsussex commented 4 years ago

1851-overview-2.zip Here we go ...

Captainkirkdawson commented 4 years ago

@richardofsussex New version of 1851-overview-2 looks to be fine. Like the notes extraction. Will not know more until I try to process it into the database. Will not be attempting that for a while, as I have too much on my plate. rg14-overview-4 has addressed the issues raised and looks OK, subject to the same caveat as above.

richardofsussex commented 4 years ago

That's fine: no rush. So long as I know you are happy with the format I am producing, I can continue to tidy up the other years. So long as the XML is consistently structured, it wouldn't be a big job for me to tweak it, were there to be problems when you come to load it.

richardofsussex commented 4 years ago

rg9-overview-4.zip Here is rg9, with the islands and the name/note analysis. I tried and failed in a search for the ships: "recorded elsewhere" ...

richardofsussex commented 4 years ago

rg10-overview-6.zip Here is the updated rg10, with District names restored, hamlets within parishes and the Royal Navy districts recorded as "Ships".

richardofsussex commented 4 years ago

rg10-overview-6.zip Ditto with the ships actually present! (They were recorded as hamlets, so were stripped out by a routine which expected there always to be a parish present.)