drjwbaker opened 4 years ago
Based on https://github.com/CatalogueLegacies/antconc.github.io/issues/34#issuecomment-698846025, revising project priority from 'Would like to have' to 'Dev task'.
Tested Stanford NER, forked @brandontlocke's batch script, and added a version that produces text files (one entity per line) for person markup and location markup respectively: https://github.com/CatalogueLegacies/batchner
This should be sufficient for the NER module: perhaps show how the data is made (given that the shell is beyond the scope of the lesson), then put the data back into AntConc to examine word use around marked-up entities.
(NB: I ran out of talent when writing batchner_markup.sh, which seems to produce extra whitespace somewhere around the sed command, so it needs a little fettling before use in the module.)
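For anyone picking this up, a minimal cleanup sketch; it assumes the stray whitespace is runs of doubled spaces and trailing spaces left by the substitution (file names here are placeholders, not the script's actual outputs):

```sh
# Squeeze runs of spaces to a single space and strip trailing spaces.
# Assumes the extra whitespace is plain spaces (not tabs); file names are hypothetical.
sed -e 's/  */ /g' -e 's/ *$//' corpus_markup.txt > corpus_markup_clean.txt
```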
Intro to NER. What it does: adds tags to a corpus (show an example, e.g. the illustrative line below). We can put tags and corpus linguistics together in interesting ways. For example, if you wanted to know the language used around all places, we can do that, as the tags give us an anchor to search around. In AntConc, we'd search for the tag and then look at the concordance.
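A tagged line might look like this (an illustrative example using the slash-tag format seen later in this thread; the wording is invented):

```
Letters of Miss C./PERSON Frere/PERSON written at London/LOCATION
```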
To generate NER tags you need a little shell (like Episode 8; have an advanced callout for running NER over BL-IAMS using the shell). Then provide the NERd BL-IAMS dataset to put into AntConc, search on a tag (place), then look at the results in a few different views. Task here: create a search to find '-ly' words around person tags (a possible query below).
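One possible query for that task (a sketch in AntConc wildcard syntax, where `*` matches any run of characters; not tested against the dataset):

```
*ly */PERSON
```

Run in the Concordance tool, this finds a token ending in -ly immediately before a tagged person; sorting the results then helps spot -ly words at other positions.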
Uses of NER+AntConc:
a) Checking the variety of place names in a catalogue - e.g. use the Concordance Plot to see if there are suspiciously un-NERd parts of the catalogue, to look for place names off the computational grid that might be missed in a semi-automated record improvement initiative.
b) Finding discrepancies in how types of people are discussed, by looking at adjectives around persons and sorting a concordance by name - cleaning/repairing legacy data.
(Starting-point queries for both below.)
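As starting points for (a) and (b), the bare tag searches used later in this thread would do (a sketch):

```
*/LOCATION
*/PERSON
```

For (a), run `*/LOCATION` in the Concordance Plot and look for files or stretches with suspiciously few hits; for (b), run `*/PERSON` in the Concordance sorted 1L so the preceding adjectives group together.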
On 3., @rossi-uk try:
`by */PERSON`
with 'Words' and 'Case' selected, and Kwic Sort Level 1 set to 1L. This also works nicely in the 'Collocates' tool with the 'window span' set to the left of the search term.
But then stuff like the attached (warning: takes ages) shows the limits of the approach - it doesn't pick up initials before names, or titles, so the 'by' above gives only an impression.
With the _places file, we can do things like this
In terms of numbers, there are 45,574 instances of /PERSON and, yes, AntConc does take a while and appears to crash. It appears initials are tagged separately.
There are 21,082 instances of /LOCATION. Below is an example of person initials mistaken for a location.
Other examples: `outside /LOCATION` (26 instances), `the village of /LOCATION` (), `towards */LOCATION` (89 instances)
The point of the episode is using AntConc to work with marked-up text and/or third-party tools.
Info on better NER https://github.com/impresso/named-entity-tutorial-dh2019/tree/master/slides
@rossi-uk I had a run through this afternoon and we are more or less there with this episode! It just needs a second task (which I think you said you'd make a start on, but do correct me if I'm wrong!) and a run-through (because you know my ability to introduce stupid errors!)
@rossi-uk We have an action to add a second task. Do we still want to do that, or shall we close without it?
I was thinking of a task around identifying mentions of women, for example searching for Lady xxx and the percentage of total references to women recognised by the NER, considering the omissions (a query sketch follows the observations below). AntConc is struggling to process queries on my machine. Some observations:
- Note NER patterns: it recognises names when a title is followed by an initial: `Miss C./PERSON Frere/PERSON`; `Mrs H.C./PERSON Noyce/PERSON`.
- Scarcity of adjectives with names? Punctuation around names; some are lists joined with 'and'/'or'. `Mrs /PERSON` collocates with `Lady /PERSON` ...
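A sketch of how that task could run (AntConc wildcard queries; the counts would need checking, and the inconsistent tagging of titles noted above will skew both):

```
Lady */PERSON
Lady *
```

Comparing the two hit counts gives a rough percentage of 'Lady' mentions recognised by the NER; the gap is the omissions to consider.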
https://github.com/CatalogueLegacies/antconc.github.io/blob/gh-pages/_episodes/10-named-entity-recognition.md
Based on the rough outline at https://github.com/CatalogueLegacies/antconc.github.io/issues/13#issuecomment-726187081, tasks are:
1. Run `sh SCRIPT`. Output: 'DATASET_people.txt' and 'DATASET_places.txt' (invocation sketch below).
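For the task text, the invocation might look like this (a sketch; SCRIPT and DATASET are the placeholders from the task list above, and the argument shape is an assumption):

```sh
# Hypothetical invocation of the batch NER script over a folder of catalogue text files.
sh SCRIPT DATASET/
# Expected outputs (one entity per line):
#   DATASET_people.txt
#   DATASET_places.txt
```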