CatalogueLegacies / antconc.github.io

Computational Analysis of Catalogue Data
https://cataloguelegacies.github.io/antconc.github.io/

Develop 9. Next Steps 2: Named Entity Recognition #13

Open drjwbaker opened 4 years ago

drjwbaker commented 4 years ago

https://github.com/CatalogueLegacies/antconc.github.io/blob/gh-pages/_episodes/10-named-entity-recognition.md

Based on rough outline at https://github.com/CatalogueLegacies/antconc.github.io/issues/13#issuecomment-726187081 tasks are:

drjwbaker commented 4 years ago

Based on https://github.com/CatalogueLegacies/antconc.github.io/issues/34#issuecomment-698846025 review project priority from 'Would like to have' to a 'Dev task'.

drjwbaker commented 4 years ago

Tested Stanford NER, forked @brandontlocke's batch script, and added a version that produces text files (line separated) for person markup and location markup respectively: https://github.com/CatalogueLegacies/batchner

This should be sufficient for the NER module: perhaps show how the data is made (seeing as the shell is beyond the scope of the lesson), then put the data back into AntConc to examine word use around marked-up entities.

drjwbaker commented 4 years ago

(NB: I ran out of talent when writing batchner_markup.sh, as it seems to produce extra white space somewhere around the sed command, so it needs a little fettling before use in the module)

drjwbaker commented 3 years ago

rough outline

  1. Intro to NER. What it does: add tags to a corpus (show an e.g.). We can put tags and corpus linguistics together in interesting ways. For example, if you wanted to know the language used around all places, we can do that, as the tags give us an anchor to search around. In AntConc, we'd search for the tag and then look at the concordance.

  2. To generate NER tags you need a little shell (like Episode 8, have an advanced callout for running NER over BL-IAMS). Then provide the NERd BL-IAMS dataset to put into AntConc, search on a tag (place), then look at the results in a few different views. Task here: create a search to find '-ly' words around person tags.

  3. Uses of NER+AntConc. a) checking variety of place names in a catalogue - e.g. use Concordance Plot to see if there are suspiciously un-NERd parts of the catalogue to look for place names off the computational grid, that might be missed in a semi-automated record improvement initiative. b) find discrepancies in how types of people are discussed, by looking at adjectives around persons, and sorting a concordance by name - cleaning/repairing legacy data.
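
As a rough illustration of the task in point 2, the same kind of search can be sketched outside AntConc. This is only a sketch under assumptions: it takes the tagged text to be whitespace-separated tokens with `/PERSON` appended to tagged tokens (as in the examples later in this thread), and the function name `ly_words_near_person`, the `window` parameter and the sample sentence are made up for illustration.

```python
import re

def ly_words_near_person(text, window=3):
    """Return (person_token, ly_word) pairs where a word ending in -ly
    appears within `window` tokens either side of a /PERSON token.

    Assumes whitespace-separated tokens with '/PERSON' appended to
    tagged tokens (an assumption about the markup format).
    """
    tokens = text.split()
    hits = []
    for i, tok in enumerate(tokens):
        if not tok.endswith("/PERSON"):
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for neighbour in tokens[lo:i] + tokens[i + 1:hi]:
            word = neighbour.strip(".,;:()\"'").lower()
            # Skip other tagged tokens; keep plain words ending in -ly.
            if "/" not in word and re.fullmatch(r"[a-z]+ly", word):
                hits.append((tok, word))
    return hits

# Invented sample sentence, not taken from BL-IAMS.
sample = "Letter kindly sent by John/PERSON Smith/PERSON to the Duke"
print(ly_words_near_person(sample))
```

Note that a plain '-ly' pattern also matches nouns such as "family", so results need reading rather than just counting.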

drjwbaker commented 3 years ago

On 3., @rossi-uk try:

drjwbaker commented 3 years ago

This also works nicely in 'Collocates' with the 'window span' set to the left of the search term. [Screenshot from 2020-11-20 16-09-20]

drjwbaker commented 3 years ago

but then stuff like the attached (warning, takes ages) shows the limits of the approach (that is, it doesn't pick up initials before names or titles, so the 'by' above will be only an impression). [Screenshot from 2020-11-20 16-22-26]

drjwbaker commented 3 years ago

With the _places file, we can do things like this: [Screenshot from 2020-11-20 16-53-17]

rossi-uk commented 3 years ago

In terms of numbers, there are 45574 instances of /PERSON and yes, AntConc does take a while and appears to crash. It appears initials are tagged separately. [Image: NER1]
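
Because initials and surnames are tagged separately, a raw count of /PERSON tokens overstates the number of person mentions. A minimal sketch of the difference, assuming whitespace-separated tokens with the tag appended (the function name and the idea of counting "runs" of adjacent tagged tokens are illustrative, not part of batchner):

```python
def count_person_tokens_and_runs(text, tag="/PERSON"):
    """Count tagged tokens, and runs of adjacent tagged tokens.

    A run such as 'C./PERSON Frere/PERSON' counts as two tokens but
    one run, which is closer to one mention of a person.
    """
    token_count = 0
    run_count = 0
    previous_tagged = False
    for token in text.split():
        tagged = token.endswith(tag)
        if tagged:
            token_count += 1
            if not previous_tagged:
                run_count += 1  # first token of a new run
        previous_tagged = tagged
    return token_count, run_count

# Sample built from the name patterns quoted in this thread.
sample = "Miss C./PERSON Frere/PERSON and Mrs H.C./PERSON Noyce/PERSON"
print(count_person_tokens_and_runs(sample))
```

Two different people listed next to each other with nothing in between would still be merged into one run, so this is an approximation too.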

rossi-uk commented 3 years ago

There are 21082 instances of /LOCATION. Below is an example of person initials mistaken for a location. [Image: NER2]

rossi-uk commented 3 years ago

"With the _places file, we can do things like this"

Other examples: outside */LOCATION (26 instances), the village of */LOCATION (), towards */LOCATION (89 instances). [Image: NER3]
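
For anyone whose AntConc struggles with these queries, searches of the 'outside */LOCATION' kind can be approximated with a short script. This is a hedged sketch: it assumes whitespace-separated tokens with `/LOCATION` appended to tagged tokens, and `left_contexts` and its `width` parameter are hypothetical names chosen for illustration.

```python
from collections import Counter

def left_contexts(text, tag="/LOCATION", width=1):
    """Count the `width` words immediately to the left of each tagged run.

    Only the first token of a multi-word place (e.g. New/LOCATION
    Delhi/LOCATION) contributes a context.
    """
    tokens = text.split()
    contexts = Counter()
    previous_tagged = False
    for i, token in enumerate(tokens):
        tagged = token.endswith(tag)
        if tagged and not previous_tagged:
            left = tokens[max(0, i - width):i]
            contexts[" ".join(left).lower()] += 1
        previous_tagged = tagged
    return contexts

# Invented sample text, not taken from the catalogue data.
sample = ("outside Calcutta/LOCATION and towards Simla/LOCATION "
          "then outside New/LOCATION Delhi/LOCATION")
print(left_contexts(sample).most_common())
```

Setting `width=3` would pick up longer patterns such as 'the village of'.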

drjwbaker commented 3 years ago

  1. Positive thing: 'the village of */LOCATION'
  2. Negative about NER: mistakes around initials. AntConc is good for looking at transformations of data using tools we don't fully understand. Need for specialist gazetteers for locations (point people to PH lesson on this)

drjwbaker commented 3 years ago

Point of episode is about using AntConc to work with marked up text and/or third party tools.

drjwbaker commented 3 years ago

Info on better NER https://github.com/impresso/named-entity-tutorial-dh2019/tree/master/slides

drjwbaker commented 3 years ago

@rossi-uk I had a run through this afternoon and we are more or less there with this episode! Just needs a second task (which I think you said you'd make a start on, but do correct me if I'm wrong!) and a run through (because you know my ability to introduce stupid errors!)

drjwbaker commented 3 years ago

@rossi-uk We have an action to add a second task. Do we still want to do that, or shall we close without it?

rossi-uk commented 3 years ago

I was thinking of a task around identifying mentions of women, for example searching for Lady xxx and working out the percentage of total references to women recognised by the NER, then considering the omissions. AntConc is struggling with processing queries on my machine. Some observations:

omissions by NER

Note on NER patterns: it recognises names when a title is followed by an initial: Miss C./PERSON Frere/PERSON; Mrs H.C./PERSON Noyce/PERSON
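
One possible way to estimate the share of titled references that the NER recognised is sketched below. It rests on several assumptions: tokens are whitespace-separated with `/PERSON` appended, the title itself is untagged (as in the two examples above), and the `TITLES` set, the function name `titled_mentions` and the untagged "Lady Canning" in the sample are all invented for illustration.

```python
TITLES = {"Miss", "Mrs", "Lady"}  # illustrative list, an assumption

def titled_mentions(text):
    """Return (title, following_name, recognised) for each title found.

    `recognised` is True when the title is directly followed by at
    least one /PERSON token; otherwise the next plain token is
    reported as a possible omission by the NER.
    """
    tokens = text.split()
    results = []
    for i, token in enumerate(tokens):
        if token not in TITLES:
            continue
        name = []
        j = i + 1
        while j < len(tokens) and tokens[j].endswith("/PERSON"):
            name.append(tokens[j][:-len("/PERSON")])  # drop the tag
            j += 1
        if name:
            results.append((token, " ".join(name), True))
        elif i + 1 < len(tokens):
            results.append((token, tokens[i + 1], False))
    return results

# Sample mixes the patterns quoted above with one invented omission.
sample = ("Miss C./PERSON Frere/PERSON wrote to Lady Canning "
          "about Mrs H.C./PERSON Noyce/PERSON")
mentions = titled_mentions(sample)
recognised = sum(1 for _, _, ok in mentions if ok)
print(mentions)
print(f"{recognised} of {len(mentions)} titled mentions recognised")
```

If the NER sometimes tags the title itself, the comparison `token not in TITLES` would need to strip the tag first.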

Scarcity of adjectives with names? Punctuation around names, some are lists, and or with Mrs /PERSON collocates with Lady /PERSON ...