drjwbaker opened 4 years ago
Based on https://github.com/CatalogueLegacies/antconc.github.io/issues/34#issuecomment-698846025, revising project priority from 'Would like to have' to 'Dev task'.
Tested Stanford NER, forked @brandontlocke's batch script, and added a version that produces text files (one entity per line) for person markup and location markup respectively: https://github.com/CatalogueLegacies/batchner
This should be sufficient for the NER module: perhaps show how the data is made (given that the shell is beyond the scope of the lesson), then put the data back into AntConc to examine word use around marked-up entities.
(NB: I ran out of talent when writing batchner_markup.sh, which seems to produce extra whitespace somewhere around the sed command, so it needs a little fettling before use in the module.)
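For anyone picking this up, a minimal cleanup sketch; it assumes the stray whitespace is runs of doubled spaces and trailing spaces left by the substitution (file names here are placeholders, not the script's actual outputs):

```sh
# Squeeze runs of spaces to a single space and strip trailing spaces.
# Assumes the extra whitespace is plain spaces (not tabs); file names are hypothetical.
sed -e 's/  */ /g' -e 's/ *$//' corpus_markup.txt > corpus_markup_clean.txt
```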
Intro to NER. What it does: adds tags to a corpus (show an example, e.g. the illustrative line below). We can put tags and corpus linguistics together in interesting ways. For example, if you wanted to know the language used around all places, we can do that, as the tags give us an anchor to search around. In AntConc, we'd search for the tag and then look at the concordance.
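A tagged line might look like this (an illustrative example using the slash-tag format seen later in this thread; the wording is invented):

```
Letters of Miss C./PERSON Frere/PERSON written at London/LOCATION
```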
To generate NER tags you need a little shell (like Episode 8; have an advanced callout for running NER over BL-IAMS using the shell). Then provide the NERd BL-IAMS dataset to put into AntConc, search on a tag (place), then look at the results in a few different views. Task here: create a search to find '-ly' words around person tags (a possible query below).
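One possible query for that task (a sketch in AntConc wildcard syntax, where `*` matches any run of characters; not tested against the dataset):

```
*ly */PERSON
```

Run in the Concordance tool, this finds a token ending in -ly immediately before a tagged person; sorting the results then helps spot -ly words at other positions.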
Uses of NER+AntConc:
a) Checking the variety of place names in a catalogue - e.g. use the Concordance Plot to see if there are suspiciously un-NERd parts of the catalogue, to look for place names off the computational grid that might be missed in a semi-automated record improvement initiative.
b) Finding discrepancies in how types of people are discussed, by looking at adjectives around persons and sorting a concordance by name - cleaning/repairing legacy data.
(Starting-point queries for both below.)
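As starting points for (a) and (b), the bare tag searches used later in this thread would do (a sketch):

```
*/LOCATION
*/PERSON
```

For (a), run `*/LOCATION` in the Concordance Plot and look for files or stretches with suspiciously few hits; for (b), run `*/PERSON` in the Concordance sorted 1L so the preceding adjectives group together.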
On 3., @rossi-uk try:
`by */PERSON`
with 'Words' and 'Case' selected, and Kwic Sort Level 1 set to 1L. This also works nicely in the 'Collocates' tool with the 'window span' set to the left of the search term.
But then stuff like the attached (warning: takes ages) shows the limits of the approach - it doesn't pick up initials before names, or titles, so the 'by' above gives only an impression.
With the _places file, we can do things like this
In terms of numbers, there are 45,574 instances of /PERSON and, yes, AntConc does take a while and appears to crash. It appears initials are tagged separately.
There are 21,082 instances of /LOCATION. Below is an example of person initials mistaken for a location.
Other examples: `outside /LOCATION` (26 instances), `the village of /LOCATION` (), `towards */LOCATION` (89 instances)
The point of the episode is using AntConc to work with marked-up text and/or third-party tools.
Info on better NER https://github.com/impresso/named-entity-tutorial-dh2019/tree/master/slides
@rossi-uk I had a run through this afternoon and we are more or less there with this episode! It just needs a second task (which I think you said you'd make a start on, but do correct me if I'm wrong!) and a run-through (because you know my ability to introduce stupid errors!)
@rossi-uk We have an action to add a second task. Do we still want to do that, or shall we close without it?
I was thinking of a task around identifying mentions of women, for example searching for Lady xxx and the percentage of total references to women recognised by the NER, considering the omissions (a query sketch follows the observations below). AntConc is struggling to process queries on my machine. Some observations:
- Note NER patterns: it recognises names when a title is followed by an initial: `Miss C./PERSON Frere/PERSON`; `Mrs H.C./PERSON Noyce/PERSON`.
- Scarcity of adjectives with names? Punctuation around names; some are lists joined with 'and'/'or'. `Mrs /PERSON` collocates with `Lady /PERSON` ...
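A sketch of how that task could run (AntConc wildcard queries; the counts would need checking, and the inconsistent tagging of titles noted above will skew both):

```
Lady */PERSON
Lady *
```

Comparing the two hit counts gives a rough percentage of 'Lady' mentions recognised by the NER; the gap is the omissions to consider.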
https://github.com/CatalogueLegacies/antconc.github.io/blob/gh-pages/_episodes/10-named-entity-recognition.md
Based on the rough outline at https://github.com/CatalogueLegacies/antconc.github.io/issues/13#issuecomment-726187081, tasks are:
1. Run `sh SCRIPT`. Output: 'DATASET_people.txt' and 'DATASET_places.txt' (invocation sketch below).
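For the task text, the invocation might look like this (a sketch; SCRIPT and DATASET are the placeholders from the task list above, and the argument shape is an assumption):

```sh
# Hypothetical invocation of the batch NER script over a folder of catalogue text files.
sh SCRIPT DATASET/
# Expected outputs (one entity per line):
#   DATASET_people.txt
#   DATASET_places.txt
```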