Living-with-machines / T-Res

A Toponym Resolution Pipeline for Digitised Historical Newspapers

Umbrella ticket: Output Format and Exploration #164

Open dcsw2 opened 1 year ago

dcsw2 commented 1 year ago

Add requests here for research requirements

kmcdono2 commented 1 year ago

Two core approaches to working with the toponyms:

  1. Analyzing toponyms independently (see the discussion of this as a visualization in #166). This could provide sample-level information such as:
    • average, minimum, and maximum number of toponyms per article and per title
    • spatial "center" of resolved toponyms (e.g. where they cluster) per article and per title
    • how many toponyms of each type (location, building, street, etc.) there are in the total sample
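
A minimal sketch of how these sample-level figures could be computed, assuming each toponym mention is available as a simple record. The field names (`article_id`, `title`, `type`, `lat`, `lon`) and the sample rows are hypothetical, not the actual T-Res output format, and the "center" here is a naive mean of coordinates rather than a proper geodesic centroid.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: one dict per toponym mention (not the real T-Res schema).
# Unresolved mentions carry lat/lon of None.
mentions = [
    {"article_id": "a1", "title": "Title A", "type": "location", "lat": 51.5, "lon": -0.1},
    {"article_id": "a1", "title": "Title A", "type": "location", "lat": 53.5, "lon": -2.2},
    {"article_id": "a1", "title": "Title A", "type": "street", "lat": None, "lon": None},
    {"article_id": "a2", "title": "Title B", "type": "location", "lat": 51.5, "lon": -0.1},
]


def counts_per_group(records, key):
    """Mean/min/max number of mentions per group (e.g. per article or per title).

    Groups with zero mentions never appear in `records`, so they are not counted.
    """
    values = list(Counter(r[key] for r in records).values())
    return {"mean": mean(values), "min": min(values), "max": max(values)}


def naive_center(records):
    """Arithmetic mean of lat/lon over resolved mentions, or None if there are none."""
    points = [(r["lat"], r["lon"]) for r in records if r["lat"] is not None]
    if not points:
        return None
    return (mean([p[0] for p in points]), mean([p[1] for p in points]))


per_article = counts_per_group(mentions, "article_id")
per_title = counts_per_group(mentions, "title")
type_counts = dict(Counter(r["type"] for r in mentions))
center = naive_center(mentions)
```

The same `naive_center` call can be applied to the subset of mentions for one article or one title to get the per-article and per-title centers.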

And, for every resolved and unresolved toponym in the corpus/sample:

  2. Analyzing toponyms dependent on their linguistic context. A basic format would allow us to analyze toponyms within the context of other words in the article. I'm not sure exactly what format this should be (e.g. XML). For the paper we are working on, this would ideally include (not a complete list, but a start):
    • access to POS tags (the answer is to run sentences through the nltk POS tagger; documented in #167)
    • the ability to analyze text within a window to either side of any toponym
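
As a rough illustration of the window requirement, here is a sketch that returns the tokens on either side of a toponym span. The function name `context_window`, the whitespace tokenization, and the example sentence are assumptions for illustration only. The function does not care what the list items are, so it should work equally on the `(token, tag)` pairs that `nltk.pos_tag` returns.

```python
def context_window(tokens, start, end, size=3):
    """Return (left, right) context around the span tokens[start:end].

    `start` is inclusive and `end` is exclusive, so multi-token toponyms
    are supported. Windows are clipped at the sentence boundaries.
    """
    left = tokens[max(0, start - size):start]
    right = tokens[end:end + size]
    return left, right


# Illustrative sentence, tokenized naively on whitespace.
tokens = "The train from Manchester arrived in London late on Monday".split()

i = tokens.index("London")
left, right = context_window(tokens, i, i + 1, size=2)
# left  == ["arrived", "in"]
# right == ["late", "on"]
```

Passing token offsets rather than the toponym string keeps repeated mentions of the same place name distinct.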

kallewesterling commented 1 year ago

I'd be happy to talk through the format question, if you want, @kmcdono2!

dcsw2 commented 1 year ago

I sent an invite

kallewesterling commented 1 year ago

Thank you @dcsw2 - looking forward to chatting!

kmcdono2 commented 1 year ago

We decided to divide this ticket into separate tasks: #166 covers the items under (1). Next week we will discuss how to prepare for (2).

fedenanni commented 1 year ago

Hi all - I am picking up the "access to POS tags" and will keep track of progress in here