Living-with-machines / T-Res

A Toponym Resolution Pipeline for Digitised Historical Newspapers
https://living-with-machines.github.io/T-Res/

Process Experimental Sample 1 #178

Open dcsw2 opened 1 year ago

dcsw2 commented 1 year ago

21 Feb: a first sample to run through T-RES

dcsw2 commented 1 year ago

SAMPLE REQUEST 1

- HMD+LWM collections only
- Date range: 1880-1900
- For every title, take 7 random days per year; this gives 7 issues. For each issue, include all articles, retaining metadata about issues (e.g. we want to know which issue each article belongs to)
- All OCR qualities

NB: the objects of inquiry are both article and issue, so it's important to select content within 7 issues
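A minimal sketch of the sampling rule above, in case it helps pin down the logic. It is not the script actually used. It assumes each article is a dict with hypothetical keys `title` and `full_date` (a `datetime.date`), that one publication day of a title equals one issue, and that the 1880-1900 range is inclusive. The function name `sample_issues` and the `seed` parameter are illustrative only.

```python
import random
from collections import defaultdict
from datetime import date


def sample_issues(articles, days_per_year=7, start_year=1880, end_year=1900, seed=0):
    """Pick up to `days_per_year` random days per title per year and
    return every article published on those days.

    Assumption: `articles` is an iterable of dicts with at least the
    hypothetical keys 'title' (str) and 'full_date' (datetime.date).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # (title, year) -> publication day -> list of articles in that issue
    by_title_year = defaultdict(lambda: defaultdict(list))
    for art in articles:
        day = art["full_date"]
        if start_year <= day.year <= end_year:
            by_title_year[(art["title"], day.year)][day].append(art)

    sample = []
    for key in sorted(by_title_year):
        issues = by_title_year[key]
        days = sorted(issues)
        # Titles with fewer than `days_per_year` issues in a year keep all of them.
        chosen = rng.sample(days, min(days_per_year, len(days)))
        for day in sorted(chosen):
            # Keep whole issues: every article of a chosen day is retained.
            sample.extend(issues[day])
    return sample
```

Sampling whole days rather than individual articles is what keeps the issue as a unit of analysis alongside the article.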

Is below the right set of tasks? Please amend as needed!

kmcdono2 commented 1 year ago

~~We have some thoughts/questions about how to define "1 week":

npedrazzini commented 1 year ago

Sounds good @dcsw2 , I can do that. I can start working on it late this afternoon... if I start a script tonight you might have the sample sometime tomorrow. I'll keep you updated but ping me for anything else in the meantime - I'll be a bit busy with last-minute abstract writing and wrapping up stuff before I switch to part-time next week, but it's on my TODO for the day :white_check_mark:

kmcdono2 commented 1 year ago

T-Res output + article metadata fields:

NLP, issue, art_num, title, collection, full_date, year, month, day, location, word_count, ocrquality, decade, mention, candidates, candidate_names, sent_idx, end_pos, tag, sentence, prediction, prediction_name, ed_score, latlong, wkdt_class

Including toponym mentions that return NIL candidates.

Amended to leave out POS until @dcsw2 and I discuss.
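A rough sketch for checking that a delivered sample carries these fields. It assumes the sample is a CSV file with a header row, which the thread does not confirm; `check_header` and `EXPECTED_FIELDS` are hypothetical names, not part of T-Res.

```python
import csv
import io

# Field names copied from the list above.
EXPECTED_FIELDS = [
    "NLP", "issue", "art_num", "title", "collection", "full_date",
    "year", "month", "day", "location", "word_count", "ocrquality",
    "decade", "mention", "candidates", "candidate_names", "sent_idx",
    "end_pos", "tag", "sentence", "prediction", "prediction_name",
    "ed_score", "latlong", "wkdt_class",
]


def check_header(csv_text):
    """Return (missing, extra) field names for the first row of `csv_text`.

    Assumption: the first row is a header; surrounding whitespace in
    column names is ignored.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip() for name in next(reader)]
    missing = [f for f in EXPECTED_FIELDS if f not in header]
    extra = [h for h in header if h not in EXPECTED_FIELDS]
    return missing, extra
```

For a file on disk, read it into a string first, e.g. `check_header(open(path, encoding="utf-8").read())`, where `path` is wherever the sample was downloaded.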

kmcdono2 commented 1 year ago

> Sounds good @dcsw2 , I can do that. I can start working on it late this afternoon..

Sample in google drive here: https://drive.google.com/drive/folders/1GCQJXT2ZI_EtGgHQeqOyn6TYe4Ww7lQI

Sample stored in azure here: storageexplorer://v=1&accountid=%2Fsubscriptions%2Fb8871872-a5e3-473f-b9b9-f4baaab6a9a0%2FresourceGroups%2Flivingwithmachines%2Fproviders%2FMicrosoft.Storage%2FstorageAccounts%2Flivingwithmachines&subscriptionid=b8871872-a5e3-473f-b9b9-f4baaab6a9a0&resourcetype=Azure.BlobContainer&resourcename=topo

kmcdono2 commented 1 year ago

(just leaving @fedenanni and @lukehare assigned as they are active on this right now) - @fedenanni when you're ready for @dcsw2 and me to review, just re-assign us! I'm trying to get better at this ;)