WaxCylinderRevival / frus-dates-project

Project repository for FRUS date extraction and normalization initiative
https://history.state.gov
GNU General Public License v3.0

Add missing `dateline` in frus1920v02 #1008

Closed: WaxCylinderRevival closed this issue 7 years ago

vak2ve commented 7 years ago

@WaxCylinderRevival sorry for missing several of these! Looks like they're all in main divs, not frus:attachments, which may be a pattern you encounter going forward. Could I check that in the later Q3 volumes or would you prefer me to stay out of that batch while you're working on it?

WaxCylinderRevival commented 7 years ago

@vak2ve, no worries! I'll take care of the Q3 volumes.

If you wouldn't mind checking the Q4 batch, that would be great!

vak2ve commented 7 years ago

On it! Is there a strategy or XPath you use to catch the ones I missed or do you go doc by doc?

On Fri, Oct 6, 2017 at 3:22 PM, Amanda Ross notifications@github.com wrote:

> @vak2ve, no worries! I'll take care of the Q3 volumes.
>
> If you wouldn't mind checking the Q4 batch, that would be great!

WaxCylinderRevival commented 7 years ago

@vak2ve, I use an XQuery script to identify potential candidates and then evaluate the flagged docs. I might be able to borrow some of the regex to give you an XPath that might help. Let me see...

WaxCylinderRevival commented 7 years ago

To find dates in the postscripts of historical documents that have no `date` element:

//div[attribute::type='document'][not(attribute::subtype='editorial-note')][not(descendant::date)]//postscript[matches(.,
'\d{1,2}[(st)(nd)(rd)(th)]*\s+(January|February|March|April|May|June|July|August|September|October|November|December),*\s+\d{4}|((January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2}[(st)(nd)(rd)(th)]*,\s+\d{4})')]

[N.B. Edited to add qualifiers above.]

@vak2ve
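For anyone who wants to try the date regex outside oXygen, here is a minimal Python sketch of the same pattern. It assumes Python's `re` treats this particular pattern the same way XPath's `matches()` does, which should hold for the constructs used here; the helper name `has_date_candidate` and the sample strings are made up for illustration. One thing it makes visible: `[(st)(nd)(rd)(th)]*` is a character class (any run of the characters `(`, `)`, `s`, `t`, `n`, `d`, `r`, `h`), not an alternation of suffixes, so it is looser than it looks.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Copied from the XPath above. Note: this is a character class, not an
# alternation, so it accepts any run of the characters ( ) s t n d r h.
SUFFIX = r"[(st)(nd)(rd)(th)]*"

# Day-month-year OR month-day-year, assembled from parts for readability.
DATE_PATTERN = re.compile(
    rf"\d{{1,2}}{SUFFIX}\s+({MONTHS}),*\s+\d{{4}}"
    rf"|(({MONTHS})\s+\d{{1,2}}{SUFFIX},\s+\d{{4}})"
)


def has_date_candidate(text: str) -> bool:
    """Return True if the text contains a date the pattern would flag."""
    # re.search is unanchored, like XPath matches().
    return DATE_PATTERN.search(text) is not None


# Hypothetical sample strings, not taken from any FRUS volume.
samples = [
    "Washington, 6th October, 1920",
    "October 6, 1920",
    "February 2d, 1920",
    "October 6 1920",
    "October 1920",
]

for sample in samples:
    print(sample, "->", has_date_candidate(sample))
```

In this sketch the first three samples are flagged and the last two are not: the month-first branch requires a comma after the day, and a month followed only by a year is ignored. The loose character class is also why a form such as "2d" is accepted.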

WaxCylinderRevival commented 7 years ago

To find dates in the last paragraphs of historical documents that have no `date` element:

//div[attribute::type='document'][not(attribute::subtype='editorial-note')][not(descendant::date)]//p[last()][matches(.,
'\d{1,2}[(st)(nd)(rd)(th)]*\s+(January|February|March|April|May|June|July|August|September|October|November|December),*\s+\d{4}|((January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2}[(st)(nd)(rd)(th)]*,\s+\d{4})')]

[N.B. Edited to add `not(descendant::date)`.]

@vak2ve
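A rough Python emulation of this second XPath may help show what it selects. This is only a sketch under assumptions: the sample XML is invented and un-namespaced (real FRUS volumes are TEI), the `n` attribute stands in for a document identifier, and `last_paragraph_candidates` is a hypothetical helper. It mirrors the fact that `//p[last()]` selects the last `p` child of each parent element, so a final `p` inside a nested element can be flagged as well as the document's closing paragraph.

```python
import re
import xml.etree.ElementTree as ET

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
SUFFIX = r"[(st)(nd)(rd)(th)]*"  # character class copied from the XPath
DATE_PATTERN = re.compile(
    rf"\d{{1,2}}{SUFFIX}\s+({MONTHS}),*\s+\d{{4}}"
    rf"|(({MONTHS})\s+\d{{1,2}}{SUFFIX},\s+\d{{4}})"
)


def last_paragraph_candidates(root):
    """Yield (document n, paragraph text) pairs resembling the XPath's hits."""
    for div in root.iter("div"):
        if div.get("type") != "document":
            continue
        if div.get("subtype") == "editorial-note":
            continue
        if div.find(".//date") is not None:  # not(descendant::date)
            continue
        # //p[last()]: the last p child of EACH element at or below the div.
        for parent in div.iter():
            paragraphs = parent.findall("p")
            if not paragraphs:
                continue
            text = "".join(paragraphs[-1].itertext())
            if DATE_PATTERN.search(text):
                yield div.get("n"), text


# Invented sample data, simplified for illustration.
SAMPLE = """
<body>
  <div type="document" n="1">
    <p>I have the honor to report the following.</p>
    <p>Paris, October 6, 1920.</p>
  </div>
  <div type="document" n="2">
    <dateline><date>1920-10-07</date></dateline>
    <p>London, 7th October, 1920.</p>
  </div>
  <div type="document" n="3">
    <quote><p>Signed at Rome, 3 May, 1919.</p></quote>
    <p>No date appears in this closing paragraph.</p>
  </div>
</body>
"""

hits = list(last_paragraph_candidates(ET.fromstring(SAMPLE)))
for doc_n, text in hits:
    print(doc_n, text)
```

Here document 2 is skipped because it already contains a `date`, while document 3 is flagged through the `p` inside `quote`, even though its own closing paragraph has no date.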

WaxCylinderRevival commented 7 years ago

@vak2ve, apologies for the many edits, but I think these two XPaths are now qualified enough to be helpful when used via the XPath/XQuery Builder in oXygen XML Editor.

vak2ve commented 7 years ago

@WaxCylinderRevival thank you so much--that second one in particular will be really helpful. Postscripts and frus:attachments are usually pretty straightforward but it's those last paragraphs of random docs that slip by me!

WaxCylinderRevival commented 7 years ago

@vak2ve, if you'd like to experiment with these regexes in `frus:attachment` elements, you may wish to try:

To find date candidates in the last paragraphs of attachments that have no `date` element:

//div[attribute::type='document'][not(attribute::subtype='editorial-note')]//*[local-name()='attachment'][not(descendant::date)]//p[last()][matches(.,
'\d{1,2}[(st)(nd)(rd)(th)]*\s+(January|February|March|April|May|June|July|August|September|October|November|December),*\s+\d{4}|((January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2}[(st)(nd)(rd)(th)]*,\s+\d{4})')]

To find date candidates in the postscripts of attachments that have no `date` element:

//div[attribute::type='document'][not(attribute::subtype='editorial-note')]//*[local-name()='attachment'][not(descendant::date)]//postscript[matches(.,
'\d{1,2}[(st)(nd)(rd)(th)]*\s+(January|February|March|April|May|June|July|August|September|October|November|December),*\s+\d{4}|((January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2}[(st)(nd)(rd)(th)]*,\s+\d{4})')]
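
The attachment variants differ from the earlier XPaths in two ways: the element is matched by `local-name()`, so its namespace prefix does not matter, and the `not(descendant::date)` test applies to the attachment rather than to the whole document. A minimal Python sketch of the postscript variant, assuming invented sample data, a namespace URI for the `frus` prefix that is a guess, and hypothetical helper names:

```python
import re
import xml.etree.ElementTree as ET

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
SUFFIX = r"[(st)(nd)(rd)(th)]*"  # character class copied from the XPath
DATE_PATTERN = re.compile(
    rf"\d{{1,2}}{SUFFIX}\s+({MONTHS}),*\s+\d{{4}}"
    rf"|(({MONTHS})\s+\d{{1,2}}{SUFFIX},\s+\d{{4}})"
)


def local_name(tag: str) -> str:
    """Strip ElementTree's '{namespace}' prefix, like XPath local-name()."""
    return tag.rsplit("}", 1)[-1]


def attachment_postscript_candidates(root):
    """Yield (document n, postscript text) for undated attachments."""
    for div in root.iter("div"):
        if div.get("type") != "document":
            continue
        if div.get("subtype") == "editorial-note":
            continue
        for element in div.iter():
            if local_name(element.tag) != "attachment":
                continue
            # The date test is scoped to the attachment, not the document.
            if element.find(".//date") is not None:
                continue
            # Nested attachments could repeat a hit here; XPath would not.
            for postscript in element.iter("postscript"):
                text = "".join(postscript.itertext())
                if DATE_PATTERN.search(text):
                    yield div.get("n"), text


# Invented sample; the namespace URI is an assumption for illustration.
SAMPLE = """
<body xmlns:frus="http://history.state.gov/frus/ns/1.0">
  <div type="document" n="10">
    <dateline><date>1920-11-01</date></dateline>
    <p>Covering despatch.</p>
    <frus:attachment>
      <p>Enclosed note.</p>
      <postscript><p>Tokyo, November 3, 1920.</p></postscript>
    </frus:attachment>
    <frus:attachment>
      <dateline><date>1920-11-04</date></dateline>
      <postscript><p>Received November 5, 1920.</p></postscript>
    </frus:attachment>
  </div>
</body>
"""

hits = list(attachment_postscript_candidates(ET.fromstring(SAMPLE)))
for doc_n, text in hits:
    print(doc_n, text)
```

In this sample the document itself is dated, yet its first attachment is still flagged because that attachment has no `date` of its own; the second attachment is skipped.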

WaxCylinderRevival commented 7 years ago

@vak2ve, I added a page for useful XPaths on the Wiki: https://github.com/WaxCylinderRevival/frus-dates-project/wiki/Useful-XPaths

vak2ve commented 7 years ago

This is a fantastic resource! Thank you so much for putting it together--I'll put these XPaths into practice immediately, and double-check the other Q4 volumes with them too.

WaxCylinderRevival commented 7 years ago

Feel free to add to the page, if you have XPath tools you use!