earlyprint / earlyprint.github.io

Homepage for the EarlyPrint Project: Curating and Exploring Early Printed English
https://earlyprint.org/

word fragments in texts before 1550 #23

Open martinmueller39 opened 4 years ago

martinmueller39 commented 4 years ago

There are a lot of word-division problems in the TCP corpus, but they get worse as you move back in time and are quite bad for texts before 1550. Things like 'de fect' or 'po wer'. I fix them as I come across them and sometimes chase down patterns. Are there heuristics for going about this in a more coordinated manner? I'm skeptical but willing to try things out. One very brute-force method would be to run through the entire corpus, concatenate each word with the following word, and look for results that match the most common 100,000 or 250,000 spellings. There may be refinements to catch obvious garbage before the output, but it may take less time to throw out the garbage afterwards.
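A minimal sketch of such a pass, assuming the texts are available as plain tokenized files in a corpus/ directory and that a frequency-ranked spelling list exists; "corpus/" and "common_spellings.txt" are placeholder names, not actual project files:

```python
# Sketch only: flag adjacent token pairs whose concatenation matches a known
# spelling. "corpus/" and "common_spellings.txt" are hypothetical names for
# plain-text versions of the texts and a frequency-ranked spelling list.
from pathlib import Path

def load_spellings(path, limit=250_000):
    """Read the first `limit` spellings (one per line) into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for _, line in zip(range(limit), f)}

def candidate_joins(tokens, spellings):
    """Yield (left, right, joined) where the concatenation is a known spelling."""
    for left, right in zip(tokens, tokens[1:]):
        if not (left.isalpha() and right.isalpha()):
            continue
        joined = (left + right).lower()
        if joined in spellings:
            yield left, right, joined

if __name__ == "__main__":
    spellings = load_spellings("common_spellings.txt")
    for path in sorted(Path("corpus").glob("*.txt")):
        tokens = path.read_text(encoding="utf-8").split()
        for left, right, joined in candidate_joins(tokens, spellings):
            print(f"{path.name}\t{left} {right}\t{joined}")
```

The output is only a candidate list; as the replies below point out, many hits ('her self', 'a while') are legitimate two-word spellings, so the garbage still has to be thrown out afterwards.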

One could let these sleeping dogs lie, but each 'dog' causes two errors, and in the much sparser corpus of the late 15th and early 16th century the total errors add up to non-trivial disruption.

I'll be grateful for any advice.

pibburns commented 4 years ago

You can improve the brute force method you suggest by dividing the words in the text into sentences. Then, determine the principal language of each sentence. For English (or predominantly English) sentences, you create a running list of 2-grams and 3-grams and join the n-grams into potential words. You then compare these with known good words.

For texts in which the word IDs are in reading order, so that text in jump tags is easily moved out of the way as needed, extracting sentences is fairly straightforward. I have code that does that, and also code which tries to determine the main language(s) of a sentence. This is more difficult when the word IDs do not reflect the reading order.

Implementing the dictionary/lexicon lookups for the n-grams is also straightforward.

That still leaves a lot of manual review to ensure any automatic joins make sense.
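A rough sketch of that pipeline, with a deliberately crude stand-in for the language check (the share of tokens found in an English lexicon); the naive sentence split, "english_lexicon.txt", and "sample_text.txt" are placeholders, not the project's actual code:

```python
# Sketch only: restrict candidate n-gram joins to predominantly English
# sentences. Sentence splitting and language identification here are crude
# placeholders for the real routines.
import re

def sentences(text):
    """Very naive sentence segmentation on ., ?, and !."""
    return [s.split() for s in re.split(r"[.?!]+", text) if s.strip()]

def is_mostly_english(tokens, lexicon, threshold=0.6):
    """Call a sentence English if enough of its tokens are in the lexicon."""
    hits = sum(1 for t in tokens if t.lower() in lexicon)
    return bool(tokens) and hits / len(tokens) >= threshold

def ngram_joins(tokens, lexicon, max_n=3):
    """Yield 2-grams and 3-grams whose concatenation is a known good word."""
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            joined = "".join(tokens[i:i + n]).lower()
            if joined in lexicon:
                yield tokens[i:i + n], joined

with open("english_lexicon.txt", encoding="utf-8") as f:
    lexicon = {line.strip().lower() for line in f if line.strip()}

text = open("sample_text.txt", encoding="utf-8").read()
for sent in sentences(text):
    if is_mostly_english(sent, lexicon):
        for parts, joined in ngram_joins(sent, lexicon):
            print(" ".join(parts), "->", joined)
```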


jfloewen commented 4 years ago

That concatenation solution is quite ingenious, but ripe for error. I can think of a lot of modern words that are frequently treated as two words well into the late 16th C.:

'no where', 'here with', 'there to', 'whom ever', 'a while', 'her self', 'for ever', etc.

I suspect that leaving these as two words presents difficulties for MorphAdorner, but I think we should keep them separate until they are historically concatenated.


martinmueller39 commented 4 years ago

Actually, MorphAdorner treats some of these frozen phrases as single tokens but maintains the orthographic representation as separate words. If you look for the history of reflexive pronouns, it gives you all the data right away, and you can then determine whether or how the presence or absence of a space before the ‘self’ suffix expresses a perception of the phrase as one or two words. The reflexive pronouns are especially interesting: they can be analyzed as a noun modified by a possessive pronoun. But this doesn’t work for ‘themselves’.


martinmueller39 commented 4 years ago

In most cases a split word will be a token that occurs elsewhere in the text. So I wonder whether it would be better to do something like the following:

  1. Run through the text and make a word list
  2. Create bigrams (and perhaps trigrams) and check them against the word list of the text
  3. Don’t use bigrams that cross element boundaries

The third will miss some oddities but will also serve as a junk filter. I have come across a few cases where transcribers put a between the two parts of a word wrongly split by the printer. It's also possible to use a word list that only has items above some frequency cutoff.

I’ll play around with this a little.
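A sketch of that per-text procedure against TEI-style <w> elements, assuming lxml and the standard TEI namespace; the file name, the "same parent element" test for boundaries, and the frequency cutoff are all assumptions to adjust against the actual EarlyPrint markup:

```python
# Sketch only: build the word list from the text itself, then flag adjacent
# <w> pairs whose concatenation already occurs in that text as a single token.
# Pairs whose <w> elements have different parents are skipped as a stand-in
# for "don't use bigrams that cross element boundaries".
from collections import Counter
import lxml.etree as etree

TEI = "{http://www.tei-c.org/ns/1.0}"

def candidate_joins(path, min_count=1):
    tree = etree.parse(path)
    pairs = [(w.text or "", w.getparent()) for w in tree.iter(TEI + "w")]
    counts = Counter(w.lower() for w, _ in pairs)   # word list of this text
    for (w1, p1), (w2, p2) in zip(pairs, pairs[1:]):
        if p1 is not p2:                            # crosses an element boundary
            continue
        joined = (w1 + w2).lower()
        if counts.get(joined, 0) >= min_count:      # raise min_count for a frequency cutoff
            yield w1, w2, joined, counts[joined]

for w1, w2, joined, n in candidate_joins("some_text.xml"):
    print(f"{w1} {w2} -> {joined} ({n}x elsewhere in this text)")
```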


jrladd commented 4 years ago

The processes that @martinmueller39 and @pibburns propose make good sense to me.

To deal with the issue that @jfloewen brings up (in case MorphAdorner hasn't already handled these), I suggest running through the text and looking only at "words" with unlikely English character sequences, e.g. "po," "wer," and "fect." Then we could look around those character sequences for joins that make valid words. This is just a modification of @pibburns's method that would focus us on the most egregious examples and possibly prevent overcorrection.
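A sketch of that filter, scoring each token by character-bigram probabilities learned from a known-good word list; "good_words.txt", "sample_text.txt", and the threshold value are assumptions that would need tuning on real fragments:

```python
# Sketch only: flag tokens with unlikely character sequences, then try joining
# them with a neighbour and checking the result against the lexicon.
import math
from collections import Counter

def train_char_bigrams(words):
    """Estimate character-bigram probabilities from known-good words."""
    counts = Counter()
    for w in words:
        padded = f"^{w.lower()}$"
        counts.update(zip(padded, padded[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def avg_log_prob(token, model, floor=1e-7):
    """Average log-probability of the token's character bigrams."""
    padded = f"^{token.lower()}$"
    bigrams = list(zip(padded, padded[1:]))
    return sum(math.log(model.get(bg, floor)) for bg in bigrams) / len(bigrams)

with open("good_words.txt", encoding="utf-8") as f:
    good = [line.strip() for line in f if line.strip()]
model = train_char_bigrams(good)
lexicon = {w.lower() for w in good}

tokens = open("sample_text.txt", encoding="utf-8").read().split()
THRESHOLD = -6.0   # assumption: tune on known fragments such as "po", "wer", "fect"
for i, tok in enumerate(tokens):
    if tok.isalpha() and avg_log_prob(tok, model) < THRESHOLD:
        for j in (i - 1, i + 1):                    # look at each neighbour
            if 0 <= j < len(tokens):
                joined = (tokens[min(i, j)] + tokens[max(i, j)]).lower()
                if joined in lexicon:
                    print(tok, "+", tokens[j], "->", joined)
```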

dknoxwu commented 4 years ago

I've been meaning to ask about something related to this. If I recall correctly, this sounds like the sort of thing that the 2013 MorphAdorner calculated when handling end-of-line hyphenation. I have seen examples where spaces that don't belong match up with line breaks that we would expect to be hyphenated. Is there any leverage in MorphAdorner's existing routines here? Maybe not, when the transcript didn't observe a line break, but in any case it might still be useful to know and document if we think that unobserved line breaks are likely the primary cause of the spaces that need correcting in these earlier texts. For example, the TCP transcription of the text below used vertical pipes for "spe|ciall" and "lord|shyp", but transcribed "great ly" as two words. [attached page image]

pibburns commented 4 years ago


Way back when, when processing raw TCP texts, I just removed these vertical bars. That did not introduce any spaces, so something like lord|shyp was transformed to lordshyp before further processing.
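A minimal illustration of that step, on an invented snippet containing the forms mentioned above:

```python
# Sketch only: drop the TCP vertical bars before tokenization, so the two
# halves of a word rejoin without introducing a space. The snippet is invented;
# whether the bar ever sits at a real line end in a given transcription is
# worth checking before applying this corpus-wide.
raw = "by the spe|ciall fauour of your good lord|shyp"
clean = raw.replace("|", "")
print(clean)   # by the speciall fauour of your good lordshyp
```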


martinmueller39 commented 4 years ago

Doug is right. Many wrongly split words are the result of words wrapped around lines without any explicit hyphenation. I'm not sure whether this has practical consequences, because the evidence in most cases is overwhelming and doesn't require checking the image. Not unlike the whistleblower, whose evidence has now been multiply confirmed...

This is a good example where increased computing power changes the cost/benefit analysis of how to go about finding the culprits. I could spend hours or days trying to identify the likely culprits ahead of time, or I could spend less than an hour writing a script that will take between 6 and 10 hours to run on my machine and will generate a list of all concatenated bigrams that have a match in the 250,000 or 500,000 most common spellings. I think I'll do something like that. There is also the desirable effect of capturing lexical items that appear as both single and double tokens ('to fore', 'vn to', etc.).



dknoxwu commented 4 years ago

I had been thinking of the functionality in MorphAdorner around FindSoftHyphens, ExtractSoftHyphens, FixWordBreaks, etc., which decided when hyphens should or should not be dropped depending on quantitative observational evidence in the corpus.

The fragments around an unhyphenated word break will look similar to a reader, but I understand that in these cases there's nothing in the transcription to distinguish a space that was a line break from any other kind of space.

Scripting the investigation you describe makes good sense, and I can imagine it will be helpful to document the changes and save the data when you determine what works best.


martinmueller39 commented 4 years ago

I hadn't thought of those. You know MorphAdorner better than I do... I agree that it will be useful to turn the raw data from my proposed fishing trip into a proper report.



pibburns commented 4 years ago

martinmueller39 wrote on 11/14/2019 11:41 AM (on wrongly split words resulting from line wraps without explicit hyphenation):

If these were marked with vertical bars, this should not have happened. I should look back and try to reproduce the problem. I expect it won't recur with another collection, but I'd like to fix it anyway.

If the words were split at an end of line with no marker, that's not an easy thing to fix.
