earlyprint / earlyprint.github.io

Homepage for the EarlyPrint Project: Curating and Exploring Early Printed English
https://earlyprint.org/

update #24

Closed martinmueller39 closed 4 years ago

martinmueller39 commented 4 years ago

Here is what I'm planning to do this week: we want to push out a version of TCP Phase 1 on the NU servers and hope that this will be a good enough solution through much of next year. I will create a version of the texts that uses the NUPOS 3 tags. These are renamed to make editing easier. They are not redefined, with the exception of a new 'nnp' tag that captures 'le', 'de', etc. when they are part of names.

Before doing this I want to reread the 50,000 most common combinations of spelling, POS tag, lemma, and regular spelling in texts before 1640, which are basically the STC texts. Roughly speaking, these are spellings that occur at least 250 times in texts before 1640, and they add up to 96% of all tokens. Reading through lists like that without context is an efficient way of catching crude errors and replacing them with corrections that may be right but are certainly less wrong. To put the threshold in perspective: a spelling that occurs 250 times in texts before 1640 has a lower relative frequency than a spelling that occurs once in the Shakespeare canon.
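For readers who want to see what such a review list involves, here is a minimal sketch in Python. It assumes the tokens are already available as (spelling, POS tag, lemma, regular spelling) tuples. The `Token` type, the function names, and the toy data are hypothetical illustrations, not part of the EarlyPrint tooling, and the toy tags other than 'nnp' are placeholders.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Token(NamedTuple):
    """One annotated token (hypothetical shape, for illustration only)."""
    spelling: str
    pos: str
    lemma: str
    reg: str


def review_list(
    tokens: Iterable[Token], min_count: int = 250
) -> Tuple[List[Tuple[Token, int]], float]:
    """Return the combinations seen at least min_count times, most common
    first, together with the share of all tokens they account for."""
    counts = Counter(tokens)
    total = sum(counts.values())
    frequent = [(combo, n) for combo, n in counts.most_common() if n >= min_count]
    covered = sum(n for _, n in frequent)
    coverage = covered / total if total else 0.0
    return frequent, coverage


def relative_frequency(count: int, corpus_size: int) -> float:
    """Occurrences per token; corpus sizes must be supplied by the caller."""
    return count / corpus_size


# Toy data, not real corpus counts: three combinations, six tokens in all.
sample = (
    [Token("loue", "vvb", "love", "love")] * 3
    + [Token("love", "n1", "love", "love")] * 2
    + [Token("de", "nnp", "de", "de")]
)
frequent, coverage = review_list(sample, min_count=2)
for combo, n in frequent:
    print(n, combo.spelling, combo.pos, combo.lemma, combo.reg)
print(f"coverage: {coverage:.0%}")
```

With real data, `min_count=250` would correspond to the threshold described above, and `relative_frequency` could be used to compare a count in the pre-1640 texts against a single occurrence in a smaller corpus once both corpus sizes are known.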

With luck I'll finish this week.

jrladd commented 4 years ago

Will these new texts also replace the current Bitbucket texts?

martinmueller39 commented 4 years ago

Yes

(Sent by email in reply to JR Ladd's comment of Monday, November 11, 2019, 9:23 AM.)

craigberry commented 4 years ago

These texts have been on BitBucket and in the eXist app for some time.