PPH3 / Eldritch

Structural Mark-up #4

Open mjb232 opened 8 years ago

mjb232 commented 8 years ago

Alright Lovecraft, you're killing me. First off, the texts use the wrong type of quotation marks. Not only would you have to find and replace the opening curly quotes (just copy and paste them into the find window), you would also have to replace the closing curly quotes. Apparently they are two different characters. While we need to change these quotes, I'd hold off on doing it right away. That's a very distinctive element of the text, and thus makes a really good foothold for regEx!

On top of this there are instances, like Pat mentioned in the meeting, where Lovecraft doesn't finish his quotes! Why you do this to me Lovecraft?

This makes marking up quotations with regEx rather difficult. This raises the question of whether we should just read through the whole thing and mark it up manually. While this is plausible with Call and Shadows, it could prove time-consuming with Mountains (I won't even bring up Charles Dexter Ward).

Whether or not we do this manually or using regEx, I think we definitely need to mark which quotes are "broken" and which are not. A simple type="broken" or type="closed" attribute could accomplish this (I don't know if there's a TEI tag for something like this. Perhaps something @etj27 could look up?).
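A rough sketch of how the broken/closed marking could be automated, in Python rather than in the editor's find-and-replace. The element name `q`, the function name `tag_quotes`, and the rule that a "broken" quote runs until the next opening quote or the end of the paragraph are all assumptions for illustration, not something we have agreed on:

```python
import re

OPEN, CLOSE = "\u201c", "\u201d"  # left and right curly double quotes

# An opening quote, then any text containing no curly quotes,
# then an optional closing quote.
QUOTE = re.compile(rf"{OPEN}([^{OPEN}{CLOSE}]*)({CLOSE})?")

def tag_quotes(paragraph):
    """Wrap each quotation in a hypothetical <q type="closed|broken"> element."""
    def wrap(match):
        # group(2) is only set when a closing curly quote was found
        kind = "closed" if match.group(2) else "broken"
        return f'<q type="{kind}">{match.group(0)}</q>'
    return QUOTE.sub(wrap, paragraph)

sample = "\u201cHello,\u201d he said. \u201cI never finish"
print(tag_quotes(sample))
```

Anything tagged `broken` would still need a human to check it, since a quotation that carries on into the next paragraph can legitimately lack a closing quote.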

There are also some issues with the paragraphs in the text. We've been working with texts that have had visible, blank lines between each paragraph. If @PPH3 is having issues marking up the paragraphs, text me and I can help you sort it out. It's not too terribly hard, just sort of tricky.

ebeshero commented 8 years ago

@mjb232 @PPH3 @etj27 Actually the use of a different curly quote for start vs. end is (usually) good: You really want that to help signal the distinction between the start vs. the end of a spoken passage, and ideally the best kind of digital edition preserves those distinctions. So let's not change that, but use it as something to make the regex work easier!

What's more problematic, as you notice here, Matt, is the lack of end quotes. That could be something sloppy about the source text you're working with, so you want to compare to another text if you can. Did you take a look at the texts available on this Lovecraft GitHub repo: https://github.com/nathanielksmith/lovecraftcorpus ? That's one of the hits that turned up when I ran a search on "Lovecraft" in all the GitHub repos.

ebeshero commented 8 years ago

@mjb232 @PPH3 @etj27 I decided to take a look at "The Call of Cthulhu" on that alternate Lovecraft GitHub repo I've been telling you about: https://raw.githubusercontent.com/nathanielksmith/lovecraftcorpus/master/cthulhu.txt I downloaded it and surveyed the quotation marks. They are all straight quotes in this text, the kind we use in coding that don't distinguish a curly left from a curly right quote. But there is an even number of quotation marks (50 is my count), and it looks to me as if every quote is finished. I could be wrong about that, though, so Matt--take a look and see if you think this text is more accurate in closing all the quotes. Those curly quotes are nice to have, but when they don't close, that's a big problem--if you can find a text where all the quotes are at least complete, that is going to be a LOT easier to work with.
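One way to double-check a count like this is sketched below in Python. An even total does not by itself prove that every quote closes (two unfinished quotes would also give an even number), so this sketch also looks at each paragraph separately. The function name and the assumption that paragraphs are separated by blank lines are mine for illustration; adjust the split if the file is laid out differently:

```python
import re

def unbalanced_paragraphs(text):
    """Return (index, paragraph) pairs that contain an odd number of straight quotes."""
    # Assumes paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text)
    return [(i, p) for i, p in enumerate(paragraphs) if p.count('"') % 2 == 1]

sample = 'He said "hi".\n\n"Unfinished quote\n\nNo quotes.'
print(sample.count('"'))              # total number of straight quotes
print(unbalanced_paragraphs(sample))  # paragraphs worth checking by eye
```

A paragraph flagged here is not necessarily an error, because speech that continues into the next paragraph is often printed without a closing quote, but the list should be short enough to review by hand.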

The issue here may not be Lovecraft at all, but simply typography conventions and starting with a particular edition that didn't have a high production standard. The question I have is whether the texts in Nathaniel Smith's repo are a little cleaner and better to use. You should take a look and compare.

mjb232 commented 8 years ago

@PPH3 Here's the regEx I used to grab the paragraphs. To make things more readable, make sure you have line wrap toggled on, which is CTRL+SHIFT+Y on PC.

Search: \n\s{3,}([A-Z])

Replace: </p>\n\n<p>\1
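For anyone who would rather run this outside the editor, here is a minimal Python sketch of the same search and replace. The function name `mark_paragraphs` is hypothetical, and wrapping the whole result in an opening `<p>` and closing `</p>` is an assumption on my part, since the find-and-replace alone leaves the first and last tags to be added by hand:

```python
import re

def mark_paragraphs(text):
    """Apply the search/replace above, then add the outermost <p> and </p>."""
    # A newline, three or more whitespace characters, then a capital letter
    # marks the start of a new paragraph; \1 puts the capital letter back.
    marked = re.sub(r"\n\s{3,}([A-Z])", r"</p>\n\n<p>\1", text.strip())
    return "<p>" + marked + "</p>"

sample = "First paragraph.\n   Second paragraph.\n   Third one."
print(mark_paragraphs(sample))
```

Because the pattern requires a capital letter, a paragraph that opens with a quotation mark will not be split by this regex, so those are worth a separate pass.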