XML Parsing (Drama) - GitHub Issues

ljekersey commented 9 years ago

When we parse plays, we're getting the speaker's name before every line of dialogue (so a very high frequency of abbreviated names), character entrance and exit cues, "act" and "scene" headers, etc. This is probably why we got that large drama cluster in the ECCO test. So we're seeing the document's "format" rather than its content, if that's the right way to phrase this. We're also seeing things like chapter headings in prose works.

How should we address this? Exclude certain tags like "Speaker," "Stage," and "Head"?

Kevin-Damazyn commented 9 years ago

Do we want title page information?

<div type="titlepage">
            <pb facs="2" rend="none"/>
            <p>THE Hospital-Surgeon: OR, A New, Gentle, and Easie Way, to Cure speedily all Sorts of <hi>Wounds,</hi> and other <hi>Di<lb rend="hidden" type="hyphenInWord"/>seases</hi> belonging to SURGERY.</p>
            <p>ALSO, A Discourse on <hi>Discover'd Bones</hi>; and a Way to Dress, after Trepanning, with a new <hi>Instru<lb rend="hidden" type="hyphenInWord"/>ment</hi> invented by the Author.</p>
            <p>In THREE PARTS.</p>
            <list>
               <item>I. The Advantages of this <hi>Way,</hi> and Mischiefs of a contrary Practice propos'd and confirm'd by <hi>Reason</hi> and <hi>Authority.</hi>
               </item>...

It also has a "tothereader" section? "preface"? "tableofcontents" (I think we said we didn’t want this)? I guess instead of saying what we do not want, would it be easier to say what we do want?
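
One way to picture the "say what we do want" option is an allowlist keyed on each `div`'s `type` attribute. This is only a sketch in Python using the standard-library `xml.etree.ElementTree`, not the project's code. The allowlist contents, the `wanted_divs` helper, the `<text>` wrapper, and the "Made-up" sample strings are assumptions for illustration; which types actually belong on the list is the open question.

```python
import xml.etree.ElementTree as ET

# Hypothetical allowlist: which div types to keep is still undecided in this thread.
WANTED_DIV_TYPES = {"preface", "tothereader"}

def wanted_divs(root, wanted=WANTED_DIV_TYPES):
    """Return (type, text) pairs for every <div> whose type is on the allowlist."""
    kept = []
    for div in root.iter("div"):
        div_type = div.get("type")
        if div_type in wanted:
            # itertext() walks all nested text; split/join collapses whitespace.
            text = " ".join("".join(div.itertext()).split())
            kept.append((div_type, text))
    return kept

# Made-up sample; only the div type names and the title come from the comment above.
sample = """<text>
<div type="titlepage"><p>THE Hospital-Surgeon</p></div>
<div type="tableofcontents"><p>Made-up contents entry</p></div>
<div type="preface"><p>Made-up preface text</p></div>
</text>"""

kept = wanted_divs(ET.fromstring(sample))
print(kept)
```

Anything not named on the list, such as the title page and table of contents here, is dropped without having to be listed. If allowlisted divs can be nested inside each other, this sketch would return the inner text twice, so it would need a guard for that case.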

Kevin-Damazyn commented 9 years ago

Well, I meant for that to be XML, but apparently GitHub doesn't display unescaped XML as code, so it looks like plain text. The question is still the same.

mtabor150 commented 9 years ago

You have to escape the tags:

\<head>\</head>

mtabor150 commented 9 years ago

Also, you can wrap the whole text in triple backticks:

\```
text
\```

Remove the backslashes.

Kevin-Damazyn commented 9 years ago

Gotcha. I am trying to get down to a certain tag in the tree and it is pretty hairy. I might be doing something wrong, but I think automating this across x number of texts will significantly slow down the file processing. It is almost like we would have to go to the leaves and enter sentence by sentence. I am 98 percent sure we can skip the , , tags, but I am not sure how right now. To skip them, or just to get to them, we have to go pretty far down the tree.
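
For what it's worth, skipping tags does not have to mean navigating to each leaf by hand. A recursive walk can drop unwanted elements wherever they sit in the tree. The sketch below is in Python with the standard-library `xml.etree.ElementTree`, not the project's own code. The skip list uses the tag names suggested in the first comment (lowercased, which is an assumption), and `collect_text`, the `sp` and `l` tags, and the sample text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical skip list, based on the tags suggested in the first comment.
SKIP_TAGS = {"speaker", "stage", "head"}

def collect_text(elem, skip=SKIP_TAGS):
    """Return the text under elem, leaving out any element whose tag is in skip."""
    if elem.tag in skip:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(collect_text(child, skip))
        # Tail text comes after the child but belongs to the parent, so keep it
        # even when the child itself was skipped.
        parts.append(child.tail or "")
    return "".join(parts)

# Made-up sample in the general shape of a marked-up play.
sample = """<div type="scene">
<head>SCENE I.</head>
<stage>Enter a made-up character.</stage>
<sp><speaker>Spk.</speaker><l>A made-up line of dialogue</l></sp>
</div>"""

text = " ".join(collect_text(ET.fromstring(sample)).split())
print(text)
```

The walk visits each element once and never needs to know how deep a skipped tag is nested. Whether that is fast enough across a whole corpus would have to be measured.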

mtabor150 commented 9 years ago

We just have to incorporate regular expressions for each thing to remove.
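
A rough sketch of the regular-expression route, in Python with the standard `re` module rather than the project's own code. The tag list comes from the first comment (lowercased, an assumption), the `strip_markup` helper and the sample are made up, and the hidden `lb` hyphenation marker is handled because it appears in the title-page snippet earlier in the thread.

```python
import re

# Hypothetical list of elements to drop together with their contents.
SKIP_TAGS = ("speaker", "stage", "head")

# One pattern per tag: opening tag (attributes allowed), shortest body, closing tag.
SKIP_PATTERNS = [
    re.compile(rf"<{tag}\b[^>]*>.*?</{tag}>", re.DOTALL | re.IGNORECASE)
    for tag in SKIP_TAGS
]
# A hidden line break inside a hyphenated word should vanish without leaving a space.
HYPHEN_LB = re.compile(r'<lb\b[^>]*type="hyphenInWord"[^>]*/>')
# Any tag that is still left over.
ANY_TAG = re.compile(r"<[^>]+>")

def strip_markup(xml_text):
    """Drop the skipped elements, rejoin hyphenated words, then remove remaining tags."""
    xml_text = HYPHEN_LB.sub("", xml_text)
    for pattern in SKIP_PATTERNS:
        xml_text = pattern.sub(" ", xml_text)
    plain = ANY_TAG.sub(" ", xml_text)
    return " ".join(plain.split())

# Made-up sample line.
sample = (
    '<sp><speaker>Spk.</speaker><l>A made-up line about '
    '<hi>Di<lb rend="hidden" type="hyphenInWord"/>seases</hi></l></sp>'
    "<stage>Exit.</stage>"
)

cleaned = strip_markup(sample)
print(cleaned)
```

This stays simple as long as the skipped elements are never self-closing and never nested inside an element of the same name. In either case the non-greedy match would stop at the wrong closing tag, and a tree-based approach is likely the safer choice.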

SLU-TMI / TextMining.jl

XML Parsing (Drama) #56