Open ljekersey opened 9 years ago
Do we want title page information?
<div type="titlepage">
<pb facs="2" rend="none"/>
<p>THE Hospital-Surgeon: OR, A New, Gentle, and Easie Way, to Cure speedily all Sorts of <hi>Wounds,</hi> and other <hi>Di<lb rend="hidden" type="hyphenInWord"/>seases</hi> belonging to SURGERY.</p>
<p>ALSO, A Discourse on <hi>Discover'd Bones</hi>; and a Way to Dress, after Trepanning, with a new <hi>Instru<lb rend="hidden" type="hyphenInWord"/>ment</hi> invented by the Author.</p>
<p>In THREE PARTS.</p>
<list>
<item>I. The Advantages of this <hi>Way,</hi> and Mischiefs of a contrary Practice propos'd and confirm'd by <hi>Reason</hi> and <hi>Authority.</hi>
</item>...
It also has a "tothereader" section, a "preface", and a "tableofcontents" (I think we said we didn’t want this one). Do we want any of those? Instead of listing what we do not want, would it be easier to say what we do want?
Well, I meant for that to be XML, but apparently GitHub doesn't display raw XML tags, so it looks like plain text. The question is still the same.
You have to escape tags, like this:
\<head>\</head>
Also, you can wrap the whole text in triple backticks:
\```
text
\```
(Remove the backslashes when you actually type it.)
Gotcha. I am trying to get down to a certain tag in the tree and it is pretty hairy. I might be doing something wrong, but I think automating this across x number of texts will significantly slow down the file processing. It is almost as if we would have to go down to the leaves and process the text sentence by sentence.
I am 98 percent sure we can skip the , tags, but I am not sure how right now. To skip them or just to get to them, we have to go pretty far down the tree.
We just have to incorporate regular expressions for each thing to remove.
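A minimal sketch of that regex idea, assuming the elements to drop are `speaker`, `stage`, and `head` (the names floated in this thread). The function name `strip_elements` and the list `TAGS_TO_DROP` are hypothetical, not part of any existing script. It removes each listed element together with its content, then strips whatever markup is left:

```python
import re

# Hypothetical list of element names to drop; adjust to what we decide we want.
TAGS_TO_DROP = ["speaker", "stage", "head"]

def strip_elements(xml_text, tags=TAGS_TO_DROP):
    """Remove each listed element with its content, then strip remaining tags."""
    for tag in tags:
        # Non-greedy match from the opening tag to the first matching close tag.
        pattern = re.compile(
            r"<{0}\b[^>]*>.*?</{0}>".format(re.escape(tag)), re.DOTALL
        )
        xml_text = pattern.sub(" ", xml_text)
    # Drop any remaining markup, keeping the text between tags.
    text = re.sub(r"<[^>]+>", "", xml_text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

sample = "<sp><speaker>Ham.</speaker><l>To be, or not to be</l><stage>Exit.</stage></sp>"
print(strip_elements(sample))
```

Regular expressions do not understand nesting, so this would break if one of the dropped elements can contain another element of the same name. It does avoid walking the tree, which may matter for processing speed.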
When we parse plays, we're getting the speaker's name before every line of dialogue (so a very high frequency of abbreviated names), character entrance and exit cues, "act" and "scene" headers, etc. This is probably why we got that large drama cluster in the ECCO test. We're seeing the document's "format" rather than its content, if that's the right way to phrase this. The same thing happens with chapter headings in prose works.
How should we address this? Exclude certain tags like "Speaker," "Stage," and "Head"?
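One possible way to exclude those elements, sketched with Python's standard `xml.etree.ElementTree` rather than regular expressions. The set `SKIP` and the helper names are assumptions for illustration. The recursion skips a whole subtree as soon as it sees an excluded name, so there is no need to address each leaf by hand:

```python
import xml.etree.ElementTree as ET

# Assumed element names to exclude, taken from the question above.
SKIP = {"speaker", "stage", "head"}

def local_name(tag):
    # Drop a namespace prefix such as "{http://www.tei-c.org/ns/1.0}head".
    return tag.rsplit("}", 1)[-1]

def collect_text(elem, skip=SKIP):
    """Return the text pieces under elem, skipping any element named in skip."""
    parts = []
    if local_name(elem.tag) not in skip:
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            parts.extend(collect_text(child, skip))
    # Tail text belongs to the parent, so keep it even when elem is skipped.
    if elem.tail:
        parts.append(elem.tail)
    return parts

def extract_text(xml_text, skip=SKIP):
    root = ET.fromstring(xml_text)
    # Join pieces without separators so words split by <lb/> stay whole.
    return " ".join("".join(collect_text(root, skip)).split())

play = (
    "<div><head>ACT I.</head><sp><speaker>Ham.</speaker>"
    "<l>To be, or not to be</l> <stage>Exit.</stage></sp></div>"
)
print(extract_text(play))
```

Because the pieces are joined without separators, a hidden line break such as `Di<lb rend="hidden" type="hyphenInWord"/>seases` comes out as the single word `Diseases`. This requires well-formed XML; it will raise a parse error on a truncated file.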