There are numerous ASCII control characters scattered throughout the speeches. I simply deleted them with the following code:
gsub(pattern="[[:cntrl:]]|</a>|</?sp |<greek-[[:alpha:]]>|\\[<<gosudarevoye delo>>\\]", replacement="", newtext)
This code also removes a few Greek letters that are incorrectly read as tags in the XML, as well as a stray tag that showed up just once. All of these were deleted, though if there is a way to accurately capture their meaning, replacing them with the appropriate characters would be better.
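One possible way to keep the Greek letters instead of deleting them is to substitute each tag with its Unicode character before the general cleanup runs. The following is only a rough sketch: the lookup table `greek_map`, the helper `replace_greek_tags`, and the guess that `<greek-a>` stands for alpha and `<greek-b>` for beta are all assumptions made for illustration, not something confirmed by the source XML.

```r
# Hypothetical lookup table. Which Greek letter each tag stands for is an
# assumption and would need to be checked against the original XML.
greek_map <- c(
  "<greek-a>" = "\u03b1",  # assumed: alpha
  "<greek-b>" = "\u03b2"   # assumed: beta
)

# Hypothetical helper: swap known tags for their characters, then drop
# any remaining <greek-x> tags that have no entry in the map.
replace_greek_tags <- function(text, map = greek_map) {
  for (tag in names(map)) {
    # fixed = TRUE treats the tag as a literal string, not a regex
    text <- gsub(tag, map[[tag]], text, fixed = TRUE)
  }
  gsub("<greek-[[:alpha:]]>", "", text)
}

sample_text <- "the angle <greek-a> and <greek-b>, but not <greek-z>"
cleaned <- replace_greek_tags(sample_text)
```

Running this step before the `gsub` call above would mean the `<greek-[[:alpha:]]>` part of that pattern only catches tags that are missing from the map.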