Jace84 / congressional-record

A parser for the Congressional Record.
BSD 3-Clause "New" or "Revised" License

ASCII Control Characters in XML, a few other random issues #8

Closed by njkelly 8 years ago

njkelly commented 8 years ago

There are numerous ASCII control characters scattered throughout the speeches. I deleted them with this R code:

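# Assign the result back; gsub() returns a cleaned copy and does not modify newtext in place.
newtext <- gsub(pattern = "[[:cntrl:]]|</a>|</?sp |<greek-[[:alpha:]]>|\\[<<gosudarevoye delo>>\\]", replacement = "", x = newtext)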

This code also removes a few Greek letters that are incorrectly read as tags in the XML, as well as a stray tag that showed up just once. I deleted all of these, but if there is a way to accurately capture their meaning, replacing them with the appropriate characters would be better.
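A minimal sketch of the replacement approach suggested above, in R. The `<greek-X>` tag shape is taken from the regex in this comment. The lookup table `greek_map`, the function name `clean_speech`, and the letter-to-character mapping are hypothetical and would need to be checked against the source documents. The sketch also turns whitespace control characters into a space before deleting the rest, since `[[:cntrl:]]` matches tabs and newlines and deleting them outright can fuse adjacent words.

```r
# Hypothetical lookup table: tag letter -> Greek character.
# The mapping below is an assumption, not taken from the Congressional Record.
greek_map <- c(a = "\u03b1", b = "\u03b2", g = "\u03b3", d = "\u03b4")

clean_speech <- function(text, map = greek_map) {
  # Replace each known <greek-X> tag with its mapped character.
  for (letter in names(map)) {
    text <- gsub(paste0("<greek-", letter, ">"), map[[letter]], text, fixed = TRUE)
  }
  # Turn runs of whitespace control characters (tab, newline, etc.) into
  # a single space so that adjacent words are not fused together.
  text <- gsub("[\t\n\r\f\v]+", " ", text)
  # Delete the remaining control characters and any unmapped <greek-X> tags.
  gsub("[[:cntrl:]]|<greek-[[:alpha:]]>", "", text)
}
```

Calling `clean_speech(newtext)` would keep the mapped letters as real characters and drop only the tags that have no entry in the table, so unmapped tags can be added to `greek_map` as their meaning is confirmed.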

njkelly commented 8 years ago

This appears to remain unresolved.