Update PlainTextExtractor to just extract text

Currently there is a fair bit of overlap between the PlainTextExtractor and WebPagesExtractor. Really, the only different between them now is the name of the content/text column, and WebPagesExtractor has some additional columns.

I propose that PlainTextExtractor moves to something that is more in the spirit of its name. It should run RemoveHTMLDF, RemoveHTTPHeaderDF, a DataFrame version of ExtractBoilerpipeTextRDD, and output a single column (csv or parquet), or possibly a single text file.

archivesunleashed / aut

Update PlainTextExtractor to just extract text #452