Currently there is a fair bit of overlap between the PlainTextExtractor and WebPagesExtractor. Really, the only difference between them now is the name of the content/text column, and that WebPagesExtractor has some additional columns.
I propose that PlainTextExtractor moves to something that is more in the spirit of its name. It should run RemoveHTMLDF, RemoveHTTPHeaderDF, and a DataFrame version of ExtractBoilerpipeTextRDD, then output a single column (CSV or Parquet), or possibly a single text file.
Yes, that's a great idea @ruebot - I think that's more in the spirit of its name, and you could imagine it fitting into a pipeline through to text analysis better than WebPagesExtractor does.
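
For anyone following along, here is a rough, standalone sketch of the shape of the proposed pipeline: strip the HTTP header, strip the HTML, and emit a single text column. It is not the toolkit's code. It uses only the Python standard library, the function names (`remove_http_header`, `remove_html`, `to_single_column_csv`) and the `content` column name are made up for illustration, and the boilerplate-removal step is left out entirely.

```python
import csv
import io
import re
from html.parser import HTMLParser


def remove_http_header(raw: str) -> str:
    """Hypothetical stand-in for the header-removal step.

    If the record starts with an HTTP status line, drop everything up to
    and including the first blank line; otherwise return it unchanged.
    """
    if not raw.startswith("HTTP/"):
        return raw
    parts = re.split(r"\r?\n\r?\n", raw, maxsplit=1)
    return parts[1] if len(parts) == 2 else ""


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def remove_html(html: str) -> str:
    """Hypothetical stand-in for the HTML-removal step."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join(" ".join(parser.chunks).split())


def to_single_column_csv(records) -> str:
    """Write one plain-text value per row under an assumed 'content' header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["content"])
    for raw in records:
        text = remove_html(remove_http_header(raw))
        if text:  # skip records with no text left after cleaning
            writer.writerow([text])
    return buf.getvalue()
```

Calling `to_single_column_csv` on a list of raw response strings returns the CSV as a string. Writing one value per line to a text file instead would cover the "single text file" option.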