archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Update PlainTextExtractor to just extract text #452

Closed: ruebot closed this issue 4 years ago

ruebot commented 4 years ago

Currently there is a fair bit of overlap between PlainTextExtractor and WebPagesExtractor. Really, the only differences between them now are the name of the content/text column and the additional columns that WebPagesExtractor outputs.

I propose that PlainTextExtractor change to something that is more in the spirit of its name. It should run RemoveHTMLDF, RemoveHTTPHeaderDF, and a DataFrame version of ExtractBoilerpipeTextRDD, then output a single column (CSV or Parquet), or possibly a single text file.
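To make the proposed shape concrete, here is a rough sketch in plain Python. It is not the toolkit's API: `remove_http_header`, `remove_html`, `extract_plain_text`, and `to_single_column_csv` are hypothetical stand-ins for RemoveHTTPHeaderDF, RemoveHTMLDF, and the extractor itself, and the `content` column name is an assumption. The boilerplate-removal step is left out, since Boilerpipe is a separate Java library.

```python
import csv
import io
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def remove_http_header(raw):
    """Hypothetical stand-in for RemoveHTTPHeaderDF: drop the header block
    (everything up to the first blank line) when the record starts with an
    HTTP status line; otherwise return the record unchanged."""
    if not raw.startswith("HTTP/"):
        return raw
    pieces = re.split(r"\r?\n\r?\n", raw, maxsplit=1)
    return pieces[1] if len(pieces) == 2 else ""


def remove_html(markup):
    """Hypothetical stand-in for RemoveHTMLDF: keep only the text nodes.
    Text nodes are joined with a space, so inline tags inside a word
    would split it; good enough for a sketch."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(" ".join(collector.parts).split())


def extract_plain_text(records):
    """Strip HTTP headers, then HTML, from each raw record."""
    return [remove_html(remove_http_header(record)) for record in records]


def to_single_column_csv(texts, column="content"):
    """Write the texts as a one-column CSV; the column name is an assumption."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column])
    for text in texts:
        writer.writerow([text])
    return buffer.getvalue()
```

Calling `to_single_column_csv(extract_plain_text(records))` on a list of raw HTTP responses would give one text value per row and nothing else, which is the single-column output described above.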

ianmilligan1 commented 4 years ago

Yes, that's a great idea @ruebot - I think that's more in the spirit of its name, and you could imagine it fitting into a pipeline that feeds text analysis better than WebPagesExtractor does.