DistrictDataLabs / baleen

An automated ingestion service for blogs to construct a corpus for NLP research.
MIT License

Export Compressed Posts #91

Open bbengfort opened 7 years ago

bbengfort commented 7 years ago

According to these numbers:

http://bbengfort.github.io/observations/2017/06/07/compression-benchmarks.html

We can get much smaller exports if we gzip each file individually as it is written out. This should help our export and admin process a great deal.

Python 3 supports gzip in the standard library (https://docs.python.org/3/library/gzip.html), and I'd like to use it in the export process.
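
A minimal sketch of what per-file compression could look like, assuming each post is exported as text (HTML, for example) to its own file. The helper names `write_compressed` and `read_compressed` are hypothetical and are not part of baleen's exporter; only `gzip.open` comes from the standard library.

```python
import gzip


def write_compressed(path, text, level=9):
    """Write text as gzip-compressed UTF-8 to path + '.gz' and return that path.

    Hypothetical helper: the '.gz' suffix and default level are assumptions.
    """
    gz_path = path + ".gz"
    # Text mode ("wt") lets gzip handle the encoding to bytes for us.
    with gzip.open(gz_path, "wt", encoding="utf-8", compresslevel=level) as f:
        f.write(text)
    return gz_path


def read_compressed(gz_path):
    """Read a gzip-compressed UTF-8 file back into a string."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        return f.read()
```

Because each post is compressed on its own, a single document can still be read back without unpacking the whole export, and the resulting files work with standard tools such as `zcat`.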

bbengfort commented 7 years ago

@ojedatony1616 @rebeccabilbro this might be a good idea to include in the book as well.