dimitrismistriotis / alt-profanity-check

A fast, robust library to check for offensive language in strings; a drop-in replacement for "profanity-check".
https://pypi.org/project/alt-profanity-check/
MIT License
69 stars 16 forks

Compress training data #25

Closed dimitrismistriotis closed 1 year ago

dimitrismistriotis commented 1 year ago

At the time of writing, from https://pypistats.org/packages/alt-profanity-check:

Downloads last day: 828
Downloads last week: 7,826
Downloads last month: 187,159

We have seen higher numbers, exceeding 200K downloads per month. Additionally, there is a fork that has removed the data, probably to make the library install faster.

It is a requirement, and the reason this project exists, that the data should be "as close as possible" to the models being trained. On the other hand, 99.99% of people using the library just want to use it and are not interested in having access to the data. For them, downloading an approximately 60MB file is not helpful.

With that in mind, we (me + @menkotoglou) discussed two possibilities: (a) move the data to another repository, or (b) compress it. We opted for (b).

The last step was the choice of algorithm. Long story short, we decided on 7z, knowing that some systems ship without it installed:


ls -lh clean_data.*
-rw-rw-r-- 1 dimitrios dimitrios 64M Jun 13 09:16 clean_data.csv
-rw-rw-r-- 1 dimitrios dimitrios 18M Jun 13 09:40 clean_data.tar.xz
-rw-rw-r-- 1 dimitrios dimitrios 26M Jun 13 09:42 clean_data.zip
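
Going by the rounded sizes in the listing, the .tar.xz is roughly 28% of the original CSV and the .zip roughly 41%. A similar comparison can be reproduced with Python's standard library alone, which may also help on systems without 7z or xz installed. The following is only a minimal sketch, not the project's actual tooling: the sample payload, the column names and the helper functions are hypothetical stand-ins, and the ratios it prints depend on the data.

```python
import csv
import io
import tarfile
import zipfile


def make_sample_csv(rows: int = 20_000) -> bytes:
    """Build a CSV-shaped payload (hypothetical stand-in for clean_data.csv)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["is_offensive", "text"])  # assumed column names
    for i in range(rows):
        writer.writerow([i % 2, f"sample comment number {i} with some repeated words"])
    return buffer.getvalue().encode("utf-8")


def pack_tar_xz(name: str, payload: bytes) -> bytes:
    """Pack one file into an in-memory .tar.xz archive (LZMA-based)."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:xz") as archive:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return out.getvalue()


def pack_zip(name: str, payload: bytes) -> bytes:
    """Pack one file into an in-memory .zip archive (DEFLATE)."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, payload)
    return out.getvalue()


def read_csv_from_tar_xz(blob: bytes, name: str) -> list[list[str]]:
    """Read a CSV member back out of a .tar.xz archive without unpacking to disk."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:xz") as archive:
        member = archive.extractfile(name)
        if member is None:
            raise ValueError(f"{name} is not a regular file in the archive")
        text = io.TextIOWrapper(member, encoding="utf-8", newline="")
        return list(csv.reader(text))


if __name__ == "__main__":
    payload = make_sample_csv()
    tar_xz_blob = pack_tar_xz("clean_data.csv", payload)
    zip_blob = pack_zip("clean_data.csv", payload)

    # Print each size as a fraction of the uncompressed payload.
    for label, blob in (("csv", payload), ("tar.xz", tar_xz_blob), ("zip", zip_blob)):
        print(f"{label:>7}: {len(blob):>9} bytes ({len(blob) / len(payload):.1%})")

    rows = read_csv_from_tar_xz(tar_xz_blob, "clean_data.csv")
    print(f"rows read back: {len(rows)}")
```

Reading the archive through `tarfile` means a training script could consume the compressed file directly, so only people who want to inspect the raw CSV would need to decompress it by hand.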