Desbordante / desbordante-core

Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also supports running data cleaning scenarios with these algorithms. Desbordante has a console version and an easy-to-use web application.
GNU Affero General Public License v3.0

10 simple stats for strings #403

Closed Naniduan closed 4 months ago

Naniduan commented 4 months ago

The 10 statistics are:

  1. words - the unique words in the column
  2. topKChars - the k most frequently used characters
  3. topKWords - the k most frequently used words
  4. minWords - the smallest number of words in the column
  5. maxWords - the largest number of words in the column
  6. wordCount - the total number of words in the column
  7. minChars - the smallest number of characters in the column
  8. maxChars - the largest number of characters in the column
  9. entirelyLowercaseCount - the number of entirely lowercase words
  10. entirelyUppercaseCount - the number of entirely uppercase words

Python bindings, tests, and a Python example were also added for these 10 statistics.
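
For reference, here is a minimal pure-Python sketch of what the ten statistics above compute. It is not the code from this PR and does not use the Desbordante bindings. The function name `string_column_stats`, splitting words on whitespace, taking the min/max values per column value, and ignoring whitespace when counting characters are all assumptions made for illustration; the actual implementation may define these differently.

```python
from collections import Counter


def string_column_stats(column, k=3):
    """Illustrative sketch: compute the ten string statistics for a list of strings."""
    # Assumption: a "word" is a maximal run of non-whitespace characters.
    words_per_value = [value.split() for value in column]
    all_words = [word for words in words_per_value for word in words]

    # Assumption: whitespace is not counted as a character for topKChars.
    char_counts = Counter(ch for value in column for ch in value if not ch.isspace())
    word_counts = Counter(all_words)

    return {
        "words": set(all_words),
        "topKChars": [ch for ch, _ in char_counts.most_common(k)],
        "topKWords": [word for word, _ in word_counts.most_common(k)],
        # Assumption: min/max are taken over individual values; 0 for an empty column.
        "minWords": min((len(words) for words in words_per_value), default=0),
        "maxWords": max((len(words) for words in words_per_value), default=0),
        "wordCount": len(all_words),
        "minChars": min((len(value) for value in column), default=0),
        "maxChars": max((len(value) for value in column), default=0),
        # str.islower/isupper require at least one cased character,
        # so purely numeric words count as neither.
        "entirelyLowercaseCount": sum(word.islower() for word in all_words),
        "entirelyUppercaseCount": sum(word.isupper() for word in all_words),
    }


if __name__ == "__main__":
    stats = string_column_stats(["Hello world", "foo BAR foo", "x"], k=1)
    for name, value in stats.items():
        print(f"{name}: {value}")
```

Returning a plain dictionary keyed by the statistic names keeps the sketch easy to compare against the list above.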

vs9h commented 4 months ago

1) We use English in PRs (for descriptions and conversations).
2) Rewrite the commit history: new commits should not fix previous ones within the same pull request, and commit names should be meaningful.

I haven't reviewed the code yet; I've only pointed out the issues that should be fixed first.