[Bug Report] "min_df" parameter value incorrect for CountVectorizer. Causes low vocab size; and training job fails.

aws / amazon-sagemaker-examples

Link to the notebook

https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/ntm_20newsgroups_topic_modeling/ntm_20newsgroups_topic_model.ipynb

Specific line with bug:

vectorizer = CountVectorizer(input='content', analyzer='word', stop_words='english', tokenizer=LemmaTokenizer(), max_features=vocab_size, max_df=0.95, min_df=0.2)

Describe the bug

The min_df parameter value is incorrect.

The current value (0.2), which was introduced in the last commit on Oct 30, 2020, results in a vocab size of 6, which is very low.

This in turn causes the training job to fail, because vocab size is still passed as 2000 there (the correct, intended value) and no longer matches the vocabulary actually produced.

The correct value should be min_df=2.

See: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

min_df : float in range [0.0, 1.0] or int, default=1
    When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
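To make the float-versus-int distinction concrete, here is a minimal sketch of the threshold arithmetic described in the doc excerpt above. It is not scikit-learn's code: the helper name `min_doc_count` and the corpus size `n_docs = 10_000` are assumptions chosen for illustration, not values taken from the notebook.

```python
from numbers import Integral


def min_doc_count(min_df, n_docs):
    """Return the smallest document count a term needs in order to be kept.

    Hypothetical helper mirroring the documented rule:
    an int is an absolute count, a float is a proportion of documents.
    """
    if isinstance(min_df, Integral):
        return min_df           # int: absolute number of documents
    return min_df * n_docs      # float: fraction of the corpus


n_docs = 10_000  # assumed corpus size, for illustration only

# min_df=0.2 demands that a term appear in roughly 20% of all documents.
print(min_doc_count(0.2, n_docs))

# min_df=2 only demands that a term appear in at least 2 documents.
print(min_doc_count(2, n_docs))
```

Under that assumed corpus size, 0.2 translates to a cut-off of about 2000 documents per term, which very few words can meet, whereas 2 removes only terms that occur in a single document.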

To Reproduce

Run the notebook on SageMaker Studio with the Python 3 (Data Science) kernel.

I was able to reproduce this, obtaining a vocab size of 6 words with min_df=0.2 but the full vocab size of 2000 words with min_df=2. The sklearn doc explains that a float value represents a proportion of documents, while an int value represents an absolute count. I agree that this is a bug, and having a max_df=0.95 and min_df=2 is not contradictory. The max document frequency is 95% of documents, and min document frequency is 2 documents.
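For anyone who wants to see the effect without running the whole notebook, the sketch below reproduces the same behavior on a toy corpus. The documents are made up for illustration, and the notebook's `LemmaTokenizer`, stop words, and `max_features` settings are left out to keep it short; only `max_df` and `min_df` are varied.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus of 20 documents (made up for illustration):
# "apple" and "banana" appear in 2 documents, "cherry" and "date" in 18.
docs = ["apple banana"] * 2 + ["cherry date"] * 18


def vocab_for(min_df):
    """Fit a CountVectorizer with the given min_df and return its sorted vocabulary."""
    vectorizer = CountVectorizer(max_df=0.95, min_df=min_df)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)


# Float: a term must appear in at least 20% of documents (4 of 20).
vocab_float = vocab_for(0.2)
print(vocab_float)  # ['cherry', 'date']

# Int: a term must appear in at least 2 documents.
vocab_int = vocab_for(2)
print(vocab_int)  # ['apple', 'banana', 'cherry', 'date']
```

With `min_df=0.2` the rarer terms are dropped, while `min_df=2` keeps them, and `max_df=0.95` is satisfied in both cases since no term appears in more than 95% of the documents.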

Also see when this was changed: https://github.com/aws/amazon-sagemaker-examples/pull/1601, and a comment that raises the same issue, https://github.com/aws/amazon-sagemaker-examples/commit/bc707152691a58fa219e93e2fea89e5e88ca0974#commitcomment-48304957.

aws / amazon-sagemaker-examples, issue #2107