Dig into this vectorizer to understand how it works. Specifically, understand and document each parameter of the vectorizer, and prepare a code example for each.
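As a starting point for the per-parameter examples, here is a minimal sketch that combines a few commonly used CountVectorizer parameters (lowercase, ngram_range, min_df, binary). The toy corpus and the chosen values are assumptions made for illustration. They are not taken from the Milestone notebooks.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "Net income increased",
    "Net income decreased",
    "Cash flow increased",
]

vec = CountVectorizer(
    lowercase=True,       # default: text is lowercased before tokenizing
    ngram_range=(1, 2),   # extract unigrams and bigrams
    min_df=2,             # keep only terms found in at least 2 documents
    binary=True,          # record presence (0/1) instead of raw counts
)
X = vec.fit_transform(docs)

# vocabulary_ maps each kept term to its column index (columns are sorted alphabetically).
terms = sorted(vec.vocabulary_)
print(terms)
print(X.toarray())
```

With min_df=2, terms that occur in only one document (for example "cash" or "decreased") are dropped, so the matrix keeps only terms shared across documents.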
How important is the selected number of max_features?
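To get a feel for what max_features does, the sketch below fits one vectorizer without a cap and one with max_features=3 on a small invented corpus. The corpus and the cap value are assumptions for illustration. The right value for the project data would have to be found empirically, for example by comparing downstream model quality at several caps.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "revenue increased and net income increased",
    "net revenue decreased while expenses increased",
    "audit of net revenue",
]

# No cap: every distinct token becomes a column.
full = CountVectorizer()
full.fit(docs)

# Cap: keep only the 3 terms with the highest total count across the corpus.
capped = CountVectorizer(max_features=3)
X_capped = capped.fit_transform(docs)

full_size = len(full.vocabulary_)
kept_terms = sorted(capped.vocabulary_)
print(full_size)
print(kept_terms)
print(X_capped.toarray())
```

Terms outside the cap are discarded entirely, so a small max_features shrinks the matrix but can drop rare terms that carry signal. A large value keeps more of the vocabulary at the cost of a wider, sparser matrix.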
Could we give a stop word list manually to the vectorizer, i.e. the union of an accounting-specific stop word list and the standard English stop word list?
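The stop_words parameter accepts a list, so a union of this kind can be passed directly. Below is a minimal sketch that uses scikit-learn's built-in ENGLISH_STOP_WORDS set. The accounting-specific words are hypothetical placeholders, not an agreed list.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical accounting-specific stop words; replace with the real list.
accounting_stop_words = {"company", "fiscal", "quarter"}

# ENGLISH_STOP_WORDS is a frozenset; stop_words expects a list, so convert the union.
combined_stop_words = list(ENGLISH_STOP_WORDS.union(accounting_stop_words))

# Toy corpus, invented for illustration only.
docs = [
    "The company reported revenue for the fiscal quarter",
    "The company increased revenue in the quarter",
]

vectorizer = CountVectorizer(stop_words=combined_stop_words)
X_stop = vectorizer.fit_transform(docs)

remaining_terms = sorted(vectorizer.vocabulary_)
print(remaining_terms)
```

Custom stop words should be given in lowercase, because with the default lowercase=True the tokens are lowercased before they are compared against the list.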
We are using CountVectorizer from sklearn.feature_extraction.text in Milestone_1_W_Relevant_Data and Milestone_1 like below:
max_features?