NLP Notes - Githubissues

coatk1 commented 1 year ago

Intro to NLP

nltk
gensim
defaultdict
itertools.chain.from_iterable()
tf-idf
- Term frequency = percentage share of the word compared to all tokens in the document
- Inverse document frequency = logarithm of the total number of documents in a corpora divided by the number of documents containing the term
spacy
polyglot
scikit-learn
- CountVectorizer for preprocessing
- Error with data in test that is not in training data. Add more training data or remove words from test that isnt in training data.
Naive Bayes Classifier
Sentiment Analysis?

# Create the list of alphas: alphas
alphas = np.arange(0, 1, 0.1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

coatk1 commented 1 year ago

Sentiment Analysis

textblob
wordcloud (uses matplotlib)
Bag of Words
- CountVectorizer() (is sparse matrix convert to .toarray())
n-grams: a sequence of tokens (for additional context)
from langdetect import detect_langs
Stopwords
- from wordcloud import WordCloud, STOPWORDS
- from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
Stemming vs. lemmatization
- from nltk.stem import PorterStemmer
- from nltk.stem import WordNetLemmatizer
Regression
- Linear: numeric outcome
- Logistic: probability
- from sklearn.linear_model import LogisticRegression

coatk1 commented 1 year ago

Advanced NLP with SpaCy

Word vectors and similarity
Semi-automate labeling with spaCy's Matcher

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{"LOWER": "iphone"}, {"OP": "?", "IS_DIGIT": True}]

# Add patterns to the matcher
matcher.add('GADGET', None, pattern1, pattern2)

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Find the matches in the doc
    matches = matcher(doc)

    # Get a list of (start, end, label) tuples of matches in the text
    entities = [(start, end, 'GADGET') for match_id, start, end in matches]
    print(doc.text, entities)

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, 'GADGET') for span in spans]

    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {'entities': entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep='\n')

coatk1 / datacamp

NLP Notes #40

Intro to NLP

Sentiment Analysis

Advanced NLP with SpaCy