andrewtavis / kwx

BERT, LDA, and TFIDF based keyword extraction in Python
BSD 3-Clause "New" or "Revised" License
69 stars 10 forks

Text by text keyword extraction in dataset #36

Closed: AhmetCakar closed this issue 3 years ago

AhmetCakar commented 3 years ago

First of all, thank you for the model. I want to do something like this: say there are 20 texts in my dataset, and I want to extract the keywords for each text. How can I do that?

andrewtavis commented 3 years ago

Hi there :) If you have the texts in a dataframe as different rows, then it'd be as simple as preparing a text corpus from the given row using the following:

from kwx.utils import prepare_data

input_language = "english" # or your language
row_you_want = an_integer

text_corpus = prepare_data(
    data=your_df.loc[row_you_want],
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

From there, follow the steps depicted in the readme (only use the corpus_no_ngrams step if you're using BERT; otherwise pass text_corpus from above):

from kwx.model import extract_kws

num_keywords = 15
num_topics = 10
ignore_words = ["words", "user", "knows", "they", "don't", "want"]

# Remove n-grams for BERT training
corpus_no_ngrams = [
    " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
]

# We can pass keywords for sentence_transformers.SentenceTransformer.encode,
# gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer
bert_kws = extract_kws(
    method="BERT", # "BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    show_progress_bar=True,
    batch_size=32,
)

Do that for each of your rows, and you've got the keywords for each of your texts. If you have the data in a csv, then pass the csv's path to kwx.utils.prepare_data, or load it into a pandas dataframe first.
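
For the csv route, a minimal sketch of both options (the file name and text column are stand-ins):

import pandas as pd
from kwx.utils import prepare_data

# Option 1: pass the csv path directly
text_corpus = prepare_data(
    data="tweets.csv",  # stand-in path
    target_cols="text",  # stand-in column name
    input_language="english",
)

# Option 2: load into a dataframe first, then pass the dataframe
df = pd.read_csv("tweets.csv")
text_corpus = prepare_data(
    data=df,
    target_cols="text",
    input_language="english",
)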

Let me know if any of the above is confusing 😊

AhmetCakar commented 3 years ago

I may be asking simple questions, as I am very new to Python and NLP. First of all, I apologize for that.

[screenshot: hata_github]

I am getting this error and cannot figure out the cause.

My second question is: do we have a chance to automate this? For example, say I give it a csv file containing thousands of tweets. Could BERT analyze it and get the keywords for each tweet?

andrewtavis commented 3 years ago

@AhmetCakar, my mistake in the explanation above. When you subset by a row, you're making a series, which then isn't being handled properly :) Thanks for pointing this out! The fix is going through in #37, after which I'll update PyPI, and then the above should work once you do the following:

pip install kwx -U

It all should work with 0.1.8.1, but if not then let me know :)
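
For context, here's the series point in plain pandas terms (a minimal illustration; df is a stand-in):

import pandas as pd

df = pd.DataFrame({"text": ["first tweet", "second tweet"]})

df.loc[0]    # a pd.Series (a single row), which kwx didn't handle before 0.1.8.1
df.loc[[0]]  # a single-row pd.DataFrame, by contrast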

As for your other question: automating this would be as simple as, say, doing a for loop over the dataframe indexes and appending the results to a new list. Pseudocode for that is:

kw_results = []
for i in pd_tweets.index:
    # Prepare the given row's corpus, then pass it with your other arguments
    tweet_kws = extract_kws(...)
    kw_results.append(tweet_kws)
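
Spelled out a bit more, under the assumptions from the snippets above (pd_tweets and its "text" column are stand-ins):

from kwx.utils import prepare_data
from kwx.model import extract_kws

kw_results = []
for i in pd_tweets.index:
    # Prepare a corpus from the single row (a series, supported as of 0.1.8.1)
    row_corpus = prepare_data(
        data=pd_tweets.loc[i],
        target_cols="text",
        input_language="english",
    )
    # Extract this tweet's keywords
    tweet_kws = extract_kws(
        method="BERT",
        bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
        text_corpus=row_corpus,
        input_language="english",
        num_keywords=num_keywords,
        prompt_remove_words=False,  # don't prompt the user on every iteration
    )
    kw_results.append(tweet_kws)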

With this being said, what you're talking about is kind of overkill. Tweets being limited to 140 characters means each one only has so many words in it. You'd be able to find the "keywords" of the average tweet just by getting rid of the stopwords.
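
For illustration, a minimal sketch of that simpler approach (the tiny stopword list is a stand-in; in practice you'd use a full list, e.g. NLTK's):

stopwords = {"the", "a", "an", "is", "and", "to", "of", "in", "i", "it", "this"}  # stand-in list

def tweet_keywords(tweet):
    # Everything that isn't a stopword counts as a "keyword"
    return [t for t in tweet.lower().split() if t not in stopwords]

print(tweet_keywords("i think this is the best model in the world"))
# ['think', 'best', 'model', 'world']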

andrewtavis commented 3 years ago

v0.1.8.1 is now live, thus adding the ability to pass series data. If you have further questions, then let me know :)

Thanks for the issue!

AhmetCakar commented 3 years ago

Really, your answers have helped me a lot in understanding this subject. Thank you so much. Good work.

andrewtavis commented 3 years ago

@AhmetCakar, my absolute pleasure :) Welcome to Python and NLP!