Hi there :) If you have the texts in a dataframe as different rows, then it'd be as simple as preparing a text corpus from a given row using the following:
from kwx.utils import prepare_data
input_language = "english" # or your language
row_you_want = an_integer
text_corpus = prepare_data(
    data=your_df.loc[row_you_want],
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)
From there you'd follow the steps depicted in the readme (only use the corpus_no_ngrams step if you're using BERT; otherwise pass text_corpus from above):
from kwx.model import extract_kws
num_keywords = 15
num_topics = 10
ignore_words = ["words", "user", "knows", "they", "don't", "want"]
# Remove n-grams for BERT training
corpus_no_ngrams = [
    " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
]
# We can pass keyword arguments for sentence_transformers.SentenceTransformer.encode,
# gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer
bert_kws = extract_kws(
    method="BERT",  # "BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    show_progress_bar=True,
    batch_size=32,
)
Do that for each of your rows, and you've got the keywords for each of your texts. If you have the texts in a CSV, then pass the path to the CSV to kwx.utils.prepare_data, or just load it into a pandas dataframe beforehand.
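If it helps, here's a minimal sketch of the CSV route (the tweets.csv path and the "text" column name are hypothetical placeholders, and I'm assuming the same prepare_data arguments as above):

import pandas as pd
from kwx.utils import prepare_data

# Option 1: pass the path to the CSV directly to prepare_data
text_corpus = prepare_data(
    data="tweets.csv",  # hypothetical path to your CSV
    target_cols="text",  # hypothetical name of the column holding the texts
    input_language="english",
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

# Option 2: load the CSV into a dataframe first and pass that instead
pd_tweets = pd.read_csv("tweets.csv")
text_corpus = prepare_data(
    data=pd_tweets,
    target_cols="text",
    input_language="english",
)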
Let me know if any of the above is confusing 😊
I may be asking simple questions, as I am very new to Python and NLP. First of all, I apologize for that.
I am getting this error and cannot figure out the cause.
My second question is: is there a way to automate this? For example, say I give you a CSV file containing thousands of tweets. Could BERT analyze it and extract the keywords for each tweet?
@AhmetCakar, my mistake on the explanation above. When you subset by the row you're making a series, which then is not being assigned properly :) Thanks for pointing this out! The fix for this is going through in #37, after which I'll update PyPI and then the above should work once you do the following:
pip install kwx -U
It all should work with 0.1.8.1, but if not then let me know :)
As far as your other question: automating this would be as simple as, say, doing a for loop over the dataframe indexes and appending the results to a new list. Pseudocode for that is:
kw_results = []
for i in pd_tweets.index:
    tweet_kws = extract_kws(...)
    kw_results.append(tweet_kws)
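A slightly fuller sketch of that loop, reusing prepare_data and extract_kws from the snippets above (pd_tweets and the "text" column are hypothetical, the extract_kws arguments are trimmed for brevity, and passing a single row as a Series assumes 0.1.8.1 or later):

kw_results = []
for i in pd_tweets.index:
    # prepare the corpus for this single tweet, as in the first snippet
    text_corpus = prepare_data(
        data=pd_tweets.loc[i],
        target_cols="text",
        input_language=input_language,
        min_token_freq=0,
        min_token_len=0,
        remove_stopwords=False,
    )
    corpus_no_ngrams = [
        " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
    ]
    tweet_kws = extract_kws(
        method="BERT",
        text_corpus=corpus_no_ngrams,
        input_language=input_language,
        num_keywords=num_keywords,
        num_topics=num_topics,
        prompt_remove_words=False,  # don't prompt on every iteration of the loop
    )
    kw_results.append(tweet_kws)

pd_tweets["keywords"] = kw_results  # one list of keywords per tweet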
With this being said, what you're talking about is kind of overkill. Tweets being 140 characters means that each is only going to have so many words in it. You'd be able to find the "keywords" for the average tweet by just getting rid of the stopwords.
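For instance, a plain illustration of that idea (using NLTK's English stopword list here, which is my assumption; kwx's prepare_data can also drop stopwords via remove_stopwords=True):

from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

stops = set(stopwords.words("english"))
tweet = "this is just an example tweet about keyword extraction"
keywords = [w for w in tweet.split() if w not in stops]
# ['example', 'tweet', 'keyword', 'extraction']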
v0.1.8.1 is now live, thus adding the ability to pass series data. If you have further questions, then let me know :)
Thanks for the issue!
Really, your answers have helped me a lot to understand this subject. Thank you so much. Good work.
@AhmetCakar, my absolute pleasure :) Welcome to Python and NLP!
First of all, thank you for the model. I want to do something like this: for example, there are 20 texts in my dataset, and I want to extract the keywords for each text. How can I do that?