Closed SS159 closed 1 year ago
Thanks for using LitStudy!
Looks like `build_corpus` expects a `DocumentSet`, and it seems that `docs_springer` is not a `DocumentSet` but something else. Could you maybe provide the rest of the notebook, or the line that creates `docs_springer`?
Hi @stijnh, thanks for the quick response.
Sure, here we go:
I have defined `docs_springer` as a `DocumentSet` in my case, and that seems to have resolved the error: the output is no longer an AttributeError, but instead (as below):
Does this look correct to you?
`refine_scopus` returns two document sets: one for the documents found on Scopus and one for the documents not found on Scopus.
You would need to do something like this:
```python
docs_springer, docs_not_found = litstudy.refine_scopus(docs_springer)
print(len(docs_springer), "papers found on Scopus")
print(len(docs_not_found), "papers NOT found on Scopus")
```
Great, thanks stijnh.
Separately, I was wondering whether the full results from the word distribution can somehow be viewed, as the table output seems to provide only a snapshot?
Thanks, as always,
S
@stijnh I'm also a bit confused about the ngram_threshold, even after reading the guidance documents. An ngram_threshold of 0.8 does what exactly? Classifies something as agreeing with/matching that ngram if 80% of its characters are the same as the reference ngram (included in the corpus)?
Sorry for the question, but I can't seem to clarify this on my own, and it would be good to know how LitStudy is working here.
Thanks,
S
Hi,
This is the complete table of all ngrams, that is, all the words that contain a `_` after processing (that is what the `.filter(like="_")` does). Remove the `.filter(...)` part to see the complete word distribution.
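To illustrate (a minimal sketch with made-up data; the real frame comes from `litstudy.compute_word_distribution`, which I'm assuming returns a pandas DataFrame indexed by word):

```python
import pandas as pd

# Hypothetical stand-in for the output of compute_word_distribution:
# a DataFrame indexed by word, with a document-count column.
dist = pd.DataFrame(
    {"count": [120, 45, 30, 12]},
    index=["nature", "nature_solutions", "climate_change", "policy"],
)

# Keep only the ngrams, i.e. index entries containing "_"
# (axis=0 filters on the index labels rather than on the columns).
ngrams_only = dist.filter(like="_", axis=0)
print(ngrams_only.index.tolist())  # → ['nature_solutions', 'climate_change']

# Without the filter, the full word distribution is available.
print(dist.index.tolist())
```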
The parameter `ngram_threshold` determines how sensitive the preprocessing is to detecting bigrams (also called ngrams). The higher the value, the fewer bigrams will be detected. A bigram is a pair of words that frequently appear next to each other (for example, think of phrases like "data processing", "social media", "human rights", "United States"). The actual processing is done by gensim; see the documentation for its `threshold` parameter: https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.Phrases
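So the threshold is a cutoff on a co-occurrence score, not on character similarity. A pure-Python sketch of gensim's default phrase score (the "original" scorer described in the gensim docs), with entirely made-up counts:

```python
# Sketch of gensim's default bigram score:
#   score(a, b) = (count(a, b) - min_count) / (count(a) * count(b)) * vocab_size
# A pair is merged into a bigram like "nature_solutions" only when its
# score exceeds the threshold.
def bigram_score(count_a, count_b, count_ab, min_count, vocab_size):
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# Made-up corpus statistics for illustration.
vocab_size = 1000
min_count = 5

# "nature solutions": the pair co-occurs often relative to how common
# each word is on its own, so the score is high.
strong = bigram_score(120, 80, 60, min_count, vocab_size)

# "the nature": the pair co-occurs, but "the" appears everywhere,
# so the score is low.
weak = bigram_score(5000, 120, 60, min_count, vocab_size)

print(round(strong, 3), round(weak, 3))  # → 5.729 0.092

# Raising the threshold keeps only the strongest pairs.
for threshold in (1.0, 10.0):
    print(threshold, [score > threshold for score in (strong, weak)])
```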
Great, thanks for your help. I have removed the .filter(like="_") and am obviously presented with a larger list. My question is how I can view/export/download this list in its entirety?
Thanks again,
Sam
Hi @stijnh another quick question from me which might have a simple answer, hence why I am not opening it as a new issue:
In the word distribution plot which has been produced below, is the highest result saying that the word 'nature' only appears across 35% of the documents? I am asking because it was a key search term used in the original Scopus search, so in theory all of the documents (that is, 100%) should include the word 'nature'.
Thanks, as always, for your patience and advice,
Sam
The thing returned by `compute_word_distribution` is a regular pandas DataFrame. You can use pandas' I/O functions to export it to a file: https://pandas.pydata.org/docs/reference/io.html For example, you can append `.sort_index().to_csv("word_distribution.csv")`.
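A small end-to-end sketch, using a made-up DataFrame as a stand-in for the real `litstudy.compute_word_distribution(corpus)` output:

```python
import os
import tempfile
import pandas as pd

# Stand-in for compute_word_distribution(corpus); the real call returns
# a regular pandas DataFrame indexed by word.
dist = pd.DataFrame({"count": [35, 20, 12]},
                    index=["nature", "nature_solutions", "policy"])

# Sort alphabetically and write the full table to disk; in the notebook
# the path would simply be "word_distribution.csv".
path = os.path.join(tempfile.gettempdir(), "word_distribution.csv")
dist.sort_index().to_csv(path)

# Reading it back shows every row was written, not just a snapshot.
roundtrip = pd.read_csv(path, index_col=0)
print(len(roundtrip), "rows exported")
```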
Not sure about this one. Maybe sometimes `nature` is followed by `solutions` and it is interpreted as the bigram `nature_solutions`. You can disable bigram detection by removing the `ngram_threshold=` option from `build_corpus`.
Good luck!
Thanks for sharing this @stijnh - one (final) question which isn't clear to me from the guidance: how can we change the parameters to search for trigrams? I have a feeling that the top-scoring bigram below, "nature_solutions", is actually "nature-based solutions" or "nature based solutions", and I would like to capture this in the word distribution output.
Thanks @stijnh , although I can't seem to get pandas to write the DataFrame to a .csv, here's what I'm doing:
There's no error returned, but nothing being written to the .csv either...
Replace `DataFrame = pd.DataFrame()` with `DataFrame = litstudy.compute_word_distribution(corpus).sort_index()`. You were creating an empty DataFrame and then calling `to_excel` on that one.
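That explains the silent failure: an empty DataFrame exports without error but contains no data. A quick illustration with made-up data (using `to_csv` for brevity; `to_excel` behaves the same way, just writing to a file):

```python
import pandas as pd

# Reproducing the symptom: exporting an empty DataFrame raises no error,
# but the resulting CSV contains no data.
empty_csv = pd.DataFrame().to_csv()
print(repr(empty_csv))

# The fix: assign the actual word distribution first. Here a made-up
# stand-in for litstudy.compute_word_distribution(corpus):
dist = pd.DataFrame({"count": [35, 20]}, index=["nature", "policy"])
full_csv = dist.sort_index().to_csv()
print(full_csv)
```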
AttributeError: 'DocumentSet' object has no attribute 'title' is displayed, even after changing the title column header within the relevant CSV file (docs_springer) to read 'title'.
Thanks in advance! :)
Sam