NLeSC / litstudy

LitStudy: Using the power of Python to automate scientific literature analysis from the comfort of a Jupyter notebook
https://nlesc.github.io/litstudy/
Apache License 2.0

Corpus function - AttributeError: 'DocumentSet' object has no attribute 'title' #62

Closed · SS159 closed this 1 year ago

SS159 commented 1 year ago

AttributeError: 'DocumentSet' object has no attribute 'title' is displayed, even after changing the title column header within the relevant CSV file (docs_springer) to read 'title'.

Thanks in advance! :)

Sam

[screenshot: AttributeError: 'DocumentSet' object has no attribute 'title']
stijnh commented 1 year ago

Thanks for using LitStudy!

Looks like build_corpus expects a DocumentSet and it seems that docs_springer is not a DocumentSet but something else.

Could you maybe provide the rest of the notebook, or do you have the line that creates docs_springer?
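
For reference, such a line typically looks something like the sketch below (the loader and file name are assumptions, since the notebook hasn't been shared yet):

```python
import litstudy

# Sketch: load a Springer Link CSV export into a DocumentSet
# (the file name is a placeholder)
docs_springer = litstudy.load_springer_csv("springer_export.csv")

# build_corpus expects this DocumentSet, not a file path or a pandas DataFrame
corpus = litstudy.build_corpus(docs_springer)
```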

SS159 commented 1 year ago

Hi stijnh, thanks for the quick response.

Sure, here we go:

[screenshot]

SS159 commented 1 year ago

I have now created docs_springer as a DocumentSet in my case, and it seems to have resolved the error: the output is no longer an AttributeError, but instead reads as below:

[screenshot]

Does this look correct to you?

stijnh commented 1 year ago

refine_scopus returns two document sets: one for the documents found on Scopus and one for the documents not found on Scopus.

You would need to do something like this:

```python
docs_springer, docs_not_found = litstudy.refine_scopus(docs_springer)
print(len(docs_springer), "papers found on Scopus")
print(len(docs_not_found), "papers NOT found on Scopus")
```
SS159 commented 1 year ago

Great, thanks stijnh.

Separately, I was wondering whether the full results from the word distribution can somehow be viewed, as the table output seems to provide only a snapshot?

[screenshot]

Thanks, as always,

S

SS159 commented 1 year ago

@stijnh I'm also a bit confused about the ngram_threshold, even after reading the guidance documents. An ngram_threshold of 0.8 does what exactly? Classifies something as agreeing with/matching that ngram if 80% of its characters are the same as the reference ngram (included in the corpus)?

Sorry for the question, but I can't seem to clarify this on my own and it would be good to know how LitStudy works here.

Thanks,

S

stijnh commented 1 year ago

Hi,

> Separately, I was wondering whether the full results from the word distribution can somehow be viewed, as the table output seems to provide only a snapshot?

That table is the complete list of all n-grams, meaning all the words that contain a _ after preprocessing (which is what the .filter(like="_") selects).

Remove the .filter(...) part to see the complete word distribution.
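
As a sketch of both variants (assuming the corpus built earlier; axis=0 is given explicitly so that filter matches the words in the DataFrame's index rather than its columns):

```python
dist = litstudy.compute_word_distribution(corpus)

# With the filter: only the merged n-grams, i.e. words containing "_"
print(dist.filter(like="_", axis=0))

# Without the filter: the complete word distribution
print(dist)
```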

> @stijnh I'm also a bit confused about the ngram_threshold, even after reading the guidance documents. An ngram_threshold of 0.8 does what exactly?

The ngram_threshold parameter determines how sensitive the preprocessing is to detecting bigrams (a kind of n-gram): a pair of words is only merged if its association score exceeds the threshold, so the higher the value, the fewer bigrams will be detected. A bigram is a pair of words that frequently appear next to each other (for example, "data processing", "social media", "human rights", "United States"). It has nothing to do with character-level similarity.

The actual processing is done by gensim; see the threshold parameter in its documentation: https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.Phrases
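
As a rough sketch of the effect (the threshold values are illustrative):

```python
# Strict threshold: only very strongly associated word pairs are merged
corpus_strict = litstudy.build_corpus(docs_springer, ngram_threshold=0.8)

# Looser threshold: more word pairs qualify as bigrams
corpus_loose = litstudy.build_corpus(docs_springer, ngram_threshold=0.2)
```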

SS159 commented 1 year ago

Great, thanks for your help. I have removed the .filter(like="_") and am now presented with a larger list. My question is: how can I view/export/download this list in its entirety?

[screenshot]

Thanks again,

Sam

SS159 commented 1 year ago

Hi @stijnh, another quick question from me which might have a simple answer, which is why I am not opening it as a new issue:

In the word distribution plot produced below, is the highest result saying that the word 'nature' appears in only 35% of the documents? I am asking because it was a key search term used in the original Scopus search, so in theory all of the documents (that is, 100%) should include the word 'nature'.

[screenshot]

Thanks, as always, for your patience and advice,

Sam

stijnh commented 1 year ago

> I have removed the .filter(like="_") and am now presented with a larger list. My question is: how can I view/export/download this list in its entirety?

The object returned by compute_word_distribution is a regular pandas DataFrame. You can use the pandas I/O functions to export it to a file: https://pandas.pydata.org/docs/reference/io.html

For example, you can append ...sort_index().to_csv("word_distribution.csv")
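
Spelled out in full, a sketch (assuming the corpus from earlier):

```python
# Compute the full word distribution and write it to a CSV file
dist = litstudy.compute_word_distribution(corpus).sort_index()
dist.to_csv("word_distribution.csv")
```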

> In the word distribution plot produced below, is the highest result saying that the word 'nature' appears in only 35% of the documents? [...] in theory all of the documents (that is, 100%) should include the word 'nature'.

Not sure about this one. Maybe 'nature' is sometimes followed by 'solutions' and interpreted as the bigram nature_solutions, in which case those occurrences no longer count towards the plain word 'nature'. You can disable bigram detection by removing the ngram_threshold= option from build_corpus.
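
As a one-line sketch:

```python
# Without ngram_threshold, build_corpus performs no bigram merging,
# so "nature" and "solutions" remain separate tokens
corpus_plain = litstudy.build_corpus(docs_springer)
```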

Good luck!

SS159 commented 1 year ago

Hi,

> The ngram_threshold parameter determines how sensitive the preprocessing is to detecting bigrams [...] The actual processing is done by gensim.

Thanks for sharing this @stijnh. One (final) question which isn't clear to me from the guidance: how can we change the parameters to search for trigrams? I have a feeling that the top-scoring bigram below, "nature_solutions", is actually "nature-based solutions" or "nature based solutions", and I would like to capture this in the word distribution output.

[screenshot]

SS159 commented 1 year ago

> The object returned by compute_word_distribution is a regular pandas DataFrame. [...] For example, you can append ...sort_index().to_csv("word_distribution.csv")

Thanks @stijnh, although I can't seem to get pandas to write the DataFrame to a .csv. Here's what I'm doing:

[screenshots]

There's no error returned, but nothing is being written to the .csv either...

stijnh commented 1 year ago

> There's no error returned, but nothing is being written to the .csv either...

Replace

```python
DataFrame = pd.DataFrame()
```

with

```python
DataFrame = litstudy.compute_word_distribution(corpus).sort_index()
```

You were creating an empty DataFrame and then calling to_excel on that one, which is why the exported file came out empty.
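
Put together, the corrected cell would look roughly like this (a sketch; the file name is illustrative, and to_excel additionally requires openpyxl):

```python
# Compute the word distribution and export it directly,
# without creating an empty DataFrame first
df = litstudy.compute_word_distribution(corpus).sort_index()
df.to_csv("word_distribution.csv")       # plain CSV
# df.to_excel("word_distribution.xlsx")  # or Excel, needs openpyxl
```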