NLeSC / litstudy

LitStudy: Using the power of Python to automate scientific literature analysis from the comfort of a Jupyter notebook
https://nlesc.github.io/litstudy/
Apache License 2.0

Scopus400Error: Error translating query - Refining results with "source title" query argument #89

Open FabianEUR opened 6 months ago

FabianEUR commented 6 months ago

Hi,

Is it possible to refine/process publications from Scopus limited to source titles containing a specified keyword? For example, my query (and variations thereof) gives me the above error after refining around 2000 publications:

( TITLE-ABS-KEY ( "recommend sys" OR "recommend servi" ) AND SRCTITLE ( "comput*" OR "acm" ) )

I had a look at the API reference and the existing issues, but I had some trouble finding an answer to my question.

Thank you

stijnh commented 6 months ago

Thanks for using litstudy!

I cannot see the error. The query looks fine. Does the query work if you use it on the Scopus website?

FabianEUR commented 6 months ago

The query works on Scopus and I can find publications which can be exported. I've tried variations without quotation marks, with/without brackets, with only one keyword, with/without the wildcard, etc., but I get the same error.

Here is the error:

Scopus400Error                            Traceback (most recent call last)
Cell In[2], line 7
      4 import logging
      5 logging.getLogger().setLevel(logging.CRITICAL)
----> 7 docs_scopus, docs_not_found = litstudy.refine_scopus(docs_scopus)
      8 print(len(docs_scopus), "papers found on Scopus")
      9 print(len(docs_not_found), "papers NOT found on Scopus")

File ~\AppData\Roaming\Python\Python311\site-packages\litstudy\sources\scopus.py:248, in refine_scopus(docs, search_title)
    244                     return ScopusDocument.from_eid(record.eid)
    246     return None
--> 248 return docs._refine_docs(callback)

File ~\AppData\Roaming\Python\Python311\site-packages\litstudy\types.py:53, in DocumentSet._refine_docs(self, callback)
     50 old_docs = []
     52 for i, doc in enumerate(progress_bar(self.docs)):
---> 53     new_doc = callback(doc)
     55     if new_doc is not None:
     56         new_indices.append(i)

File ~\AppData\Roaming\Python\Python311\site-packages\litstudy\sources\scopus.py:236, in refine_scopus.<locals>.callback(doc)
    234 if len(title) > 10 and search_title:
    235     query = f"TITLE({title})"
--> 236     response = ScopusSearch(query, view="STANDARD", download=False)
    237     nresults = response.get_results_size()
    239     if nresults > 0 and nresults < 10:

File ~\AppData\Roaming\Python\Python311\site-packages\pybliometrics\scopus\scopus_search.py:206, in ScopusSearch.__init__(self, query, refresh, view, verbose, download, integrity_fields, integrity_action, subscriber, **kwds)
    204 self._query = query
    205 self._view = view
--> 206 Search.__init__(self, query=query, api='ScopusSearch', count=count,
    207                 cursor=subscriber, download=download,
    208                 verbose=verbose, **kwds)

File ~\AppData\Roaming\Python\Python311\site-packages\pybliometrics\scopus\superclasses\search.py:62, in Search.__init__(self, query, api, count, cursor, download, verbose, **kwds)
     59 self._cache_file_path = get_folder(api, self._view)/stem
     61 # Init
---> 62 Base.__init__(self, params=params, url=URLS[api], download=download,
     63               api=api, verbose=verbose)

File ~\AppData\Roaming\Python\Python311\site-packages\pybliometrics\scopus\superclasses\base.py:66, in Base.__init__(self, params, url, api, download, verbose, *args, **kwds)
     64         self._json = loads(fname.read_text())
     65 else:
---> 66     resp = get_content(url, api, params, *args, **kwds)
     67     header = resp.headers
     69     if ab_ref_retrieval:

File ~\AppData\Roaming\Python\Python311\site-packages\pybliometrics\scopus\utils\get_content.py:116, in get_content(url, api, params, **kwds)
    114         except:
    115             reason = ""
--> 116     raise errors[resp.status_code](reason)
    117 except KeyError:
    118     resp.raise_for_status()

Scopus400Error: Error translating query

--

I'm using jupyter notebook and have the same error via uni VPN and on campus.

stijnh commented 6 months ago

This looks like a bug. litstudy searches Scopus for the title of each paper using the query "TITLE({title})", but for certain titles this produces a query with syntax that Scopus rejects. This will need further investigation.

However, I don't really understand the line litstudy.refine_scopus(docs_scopus). You have loaded documents from Scopus into docs_scopus and then want to refine them again using Scopus? Or do you load the original documents from a file?

FabianEUR commented 6 months ago

Ahh, maybe that explains why the refining always only works until a certain publication before the error appears.

I exported the .csv from scopus and loaded the file into docs_scopus and then refined them. Is this only meant to be done for non-scopus datasets?

stijnh commented 6 months ago

> Ahh, maybe that explains why the refining always only works until a certain publication before the error appears.

Indeed. If you could figure out which publication it fails on, you can remove that one from the dataset as a temporary solution.
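
One way to locate the failing publication is to process the records one at a time and note which ones raise an error. The sketch below is generic Python, not litstudy API: find_failures is a hypothetical helper, and the action you pass in (for example, a function that refines a single document) is an assumption you would need to supply yourself.

```python
from typing import Any, Callable, Iterable, List, Tuple


def find_failures(
    items: Iterable[Any], action: Callable[[Any], Any]
) -> List[Tuple[int, Any, Exception]]:
    """Hypothetical helper: apply `action` to each item and collect failures.

    Returns a list of (index, item, exception) for every item on which
    `action` raised, instead of stopping at the first error.
    """
    failures = []
    for index, item in enumerate(items):
        try:
            action(item)
        except Exception as exc:  # broad on purpose: this is a diagnostic pass
            failures.append((index, item, exc))
    return failures
```

The returned indices could then be used to drop the offending records from the dataset before refining the rest.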

> I exported the .csv from scopus and loaded the file into docs_scopus and then refined them. Is this only meant to be done for non-scopus datasets?

That is fine. If you load it from a CSV file, it indeed makes sense to refine it afterwards. The function refine_scopus should work on any dataset from any source. It fails here because of a bug :-(

stijnh commented 6 months ago

If you would like to look into this issue, we are happy to accept pull requests!

I think what needs to happen is that the title is "stripped" of punctuation before it is passed to Scopus. For example, if the title is something like:

Research on the number of prime numbers between n² and (n+1)²

The query sent to Scopus will be:

TITLE(Research on the number of prime numbers between n² and (n+1)²)

but all those non-alphabetic characters result in a query that is not accepted by Scopus.
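
A minimal sketch of such a stripping step, assuming it is enough to keep only letters, digits and single spaces. The names sanitize_title and build_title_query are hypothetical and not part of litstudy, and the exact set of characters that Scopus rejects has not been verified here:

```python
import re
import unicodedata


def sanitize_title(title: str) -> str:
    """Hypothetical helper: reduce a title to letters, digits and single spaces."""
    # NFKC folds compatibility characters, e.g. a superscript two becomes "2",
    # while keeping accented letters as single composed characters.
    text = unicodedata.normalize("NFKC", title)
    # Replace anything that is not a letter, digit or whitespace with a space
    # (the underscore counts as a word character, so it is handled explicitly).
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def build_title_query(title: str) -> str:
    """Build the TITLE(...) query string from a sanitized title."""
    return f"TITLE({sanitize_title(title)})"
```

For the example title above, this would produce TITLE(Research on the number of prime numbers between n2 and n 1 2). Whether a query this loose still narrows the search to the right paper would need to be checked against Scopus.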

Additionally, in the case where a Document already has a ScopusID, we can just query Scopus directly for the publication without having to search based on the title (I think the CSV file already provides the ScopusID).
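
The "ID first, title search as fallback" idea could look roughly like the sketch below. Everything in it is an assumption for illustration: lookup_document is a hypothetical function, and by_eid and by_title are placeholders for the real Scopus calls (the traceback suggests ScopusDocument.from_eid could back the first one). The length check on the title mirrors the len(title) > 10 condition visible in the traceback.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def lookup_document(
    eid: Optional[str],
    title: Optional[str],
    by_eid: Callable[[str], Optional[T]],
    by_title: Callable[[str], Optional[T]],
) -> Optional[T]:
    """Hypothetical helper: prefer a direct ID lookup over a title search."""
    if eid:
        # A known identifier avoids building a TITLE(...) query at all.
        return by_eid(eid)
    if title and len(title) > 10:
        # Fall back to the title search only when no identifier is available.
        return by_title(title)
    return None
```

Passing the two lookups in as callables keeps the decision logic testable without network access or an API key.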