NationalLibraryOfNorway / DHLAB

DHLAB is a library of Python modules for accessing text and pictures at the National Library of Norway.
https://nationallibraryofnorway.github.io/DHLAB/
MIT License

What is the max limit for search hits? #216

Open · Hegghammer opened this issue 1 month ago

Hegghammer commented 1 month ago

When I search for something that (presumably) yields a lot of hits, I get a JSONDecodeError. If I narrow the chronological window, it works. What is the max number of items that the system can return?

This, for example, yields a JSONDecodeError:

import dhlab as dh

kp = dh.Corpus(
    doctype="digavis",
    fulltext="miljøvern",
    from_year="1970",
    to_year="2023",
    limit=10000000,
)

joncto commented 1 month ago

@Hegghammer, I am sorry that you experienced trouble. The cause is not a hard maximum on the number of hits, but rather the sheer volume of the underlying databases that hold the data for the "digavis" and "digibok" doctypes. We are looking into improving the backend, but for the moment a good strategy is to build several corpora, each restricted to a shorter time period.

I tested this per decade. For example:

corpus_70s = dh.Corpus(doctype="digavis", fulltext="miljøvern", from_year=1970, to_year=1979, limit=100000)
corpus_70s

returns a dataframe of 51303 rows, while

corpus_80s = dh.Corpus(doctype="digavis", fulltext="miljøvern", from_year=1980, to_year=1989, limit=100000)
corpus_80s

returns a dataframe of 35253 rows.

(Be aware that the year values in your code example are entered as strings, i.e. within quotation marks. Since Python is dynamically typed, this may happen to work, but the year parameters are defined as integers and should be passed without quotation marks. The expected data types of the different parameters are documented here: https://dhlab.readthedocs.io/en/stable/apidocs/dhlab/dhlab.text.corpus.html )
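
If you want the whole 1970–2023 period in a single dataframe, something along the lines of the sketch below should work. It is only a sketch: it assumes that a Corpus object exposes its result table as a pandas DataFrame through its .frame attribute.

import pandas as pd
import dhlab as dh

# Build one corpus per decade so each query stays within what the
# backend handles comfortably, then combine the per-decade tables.
frames = []
for start in range(1970, 2024, 10):
    corpus = dh.Corpus(
        doctype="digavis",
        fulltext="miljøvern",
        from_year=start,
        to_year=min(start + 9, 2023),
        limit=100000,
    )
    frames.append(corpus.frame)  # assumed: .frame holds the result DataFrame

combined = pd.concat(frames, ignore_index=True)
print(len(combined))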

Hegghammer commented 1 month ago

Thanks for the quick reply. Segmentation should work in principle, but for some strange reason I now get the error on all searches, including the segmented ones suggested above.

I'm confused. Are there rate limits or something else I should be aware of?

Hegghammer commented 1 month ago

Update: today the API works mostly fine for me. I was even able to run the command in the opening post and get back a df with >200k rows in about 15 seconds. It's not clear to me what's going on. Either 1) you fixed it, 2) something was up with my system yesterday, or 3) the service is unstable.
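
In the meantime I'm working around the flakiness with a small retry wrapper. To be clear, corpus_with_retry is just a helper I wrote myself, not part of dhlab, and it assumes the failures surface as a json.JSONDecodeError, which is what I have been seeing:

import time
from json import JSONDecodeError

import dhlab as dh

def corpus_with_retry(retries=3, wait=10, **kwargs):
    # Retry the corpus build a few times, on the theory that the
    # JSONDecodeError is transient backend trouble rather than a bug.
    for attempt in range(retries):
        try:
            return dh.Corpus(**kwargs)
        except JSONDecodeError:
            if attempt == retries - 1:
                raise
            time.sleep(wait)

corpus = corpus_with_retry(
    doctype="digavis", fulltext="miljøvern",
    from_year=1970, to_year=2023, limit=100000,
)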

My config for what it's worth: