arbeitsgruppe-digitale-altnordistik / Sammlung-Toole

A new look on Handrit.is data
https://arbeitsgruppe-digitale-altnordistik.github.io/Sammlung-Toole/
MIT License

Missing manuscripts? #47

Closed MaditaVerena closed 3 years ago

MaditaVerena commented 3 years ago

I tried to process a search result of 1380 manuscripts (manuscripts dated 1400-1600: https://handrit.is/is/search/results/FNw6MP ) for their contents, but I only got 540 manuscripts (before cleaning!). The same happened when processing the search result for its metadata: there I only got 304 manuscripts. Mysterious! No error message was displayed.

BalduinLandolt commented 3 years ago

on the server? or locally? (if so, which branch?)

MaditaVerena commented 3 years ago

Server! Sorry for being vague. :)

BalduinLandolt commented 3 years ago

no problem! How pressingly do you need it fixed? Just in terms of deciding my priorities...

MaditaVerena commented 3 years ago

Not super urgent. Maybe I can help fix it in a few weeks, because I suspect it might be my code that is not working properly.

BalduinLandolt commented 3 years ago

Feel free! :)
On the server we have the stable branch running. If you can track down the issue, that's all the better

kraus-s commented 3 years ago

As far as I can tell from the logs, it is getting all 25 shelfmarks per page and going over all 56 pages of results. I am assuming that there are a lot of duplicate hits (hits for each language are listed individually); at first glance I see double hits for almost every shelfmark. These are being "flattened" already, before you clean the data manually. So basically... it's... a feature? :D Should we include a small infobox with the result that makes this transparent? It could contain the total number of hits, the number of hits per language, and the number of hits dropped for that reason.
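A minimal sketch of the "flattening" described above, plus the counts such an infobox could show. All names and the sample data here are illustrative, not the project's actual API:

```python
# Hypothetical sketch: the same shelfmark can appear once per language in the
# handrit.is results, so duplicates are dropped ("flattened") before cleaning.
from collections import Counter

# Illustrative sample of (shelfmark, language) hits, not real search output.
hits = [
    ("AM 152 fol.", "is"),
    ("AM 152 fol.", "en"),
    ("AM 162 A fol.", "is"),
    ("AM 162 A fol.", "da"),
    ("GKS 1005 fol.", "is"),
]

total = len(hits)
per_language = Counter(lang for _, lang in hits)
unique_shelfmarks = {mark for mark, _ in hits}
dropped = total - len(unique_shelfmarks)

print(f"Total hits: {total}")                          # 5
print(f"Hits per language: {dict(per_language)}")      # {'is': 3, 'en': 1, 'da': 1}
print(f"Unique shelfmarks: {len(unique_shelfmarks)}")  # 3
print(f"Dropped as duplicates: {dropped}")             # 2
```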

kraus-s commented 3 years ago

Hmmmm, this kept bugging me, so I investigated further. Turns out my previous hunch was completely wrong and I misread the logs. Long story short: the issue is with handrit. As you can see in the screenshot below, our function only "reads" the result sub-pages visible in the pager at the top of the results page; all result pages in between are omitted. To get those, we would have to call `get_serach_result_pages()` again every x result pages.

[screenshot]

I can try and make a fix for this and see if we can backport it to stable.
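A minimal sketch of the fix idea under discussion, with the pager simulated locally (function names and the window size are assumptions, not the project's actual code): since the pager only shows a window of page links, reading it once misses the pages in between; re-reading the pager from each visited page keeps discovering new pages until none are left.

```python
# Hypothetical sketch: the handrit.is pager only shows a few page links at a
# time, so the pager must be re-read on every visited page instead of once.

def visible_pager_links(current_page: int, last_page: int, window: int = 5) -> list[int]:
    """Simulate a pager that only shows `window` links around the current page."""
    start = max(1, current_page - window // 2)
    end = min(last_page, start + window - 1)
    return list(range(start, end + 1))

def collect_all_pages(last_page: int) -> list[int]:
    """Collect every result page by re-reading the pager on each visited page."""
    seen: set[int] = set()
    queue = visible_pager_links(1, last_page)
    while queue:
        page = queue.pop(0)
        if page in seen:
            continue
        seen.add(page)
        # Re-read the pager from this page to discover links outside the
        # initial window -- the step the original code skipped.
        queue.extend(p for p in visible_pager_links(page, last_page) if p not in seen)
    return sorted(seen)

print(len(collect_all_pages(56)))  # 56 -- all result pages found
```

Reading the pager only once would yield just the first window of pages, which matches the truncated counts reported above.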

kraus-s commented 3 years ago

[screenshot]

Will roll out a fix to stable and the server in a day or two.