alleonhardt / clas-digital

C++ implementation of an ancient literature database
https://www.clas-digital.uni-frankfurt.de

Collection: clarify/check/import/obtain remaining BSB scans #153

Open sladen opened 4 years ago

sladen commented 4 years ago

Most books in the curated catalogue have entries for:

{
  "libraryCatalog":"https://opacplus.bsb-muenchen.de/metaopac/singleHit.do?…",
  "archiveLocation":"https://opacplus.bsb-muenchen.de/search?id=…",
}

Only half of these appear in the database in fully-scanned/searchable form. Many entries instead have tags applied, in the form of:

[
  {"tag":"BSBUngeprueft"},
  {"tag":"NichtInBSB"},
  {"tag":"GibtEsBeiBSB"},
  {"tag":"Problemfall"},
  {"tag":"BSBDownloadFertig"},
  {"tag":"anders"}
]

Ideally, this should be clarified. It needs a proper analysis that reads the JSON, versus a quick grep.
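A minimal sketch of what such a JSON-based analysis could look like in Python, assuming each catalogue entry is a JSON object with a `tags` array in the shape shown above. The `count_tags` helper, the `key` field and the sample entries are hypothetical and only illustrate the idea; the real metadata layout may differ.

```python
import json
from collections import Counter

# The six tags listed above.
BSB_TAGS = {
    "BSBUngeprueft", "NichtInBSB", "GibtEsBeiBSB",
    "Problemfall", "BSBDownloadFertig", "anders",
}

def count_tags(entries):
    """Count how many entries carry each BSB-related tag.

    Assumes every entry is a dict with an optional "tags" list of
    {"tag": "..."} objects (hypothetical layout).
    """
    counts = Counter()
    for entry in entries:
        names = {t.get("tag") for t in entry.get("tags", [])}
        matched = names & BSB_TAGS
        for name in matched:
            counts[name] += 1
        if not matched:
            counts["(no BSB tag)"] += 1
    return counts

# Made-up sample data, parsed as JSON rather than grepped.
sample = json.loads("""
[
  {"key": "AAAA1111", "tags": [{"tag": "GibtEsBeiBSB"}]},
  {"key": "BBBB2222", "tags": [{"tag": "BSBDownloadFertig"}, {"tag": "anders"}]},
  {"key": "CCCC3333", "tags": []}
]
""")
counts = count_tags(sample)
```

Unlike a grep, this counts each entry once per tag and also surfaces entries that carry none of the tags.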

ekoehring commented 4 years ago

According to the instructions provided by Stefan (Wü), Hiwis were asked, after downloading from BSB, to remove "GibtEsBeiBSB" and add "BSBDownloadFertig".

So books tagged "BSBDownloadFertig" should already have OCR, while books tagged "GibtEsBeiBSB" should not.
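For illustration, the retagging step could be expressed roughly as follows. This is only a sketch: `mark_downloaded` is a hypothetical helper, and it assumes the tag list has the `{"tag": "..."}` shape quoted earlier in this issue.

```python
def mark_downloaded(tags):
    """Return a new tag list with "GibtEsBeiBSB" removed and
    "BSBDownloadFertig" present exactly once (hypothetical helper)."""
    kept = [t for t in tags if t.get("tag") != "GibtEsBeiBSB"]
    if not any(t.get("tag") == "BSBDownloadFertig" for t in kept):
        kept.append({"tag": "BSBDownloadFertig"})
    return kept

# Made-up example entry tags.
before = [{"tag": "GibtEsBeiBSB"}, {"tag": "anders"}]
after = mark_downloaded(before)
```

The input list is left untouched, and applying the function twice gives the same result as applying it once.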

georgbuechner commented 4 years ago

So, I did a small check on all of this. I checked the following cases and got the following feedback:

I hope this is the information you two requested. The first part, "inBSB_noOCR", is from a previous issue. I will upload the files here. They are also created automatically on every server start and can be found at /etc/clas-digital-devel/[filename].txt (filename is the first element in the list above, e.g. "inBSB_noOCR").

georgbuechner commented 4 years ago

BSBDownLoadFertig_noOCR.txt -> 0 books
gibtEsBeiBSB_noOCR.txt -> about 145 books
gibtEsBeiBSB_OCR.txt -> about 3 books
inBSB_noOcr.txt -> about 315

I think overall this is all expected behavior, apart from the three books in the category "gibtEsBeiBSB_OCR"; for these the tag probably needs to be changed, since they already have OCR.

All files give the Zotero key up front for easy double-checking, as well as author, title and year.
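As a rough sketch of how the three tag-based lists could be derived: the code below assumes a boolean `has_ocr` field and flat `author`/`title`/`year` fields, none of which are confirmed by this thread, and the line layout in `report_line` is likewise an assumption. The criterion for "inBSB_noOCR" is not stated here, so that list is left out.

```python
def bucket_for(entry):
    """Return the report name an entry belongs to, or None.

    "has_ocr" is a hypothetical field standing in for however the
    server actually knows whether scans/OCR exist for a book.
    """
    tags = {t.get("tag") for t in entry.get("tags", [])}
    has_ocr = bool(entry.get("has_ocr"))
    if "BSBDownloadFertig" in tags and not has_ocr:
        return "BSBDownLoadFertig_noOCR"
    if "GibtEsBeiBSB" in tags:
        return "gibtEsBeiBSB_OCR" if has_ocr else "gibtEsBeiBSB_noOCR"
    return None

def report_line(entry):
    """Format one line: Zotero key first, then author, title, year."""
    return "{}: {}, {}, {}".format(
        entry["key"],
        entry.get("author", "?"),
        entry.get("title", "?"),
        entry.get("year", "?"),
    )

# Made-up entries for illustration only.
books = [
    {"key": "AAAA1111", "author": "Example Author", "title": "Example Title",
     "year": 1600, "has_ocr": False, "tags": [{"tag": "GibtEsBeiBSB"}]},
    {"key": "BBBB2222", "author": "Other Author", "title": "Other Title",
     "year": 1700, "has_ocr": True, "tags": [{"tag": "GibtEsBeiBSB"}]},
    {"key": "CCCC3333", "author": "Third Author", "title": "Third Title",
     "year": 1800, "has_ocr": True, "tags": [{"tag": "BSBDownloadFertig"}]},
]

reports = {}
for book in books:
    name = bucket_for(book)
    if name is not None:
        reports.setdefault(name, []).append(report_line(book))
```

A book tagged "BSBDownloadFertig" that does have OCR is the expected state and therefore lands in no list.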

ekoehring commented 4 years ago

Took care of gibtEsBeiBSB_OCR.txt. Please leave this issue open, as PaulM and I will be working on gibtEsBeiBSB_noOCR.txt.

ekoehring commented 4 years ago

Can I please have current versions of

georgbuechner commented 4 years ago

This is what we have right now: gibtEsBeiBSB_noOCR.txt inBSB_noOcr.txt

ekoehring commented 4 years ago

Thank you, but these are not correct/up to date. I just randomly checked three cases from gibtEsBeiBSB_noOCR.txt which I remember having uploaded already, and indeed, they have scans on the server! (UR2BG6KI, JUUSJMEX, QKUVL5X8). There were about 145 cases, now there are about 175... We did add more entries, but also uploaded scans for them.

georgbuechner commented 4 years ago

Sorry, I gave you the data from the development server, and that data is of course wrong. Here is the new data: gibtEsBeiBSB_noOCR.txt inBSB_noOcr.txt

ekoehring commented 4 years ago

We would like a list of all books from the collection "Geschichte des Tierwissens" that have no scan/OCR on the server. Information and format as above.

georgbuechner commented 4 years ago

Give us a few days for this. It's not hard to implement, but I guess Alex and I will provide a small patch for clas-digital and I will integrate this into the patch. I guess maybe I can provide the list by Thursday morning. If you need it earlier, please let me know.

ekoehring commented 4 years ago

Thursday morning is ok!

georgbuechner commented 4 years ago

Why not just get the list through the catalogue?

georgbuechner commented 4 years ago

Sorry, this is not possible yet, see #255.

ekoehring commented 4 years ago

Ok, then we will work with the two lists provided above, that should give you until next week before we "run dry" again.

ekoehring commented 4 years ago

As #255 has been closed - is it possible now to get the list?

georgbuechner commented 4 years ago

Oh, I think there was a misunderstanding. I thought it would be enough to look in the catalogue, where you can immediately see all books without OCR. But if this is not what you need, then I will create the list today and send it to you right away.

georgbuechner commented 4 years ago

Because the catalogue is exactly that list: https://www.clas-digital.uni-frankfurt.de:9991/catalogue/collections/RFWJC42V/

ekoehring commented 4 years ago

Yes, I know, but I need a list where I can also make notes and save it etc. (from the production server, please)

These lists help the two content Hiwis and me with data acquisition. For example, when you order something from the BSB, you can note in the list which BSB ID the order has, so that a week later, when the BSB delivers, you can match the order to a Zotero ID; you can also note when problems occur somewhere, etc.

The lists we had before (gibtEsBeiBSB_noOCR.txt) had exactly the right format. As the next step we now just need a list of the books in the collection without OCR, regardless of tags etc., because we have worked through the first lists. So this is not about features for us, but about assistance: of course I can copy-paste the list into a document, delete all lines with OCR, etc., but I thought there might be another, faster way.
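One possible shape for such a worklist, sketched in Python: every book without OCR becomes one row, regardless of tags, with empty columns for the BSB order ID and notes. The `has_ocr` field, the column names and the CSV format are assumptions made for illustration; the lists actually generated by the server are plain .txt files.

```python
import csv
import io

def write_worklist(entries, out):
    """Write all entries without OCR as CSV rows and return the row count.

    Tags are ignored on purpose. "has_ocr" is a hypothetical field.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(
        ["zotero_key", "author", "title", "year", "bsb_order_id", "notes"]
    )
    rows = 0
    for entry in entries:
        if entry.get("has_ocr"):
            continue
        writer.writerow([
            entry["key"],
            entry.get("author", ""),
            entry.get("title", ""),
            entry.get("year", ""),
            "",  # BSB order ID, filled in by hand when ordering
            "",  # free-form notes
        ])
        rows += 1
    return rows

# Made-up collection contents for illustration only.
collection = [
    {"key": "AAAA1111", "author": "Example Author",
     "title": "Example Title", "year": 1600, "has_ocr": False},
    {"key": "BBBB2222", "author": "Other Author",
     "title": "Other Title", "year": 1700, "has_ocr": True},
]
buffer = io.StringIO()
written = write_worklist(collection, buffer)
lines = buffer.getvalue().splitlines()
```

Writing to a file object instead of `io.StringIO` would produce a file that opens directly in a spreadsheet, where the empty columns can be filled in.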

georgbuechner commented 4 years ago

Sure, you will probably get the list this evening!

georgbuechner commented 4 years ago

Functionality is now implemented.

georgbuechner commented 4 years ago

Sorry for taking so long. Here we finally are.

tierwissen_noOCR.txt