ArchiveLabs / iiif.archivelab.org

Internet Archive IIIF Image 2.0 Server
GNU General Public License v3.0

Mapping fulltext to book images via annotations #29

Open mekarpeles opened 6 years ago

mekarpeles commented 6 years ago

For a public/unrestricted book (e.g. https://archive.org/details/TheGeometry), one can get the fulltext for each page (with word regions) via the following API:

https://api.archivelab.org/books/<identifier>/pages/<page#>/ocr?mode=words

e.g. https://api.archivelab.org/books/TheGeometry/pages/10/ocr?mode=words

One can also get the results grouped by paragraph instead by removing ?mode=words.
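
A minimal Python sketch of building and calling these URLs. The helper names `ocr_url` and `fetch_ocr` are hypothetical, and the assumption that the endpoint returns JSON is not confirmed in this thread, so treat the parsing step as a guess to verify against a real response.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_ROOT = "https://api.archivelab.org/books"


def ocr_url(identifier: str, page: int, words: bool = True) -> str:
    """Build the OCR URL for one page of a book (hypothetical helper)."""
    url = f"{API_ROOT}/{quote(identifier, safe='')}/pages/{page}/ocr"
    # With mode=words the API returns word regions; without it, paragraphs.
    return url + "?mode=words" if words else url


def fetch_ocr(identifier: str, page: int, words: bool = True, timeout: float = 30.0):
    """Fetch one page of OCR data, assuming the body is JSON (unverified)."""
    with urlopen(ocr_url(identifier, page, words), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `ocr_url("TheGeometry", 10)` reproduces the example link above, and `ocr_url("TheGeometry", 10, words=False)` gives the paragraph variant.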

cc: @num170r

jcmundy commented 5 years ago

Thank you for providing this! I see five numbers for each word when I follow your link. I am used to seeing x, y, w, h. What is the fifth number?

mekarpeles commented 5 years ago

Not sure! @rchrd2 ?

rchrd2 commented 5 years ago

Unfortunately, I don't know either. I haven't modified the search highlighting code. You may need to reverse engineer it a bit using a production book.

The code that processes the search results (using the archive.org API, not the archivelabs one) is here: https://github.com/internetarchive/bookreader/blob/master/BookReader/plugins/plugin.search.js#L206
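
One possible starting point for that reverse engineering, as a rough sketch only: collect the five-number tuples for one page and look at the range of each position, then compare those ranges with the page image's pixel dimensions. The function name `column_ranges` is hypothetical, and how the tuples are extracted from the API response is left out because the response layout is not documented in this thread.

```python
from typing import Iterable, List, Sequence, Tuple


def column_ranges(boxes: Iterable[Sequence[float]]) -> List[Tuple[float, float]]:
    """Return (min, max) for each position across equally long number tuples."""
    rows = [tuple(b) for b in boxes]
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all boxes must have the same number of values")
    # zip(*rows) transposes the rows so each tuple holds one position's values.
    return [(min(col), max(col)) for col in zip(*rows)]
```

A position whose maximum stays near the image width or height is probably a coordinate, while one that stays small or nearly constant is more likely something else. That only narrows the options; it does not settle what the fifth number means.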

amandelman commented 4 years ago

Does this issue also cover indexing the annotations to make them available in IIIF search?

mekarpeles commented 4 years ago

Nope -- we expose raw (e.g. OCR) data but don't map it via any search API. Feel free to extend the current service to achieve this.

We do / did have an experimental annotations service: https://pragma.archivelab.org/ https://github.com/archivelabs/pragma.archivelab.org

But I'm not sure if it's still working.

Here is a demo of when it worked: https://www.youtube.com/watch?v=FtcajyRQnqM

amandelman commented 4 years ago

Awesome. We'll add this to our backlog now that we have a little more clarity on the issue. Thank you!

hadro commented 1 year ago

Related to IIIF v3 rewrite underway and specifically #80