buda-base / public-digital-library

http://library.bdrc.io

new endpoint for chunks #911

Open eroux opened 3 weeks ago

eroux commented 3 weeks ago

Since chunks are not in Fuseki anymore, I replaced the Chunks SPARQL request with a request on OpenSearch that returns the same type of results but in a different format.

Here's an example:

https://ldspdi-dev.bdrc.io/osearch/etextchunks?id=bdr:UTIE0OPIF851CE56_I1CZ2444&cstart=1000&cend=1050

Note that it works for both etexts and volumes (using the same id argument).

I'll also change the download endpoint to use OpenSearch, but for this one there shouldn't be any change necessary in the client.
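
For reference, the request above can be sketched as follows. This is only an illustration: `etextchunks_url` is a hypothetical helper, not part of the client or of ldspdi, and the parameter names (`id`, `cstart`, `cend`) and the dev host are simply copied from the example URL.

```python
from urllib.parse import urlencode

# Base URL copied from the example above (dev host; assumed, may change).
ETEXTCHUNKS_URL = "https://ldspdi-dev.bdrc.io/osearch/etextchunks"


def etextchunks_url(resource_id: str, cstart: int, cend: int) -> str:
    """Build an etextchunks request URL for a character range.

    Hypothetical helper: resource_id is an etext or volume id
    (the same `id` argument is used for both).
    """
    if cstart < 0 or cend < cstart:
        raise ValueError("expected 0 <= cstart <= cend")
    # urlencode percent-encodes the colon in the id (bdr: becomes bdr%3A).
    query = urlencode({"id": resource_id, "cstart": cstart, "cend": cend})
    return f"{ETEXTCHUNKS_URL}?{query}"


if __name__ == "__main__":
    print(etextchunks_url("bdr:UTIE0OPIF851CE56_I1CZ2444", 1000, 1050))
```

The colon in the id comes out percent-encoded, which is the form used in the other etextchunks URLs in this thread.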

berger-n commented 2 weeks ago

done: https://library-dev.bdrc.io/show/bdr:IE0OPIC2BFA6FE


berger-n commented 2 weeks ago

struggling a bit to understand what's happening here: https://library-dev.bdrc.io/show/bdr:IE0OPI11B27745


note that the UI clearly needs some refinements (hidden sub items when loading page with an open text, items not closing, etc.)

[edit: oh I see the query must use the sliceBegin/EndChar parameters, for example this works and this doesn't ]


berger-n commented 2 weeks ago

not sure where to report that but here's a case where something looks odd with the new etextrefs: https://ldspdi-dev.bdrc.io/query/graph/etextrefs?R_RES=bdr%3AIE0OPI1C1BBFCB (same volume/prefLabel/start/end but seqNum different)



same here: https://ldspdi-dev.bdrc.io/query/graph/etextrefs?R_RES=bdr%3AIE0OPI11B27745 (BTW it seems the volume 78 above is just taken out from this one?)



you can see both of them from here: https://library-dev.bdrc.io/show/bdr:MW3CN3408



apart from that everything looks fine to me on these:

so @JannTibetan please feel free to let me know if you spot anything odd navigating through these etexts or others!


one final comment for @eroux: I seem to have spotted a case where etextchunks doesn't return a whole page, only the second chunk it overlaps with: https://ldspdi-dev.bdrc.io/osearch/etextchunks?cstart=4889&cend=14889&id=bdr%3AVLIE0OPI11B27745_I3CN4393

see https://library-dev.bdrc.io/show/bdr:IE0OPI11B27745?startChar=470&&openEtext=bdr:VLIE0OPI11B27745_I3CN4393#open-viewer:


JannTibetan commented 2 weeks ago

https://github.com/user-attachments/assets/6257701e-9d78-4b15-9d32-8cd4b8983032

The search box isn't working for me

JannTibetan commented 2 weeks ago

This particular etext is displaying incorrectly: https://library-dev.bdrc.io/show/bdr:IE1PD105899?openEtext=bdr:UT1PD105899_011_0000#open-viewer

berger-n commented 1 week ago

thanks! outline search in https://library-dev.bdrc.io/show/bdr:MW3CN3408 should be fine now:


no clue regarding the encoding issue in https://library-dev.bdrc.io/show/bdr:IE1PD105899?openEtext=bdr:UT1PD105899_011_0000#open-viewer ... (@eroux wdyt?)

eroux commented 1 week ago

I think it's just what the output of the OCR is, but I'll check tomorrow. Is it the same on library.bdrc.io?


berger-n commented 1 week ago

ok thanks! I don't have access to https://library.bdrc.io/show/bdr:UT1PD105899_011_0000

eroux commented 1 week ago

I confirm it's an issue with the etext itself, not a bug in the frontend. I'll look at the other issues now

eroux commented 1 week ago

looking at

seem to have spotted a case where etextchunks doesn't return a whole page, only the second chunk it overlaps with: https://ldspdi-dev.bdrc.io/osearch/etextchunks?cstart=4889&cend=14889&id=bdr%3AVLIE0OPI11B27745_I3CN4393

I'm not quite sure what you mean? it's a case where the chunks for this character range are in two different etext documents but they all seem to be there, at least at first glance... @berger-n can you tell me more?

eroux commented 1 week ago

about

not sure where to report that but here's a case where something looks odd with the new etextrefs: https://ldspdi-dev.bdrc.io/query/graph/etextrefs?R_RES=bdr%3AIE0OPI1C1BBFCB (same volume/prefLabel/start/end but seqNum different)

this is a bug in the outline. It's annoying but nothing to change in the frontend I would say

berger-n commented 1 week ago

sure @eroux! first here's the working case, before the bug occurs: https://library-dev.bdrc.io/show/bdr:IE0OPI11B27745?startChar=470&openEtext=bdr:VLIE0OPI11B27745_I3CN4393#open-viewer


then click on Last page: the query results begin at the last half of the page: https://library-dev.bdrc.io/show/bdr:IE0OPI11B27745?startChar=4889&&openEtext=bdr:VLIE0OPI11B27745_I3CN4393#open-viewer


it seems the right page is returned but only the second chunk is:


eroux commented 1 week ago

hmmm ok I'm not quite sure... I think there are different angles here.

First, about the query results: in the query that you make on https://ldspdi-dev.bdrc.io/osearch/etextchunks?cstart=4889&cend=14889&id=bdr%3AVLIE0OPI11B27745_I3CN4393 you ask the API for pages and chunks between character 4889 and 14889 in VLIE0OPI11B27745_I3CN4393. Looking at the results, they seem ok... if not, what is the page number of the missing page or the character range of the missing chunk?

But I think this is probably not the real issue. I think the problem is in how the frontend constructs the query for the last page, and I'm not quite sure how it does that, can you tell me more? My first intuition would be to look at the sliceEndChar of the etext or volume and ask the API for information about the character range (sliceEndChar-10000) to sliceEndChar. Is that what the JS code is doing?
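
A minimal sketch of that suggestion, to make the arithmetic concrete. It is illustrative only and not the frontend's actual code: `last_page_range` is a hypothetical name, the 10000-character window is the figure from the comment above, and clamping to the start of the slice is an added assumption for etexts shorter than the window.

```python
from typing import Tuple

# Window size taken from the suggestion above (sliceEndChar - 10000).
PAGE_WINDOW = 10000


def last_page_range(slice_begin_char: int, slice_end_char: int,
                    window: int = PAGE_WINDOW) -> Tuple[int, int]:
    """Return (cstart, cend) for the last window of an etext or volume.

    Hypothetical helper. The range ends at slice_end_char and starts
    `window` characters earlier, but never before slice_begin_char
    (the clamp is an assumption, not something stated in the thread).
    """
    if slice_end_char < slice_begin_char:
        raise ValueError("slice_end_char must be >= slice_begin_char")
    cstart = max(slice_begin_char, slice_end_char - window)
    return cstart, slice_end_char


if __name__ == "__main__":
    # Made-up slice bounds, for illustration only.
    print(last_page_range(0, 25000))
    print(last_page_range(0, 5000))
```

The resulting pair would then be passed as `cstart` and `cend` to the etextchunks endpoint.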

berger-n commented 1 week ago

thanks @eroux, indeed! it works now: https://library-dev.bdrc.io/show/bdr:IE0OPI11B27745?startChar=470&openEtext=bdr%3AVLIE0OPI11B27745_I3CN4393#open-viewer


eroux commented 1 week ago

this is great, thanks! I really love the new etext viewer!