buda-base / public-digital-library

http://library.bdrc.io

shift in the IA links #836

Open eroux opened 1 year ago

eroux commented 1 year ago

A user reported:

http://purl.bdrc.io/resource/MW20821 - the links to the Internet Archive seem to be off by one in the volume numbers.

I know the archive numbering scheme is a bit confusing (afaict, they tend to use no suffix for the first volume, -1 for the second, etc.), so the link for, say,

http://purl.bdrc.io/resource/MW20821_A912F4

that is given as

https://archive.org/details/bdrc-W20821/bdrc-W20821-24/page/n50/

seems to make sense, b/c it's in volume 25 (so one would expect -24 as the archive's suffix). However, the correct link is actually

https://archive.org/details/bdrc-W20821/bdrc-W20821-25/page/n50/mode/2up

(that will be redirected to .../n49/ for the 2up spread with page 50 on the right).
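
To make the expected numbering concrete, here is a minimal sketch of the scheme as I understand it. It assumes IA uses no suffix for volume 1 and -(n-1) for volume n, and that the volume 1 URL simply repeats the bare identifier; `ia_volume_id` and `ia_page_url` are hypothetical helper names, not functions from our code base.

```python
def ia_volume_id(work_rid: str, volume_number: int) -> str:
    """Hypothetical helper: IA volume identifier under the ASSUMED scheme
    (no suffix for volume 1, -1 for volume 2, and so on)."""
    if volume_number < 1:
        raise ValueError("volume numbers start at 1")
    base = f"bdrc-{work_rid}"
    return base if volume_number == 1 else f"{base}-{volume_number - 1}"


def ia_page_url(work_rid: str, volume_number: int, page_index: int) -> str:
    """Hypothetical helper: build the details URL for a page of a volume."""
    volume_id = ia_volume_id(work_rid, volume_number)
    return f"https://archive.org/details/bdrc-{work_rid}/{volume_id}/page/n{page_index}/"


# Volume 25 of W20821 maps to the -24 suffix under this assumption,
# which is the link currently given (not the one that is actually correct).
print(ia_page_url("W20821", 25, 50))
# -> https://archive.org/details/bdrc-W20821/bdrc-W20821-24/page/n50/
```

So the link we generate is consistent with the assumed scheme; the skew would have to be in what IA actually holds under each suffix.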

I haven't debugged where the numbering got skewed...

Thanks!

berger-n commented 1 year ago

Not sure what's happening here... Looking back at https://github.com/buda-base/public-digital-library/issues/684, everything still looks fine with https://library.bdrc.io/show/bdr:MW1PD95844?part=bdr:MW1PD95844_0243 (volume 111, same on IA). Any idea?

berger-n commented 1 year ago

there seems to be something wrong with volumes on IA:

and so on, then volume 18 is missing:

then the last available volume is:

eroux commented 1 year ago

Hmmm, I have a guess as to what happened: I think that during upload to IA the volume RIDs were sorted in alphabetical order, but this is a case where the RID alphabetical order is not the volume order:
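
To illustrate the kind of skew I mean, here is a small sketch with made-up image group RIDs (these are NOT the real RIDs of W20821, just an assumption for the example). It compares each volume's number with the position its RID would get if the RIDs were sorted alphabetically.

```python
# Hypothetical volume number -> image group RID map (invented for illustration).
# Volume 4 has a RID that sorts before all the others.
volume_to_rid = {
    1: "I1KG100",
    2: "I1KG101",
    3: "I1KG102",
    4: "I1KG099",
    5: "I1KG103",
}


def positions_after_alphabetical_sort(mapping):
    """Return volume number -> 1-based position of its RID once the RIDs
    are sorted alphabetically instead of by volume number."""
    sorted_rids = sorted(mapping.values())
    return {vol: sorted_rids.index(rid) + 1 for vol, rid in mapping.items()}


positions = positions_after_alphabetical_sort(volume_to_rid)

# Keep only the volumes whose position no longer matches their volume number.
shifted = {vol: pos for vol, pos in positions.items() if vol != pos}
print(shifted)
# -> {1: 2, 2: 3, 3: 4, 4: 1}
```

A single out-of-order RID is enough to shift several volumes by one, which would look like the off-by-one reported above.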

It was uploaded a long time ago, so perhaps that was a bug in the upload script back then... wdyt @jimk-bdrc? If that's correct, after what date do uploads to IA no longer have that issue? We should probably re-upload the works with potential problems (probably not a lot).

jimk-bdrc commented 1 year ago

@eroux AO never ordered anything in an IA upload - our processes only upload a single archive file.

I first suspected that the marc-W20821.xml (created from https://purl.bdrc.io/resource by archive-ops/scripts/ia/create-archive-metadata.sh) would contain the Image group to volume sequence map, but it does not.

The only ordering indication I find is in Archive0/21/W20821/meta/W20821.xml and I'm not entirely sure what it means. That file was generated by the older https://legacy.tbrc.org/xmldoc?rid=W20821 which we don't use anymore in the IA process.

I can't tell how IA used it or if they used it at all. You'd have to chase that down with IA.

eroux commented 1 year ago

@jimk-bdrc is there any way I can access an example archive that is sent to IA? I wonder how the image group directories are named. Or maybe you could point us to the script that produces the zip file?

jimk-bdrc commented 1 year ago

The script archive-ops/scripts/ia/depositIa.sh just archives what's on the rackstations, along with some custom metadata. A 2023-08-11 change iteratively adds image groups to the output archive in the order they are found in GetIGRIDList.py (and changed by the Innnn hack), but that shouldn't matter to the IA extractor.

Also, this work was uploaded to IA (2022-07-07) long before the "add one directory at a time" change (2023-08-11), so the above comments don't apply to this issue.

Best way to run this is to get the archive-ops from git, and then

...archive-ops/scripts/ia/deployment/copyLinksToBin [~/bin]
rehash
cd testd

depositIa.sh -a $(pwd) $PATH_TO_WORK_YOU_WANT_TO_ARCHIVE

Note that PATH_TO_WORK_YOU_WANT_TO_ARCHIVE can be the repository itself - it's not touched. Here's a sample that shows it working on multiple names of image groups:

Here's a sample command and output: depositIa-sample-output.txt