Open eroux opened 1 year ago
Not sure what's happening here. Looking back at https://github.com/buda-base/public-digital-library/issues/684, everything still looks fine with https://library.bdrc.io/show/bdr:MW1PD95844?part=bdr:MW1PD95844_0243 (volume 111, same on IA). Any idea?
there seems to be something wrong with volumes on IA:
and so on, then volume 18 is missing:
then the last available volume is:
Hmmm, I have a guess as to what happened: I think that during upload to IA the volume RIDs were sorted in alphabetical order, and this is a case where the alphabetical order of the RIDs is not the volume order.
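To make the guess above concrete, here is a minimal Python sketch of how a plain string sort can reorder volumes. The RIDs are invented for illustration (they are not the image groups of W20821), and nothing here is taken from the actual upload code.

```python
# Hypothetical image-group RIDs listed in true volume order (volume 1 first).
# These identifiers are made up for the example.
rids_in_volume_order = ["I1KG9127", "I1KG9128", "I1KG10000", "I1KG10001"]

# A plain string sort compares character by character, so "I1KG10000"
# sorts before "I1KG9127" even though it belongs to a later volume.
rids_alphabetical = sorted(rids_in_volume_order)


def volume_of(rid):
    """Return the 1-based volume number of a RID in the true order."""
    return rids_in_volume_order.index(rid) + 1


# Volume numbers in the order they end up after the string sort.
volumes_after_sort = [volume_of(rid) for rid in rids_alphabetical]

print(rids_alphabetical)
print(volumes_after_sort)
```

With these invented RIDs, the sort places volumes 3 and 4 ahead of volumes 1 and 2, which is the kind of mismatch suspected here.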
It was uploaded a long time ago, so perhaps that was a bug in the upload script back then. What do you think, @jimk-bdrc? If that's correct, after what date do uploads to IA no longer have that issue? We should probably re-upload those with potential problems (probably not many).
@eroux AO never ordered anything in an IA upload - our processes only upload a single archive file.
I first suspected that the marc-W20821.xml (created from https://purl.bdrc.io/resource by archive-ops/scripts/ia/create-archive-metadata.sh) would contain the image group to volume sequence map, but it does not.
The only ordering indication I find is in Archive0/21/W20821/meta/W20821.xml, and I'm not entirely sure what it means. That file was generated by the older https://legacy.tbrc.org/xmldoc?rid=W20821, which we don't use anymore in the IA process.
I can't tell how IA used it or if they used it at all. You'd have to chase that down with IA.
@jimk-bdrc is there any way I can access an example archive that is sent to IA? I wonder how the image group directories are named. Or maybe you could point us to the script that produces the zip file?
The script archive-ops/scripts/ia/depositIa.sh just archives what's on the rackstations, along with some custom metadata. A 2023-08-11 change iteratively adds image groups to the output archive in the order they are found in GetIGRIDList.py (and changed by the Innnn hack), but that shouldn't matter to the IA extractor.
Also, this work was uploaded to IA (2022-07-07) long before the "add one directory at a time" change (2023-08-11), so the above comments don't apply to this issue.
Best way to run this is to get the archive-ops from git, and then
...archive-ops/scripts/ia/deployment/copyLinksToBin [~/bin]
rehash
cd testd
depositIa.sh -a $(pwd) $PATH_TO_WORK_YOU_WANT_TO_ARCHIVE
Note that PATH_TO_WORK_YOU_WANT_TO_ARCHIVE can be the repository; it's not touched. Here's a sample command and output that shows it working on multiple names of image groups: depositIa-sample-output.txt
A user reported:
http://purl.bdrc.io/resource/MW20821 - the links to the internet archive seems to have an off by one in the volume numbers.
I know the archive numbering scheme is a bit confusing (afaict, they tend to use no suffix for the first volume, -1 for the second, etc.), so the link for, say
http://purl.bdrc.io/resource/MW20821_A912F4
that is given as
https://archive.org/details/bdrc-W20821/bdrc-W20821-24/page/n50/
seems to make sense, b/c it's in volume 25 (so one would expect -24 as the archive's suffix). However, the correct link is actually
https://archive.org/details/bdrc-W20821/bdrc-W20821-25/page/n50/mode/2up
(that will be redirected to .../n49/ for the 2up spread with page 50 on the right).
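The suffix arithmetic described above can be sketched in a few lines of Python. This encodes only the reporter's understanding of the scheme (no suffix for volume 1, then -1, -2, and so on); the helper names are hypothetical and this is not IA's or BDRC's actual link-generation code.

```python
def expected_ia_suffix(volume_number):
    """Suffix under the scheme as described: none for volume 1, then -1, -2, ..."""
    if volume_number < 1:
        raise ValueError("volume numbers start at 1")
    return "" if volume_number == 1 else f"-{volume_number - 1}"


def expected_ia_item(ia_id, volume_number):
    """Build the sub-item name the scheme would predict for a volume."""
    return f"{ia_id}{expected_ia_suffix(volume_number)}"


# Volume 25 of bdrc-W20821, as in the report above.
expected = expected_ia_item("bdrc-W20821", 25)
observed = "bdrc-W20821-25"  # the item that actually holds the page

# Difference between the observed and the predicted numeric suffix.
offset = int(observed.rsplit("-", 1)[1]) - int(expected.rsplit("-", 1)[1])

print(expected, observed, offset)
```

Under that assumption the predicted item for volume 25 is bdrc-W20821-24, one less than the item that actually contains the page, which matches the reported off-by-one.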
I haven't debugged where the numbering got skewed...
Thanks!