Open bmschmidt opened 8 years ago
That's the clean id, which is part of the PairTree structure HathiTrust uses. If we have something labelled 'filename', the clean id is correct.
I'm in favour of using the htid as often as possible and keeping the clean id behind the scenes. In Bookworm, we could store both filename and htid, emphasizing the latter.
I think, to keep things simplest, the files hitting Bookworm should never even know of the clean id; it's easily derived from htid, and I haven't yet seen a use case for it. It's true `filename` is a required key in bookworm, but we shouldn't use `cleanid` for it: `bookworm.filename` (as opposed to `filename` in a hathi context) is just a synonym for 'unique document id.' And that's better served through htid.
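
To illustrate how cheaply the clean id can be derived from the htid, here is a minimal sketch. The function name `htid_to_clean_id` is hypothetical. The `:` to `+` and `/` to `=` substitutions come from the example in this thread; the `.` to `,` substitution inside the local part is an assumption taken from the Pairtree convention, not something shown here.

```python
# Assumed character mapping for the local part of the id.
# ":" -> "+" and "/" -> "=" match the example in this thread;
# "." -> "," is an assumption based on the Pairtree convention.
_HTID_TO_CLEAN = str.maketrans({":": "+", "/": "=", ".": ","})


def htid_to_clean_id(htid: str) -> str:
    """Derive a filesystem-safe clean id from a HathiTrust volume id."""
    # The library prefix (everything before the first dot) is left untouched.
    library, sep, local_id = htid.partition(".")
    return library + sep + local_id.translate(_HTID_TO_CLEAN)


print(htid_to_clean_id("psia.ark:/13960/t5z623168"))
# prints psia.ark+=13960=t5z623168
```

Since the clean id is a pure function of the htid, anything that needs a file path can compute it on demand rather than carrying it through Bookworm.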
For some reason the 'filename' elements in the Bookworm use a 'filename' that replaces the colons and slashes in the HathiTrust id. (E.g., `psia.ark:/13960/t5z623168` becomes `psia.ark+=13960=t5z623168`.) I assume this has something to do with certain ids not working as file paths on some operating systems. But can it be corrected before the Bookworm receives the filenames? It creates a number of problems all through the pipeline whenever we interface with Hathi resources, and it seems to me it would be much better if Bookworm just received the canonical Hathi id.
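
One way the correction could happen before ingest is to map each clean filename back to its htid. The sketch below is only an illustration: `clean_id_to_htid` and `with_canonical_id` are hypothetical names, the record is assumed to be a plain dict with a `filename` key, and the reverse mapping (including `,` back to `.`) is assumed from the Pairtree convention. It also assumes the local part of the id contains no literal `+`, `=`, or `,`.

```python
# Reverse of the assumed clean-id mapping: "+" -> ":", "=" -> "/", "," -> "."
_CLEAN_TO_HTID = str.maketrans({"+": ":", "=": "/", ",": "."})


def clean_id_to_htid(clean_id: str) -> str:
    """Restore the canonical HathiTrust id from a clean id."""
    # Only the part after the library prefix was altered, so only that
    # part is translated back.
    library, sep, local_id = clean_id.partition(".")
    return library + sep + local_id.translate(_CLEAN_TO_HTID)


def with_canonical_id(record: dict) -> dict:
    """Return a copy of a metadata record whose 'filename' holds the htid."""
    fixed = dict(record)
    fixed["filename"] = clean_id_to_htid(record["filename"])
    return fixed


print(with_canonical_id({"filename": "psia.ark+=13960=t5z623168"}))
# prints {'filename': 'psia.ark:/13960/t5z623168'}
```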