Bookworm-project / Hathitrust-Bookworm

A full text Bookworm on Public Domain Hathitrust works

Use actual IDs for next build (no = or + in htid) #2

Open bmschmidt opened 8 years ago

bmschmidt commented 8 years ago

For some reason, the 'filename' elements in the Bookworm use a value that replaces the colons and slashes in the HathiTrust id with other characters (e.g., psia.ark:/13960/t5z623168 becomes psia.ark+=13960=t5z623168). I assume this is because certain ids don't work as file paths on some operating systems. But can it be corrected before the Bookworm receives the filenames? It creates a number of problems throughout the pipeline whenever we interface with Hathi resources, and it seems to me it would be much better if Bookworm just received the canonical Hathi id.

organisciak commented 8 years ago

That's the clean id, which is part of the PairTree structure HathiTrust uses. If we have something labelled 'filename', the clean id is correct.

I'm in favour of using the htid as often as possible and keeping the clean id behind the scenes. In Bookworm, we could store both filename and htid, emphasizing the latter.

bmschmidt commented 8 years ago

I think, to keep things simplest, the files hitting Bookworm should never even know of the clean id; it's easily derived from the htid, and I haven't yet seen a use case for it. It's true that filename is a required key in Bookworm, but we shouldn't use the clean id for it: bookworm.filename (as opposed to filename in a Hathi context) is just a synonym for 'unique document id', and that's better served by the htid.
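For reference, here is a minimal sketch of how the clean id could be derived from the htid and back. The function names are hypothetical. The `:` to `+` and `/` to `=` substitutions come from the example above; the `.` to `,` substitution and the choice to leave the namespace prefix (everything before the first dot) untouched are assumptions based on the Pairtree cleaning convention. The sketch also skips Pairtree's hex-escaping of unusual characters, so it is not a full implementation.

```python
def htid_to_clean_id(htid: str) -> str:
    """Derive a filesystem-safe clean id from a HathiTrust id (sketch)."""
    # Assumption: the namespace is everything before the first dot
    # and is left unchanged.
    namespace, _, local = htid.partition(".")
    # ':' -> '+' and '/' -> '=' match the example in this thread;
    # '.' -> ',' is assumed from the Pairtree convention.
    cleaned = local.replace(":", "+").replace("/", "=").replace(".", ",")
    return f"{namespace}.{cleaned}"


def clean_id_to_htid(clean_id: str) -> str:
    """Invert htid_to_clean_id (sketch, same assumptions)."""
    namespace, _, local = clean_id.partition(".")
    restored = local.replace("+", ":").replace("=", "/").replace(",", ".")
    return f"{namespace}.{restored}"


if __name__ == "__main__":
    htid = "psia.ark:/13960/t5z623168"
    print(htid_to_clean_id(htid))
    print(clean_id_to_htid(htid_to_clean_id(htid)))
```

Under these assumptions the mapping is cheap in both directions, so anything that needs a file path can compute it on demand while Bookworm stores only the htid.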