Open Context collection file list .csv

elopatin-uc3 commented 3 years ago

We recently met with Eric and Sarah Kansa of Open Context, with regard to making the OC collection private in Merritt. They have already begun programmatically preferring Internet Archive for display of thumbnails on their website (single pages typically show ~10-20 images, while search results show 100 images). Because making their collection private will break thumbnail display if images are pulled from Merritt, they intend to transition to preferring IA for all objects and datasets. For this reason, Eric would like to make sure that everything they have in Merritt is also available at IA. One or more .csv files are needed for OC to review and confirm parity between Merritt and IA.

Tasks

[x] Generate a list of objects in the Open Context collection along with object level ERC metadata and export to .csv (975,079 objects; .csv is 228MB)
[x] Generate one or more .csv files which list each collection file (Producer files only).
Columns needed: inv_objects.ark, inv_objects.erc_who, inv_files.pathname, inv_files.full_size, inv_files.digest_value
Order by: inv_objects.ark, inv_files.pathname

terrywbrady commented 3 years ago

@elopatin-uc3 , the following will let you paginate through the set of arks (in order).

Change the offset from 0 to 1000 or 10000 to move through ranges.

This will give you a min and max ark. With that, the following query can be run efficiently.

If your object csv is already sorted by ark, you will not need to perform this step.

select
  o.ark
from 
  inv.inv_objects o
where
  o.inv_owner_id = (
    select 
      id
    from
      inv.inv_owners
    where 
      name='Open Context'
  )
order by
  o.ark
limit 1
offset 0;

Pull file info for a range of known arks.

select
  o.ark, 
  o.erc_who, 
  concat(
    'https://merritt.cdlib.org/d/',
    replace(replace(ark,':','%3A'),'/','%2F'),
    '/0/',
    substring(f.pathname,10)
  ), 
  f.full_size, 
  f.digest_value
from 
  inv.inv_objects o
inner join inv.inv_files f
  on f.inv_object_id=o.id
where
  o.inv_owner_id = (
    select 
      id
    from
      inv.inv_owners
    where 
      name='Open Context'
  )
and
  f.pathname like 'producer/%'
and 
  o.ark >= 'ark:/28722/k2000020h'
and
  o.ark <= 'ark:/28722/k20000c9w'
order by
  o.ark, f.pathname
;

Let me know if you would like a hand in scripting this up.

elopatin-uc3 commented 3 years ago

Thanks @terrywbrady. I've sent sample file and object lists in the form of csv's, asking them to comment on the scope of information/metadata these contain. Given the potential size of these files, I want to confirm whether or not we're gathering the appropriate amount of info for them.

elopatin-uc3 commented 3 years ago

@terrywbrady here is a sample of the URL format they are looking for, in order to cross-reference to their database: https://merritt.cdlib.org/d/ark%3A%2F28722%2Fk20000399/0/artiraq.org/static/opencontext/stoneware-media/full/photographs/UNE313.JPG

vs. the pathname we have stored: producer/artiraq.org/static/opencontext/stoneware-media/full/photographs/UNE313.JPG

Would removing the producer string and adding https://merritt.cdlib.org/d/ark... be possible during the query? I'm assuming they want the latest version of each file, but will confirm.

terrywbrady commented 3 years ago

@elopatin-uc3 , see the updated query above. This solution is not url encoding the file path, it is only url encoding the ark.

elopatin-uc3 commented 3 years ago

Thanks! Looks good @terrywbrady .

elopatin-uc3 commented 3 years ago

A series of 21 .csv files have been created. Each contains 50K ARKs, which translates to approximately 270K file entries per .csv. These have been posted on Box in a private folder to share with Eric and Sarah Kansa of Open Context.

CDLUC3 / mrt-doc

Open Context collection file list .csv #484

Tasks