emory-libraries / dlp-selfdeposit


Import Script Bug: Records containing binaries with no allowed mime types are being treated as containing binaries. #481

Closed bwatson78 closed 1 week ago

bwatson78 commented 2 weeks ago

For PID rrkmh, the only binaries the record contains are application/octet-stream, which isn't among our allowed mime types (see below). We must test whether any of the detected binaries has an allowed mime type. If none do, we should not import the record's files, and we should report that record as containing no binaries.

# Maps each allowed mime type to its file extension.
# Note: the 'string': value syntax produces Symbol keys in Ruby.
::ALLOWED_TYPES = {
  'application/pdf': 'pdf',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document': 'docx',
  'application/msword': 'doc',
  'application/vnd.openxmlformats-officedocument.presentationml.presentation': 'pptx',
  'application/vnd.ms-powerpoint': 'ppt',
  'image/jpeg': 'jpeg',
  'image/tiff': 'tiff',
  'image/png': 'png',
  'application/vnd.ms-excel': 'xls',
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': 'xlsx'
}.freeze
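
The check described above could be sketched roughly as follows. This is only an illustration, not the code from the import script: the hash is abbreviated, and the method names `allowed_binaries` and `importable_binaries?`, as well as the shape of each binary (a hash with `:mime_type` and `:label` keys), are assumptions made for the example.

```ruby
# Hypothetical sketch. ALLOWED_TYPES is abbreviated here, and the binary
# hashes (:mime_type, :label) are an assumed shape, not the importer's real data.
ALLOWED_TYPES = {
  'application/pdf': 'pdf',
  'image/png': 'png'
}.freeze

# The 'string': value syntax yields Symbol keys, so convert before the lookup.
def allowed_binaries(binaries)
  binaries.select { |binary| ALLOWED_TYPES.key?(binary[:mime_type].to_s.to_sym) }
end

# A record only counts as "containing binaries" if at least one is allowed.
def importable_binaries?(binaries)
  allowed_binaries(binaries).any?
end

octet_only = [{ mime_type: 'application/octet-stream', label: 'content' }]
mixed = octet_only + [{ mime_type: 'application/pdf', label: 'thesis.pdf' }]

puts importable_binaries?(octet_only) # record would be reported as having no binaries
puts importable_binaries?(mixed)      # only the PDF would be imported
```

Filtering first and then testing `any?` keeps the "which files to import" and "does this record have binaries" decisions based on the same list, so the two cannot disagree.
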
bwatson78 commented 2 weeks ago

@eporter23 In the case of PIDs rhskh and rhsbj, each record has two binaries: one that contains the true filename in its LABEL value, and another that doesn't and also lacks an allowed mime type (i.e., application/octet-stream). When a record has only one binary with an allowed mime type, should we automatically cast that file as content.<ext> and ignore the octet stream?
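
If the answer is yes, the rule proposed above might look something like the minimal sketch below. The method name `content_filename` and the binary hash shape (`:mime_type`, `:label`) are hypothetical, and the hash is abbreviated; this is not taken from the import script.

```ruby
# Hypothetical sketch of the proposed rule, with an abbreviated ALLOWED_TYPES.
ALLOWED_TYPES = {
  'application/pdf': 'pdf',
  'image/png': 'png'
}.freeze

# Returns "content.<ext>" when exactly one binary has an allowed mime type.
# Returns nil when none or several do, leaving those cases to other logic.
def content_filename(binaries)
  allowed = binaries.select { |b| ALLOWED_TYPES.key?(b[:mime_type].to_s.to_sym) }
  return nil unless allowed.size == 1

  "content.#{ALLOWED_TYPES[allowed.first[:mime_type].to_sym]}"
end

record = [
  { mime_type: 'application/pdf', label: 'thesis.pdf' },
  { mime_type: 'application/octet-stream', label: 'content' }
]
puts content_filename(record)
```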

bwatson78 commented 2 weeks ago

PR made: https://github.com/emory-libraries/dlp-selfdeposit/pull/482