archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: Archivematica defaults to application/octet-stream when individual files are requested but it would be good to return the mimetype of contents of AIPs/DIPs (ArchivesDirect) #460

Open kimpham54 opened 5 years ago

kimpham54 commented 5 years ago

Please describe the problem you'd like to be solved.

We have an instance of ArchivesDirect and are accessing the DIPs that are being stored in Duracloud via our digital collections website. Having discussed this issue with @sallain, "Archivematica is setting the MIME type as application/octet-stream in the API store request itself. No attempt is made to generate the MIME type for a particular file."

Describe the solution you'd like to see implemented.

Archivematica should pass on the MIME type of the content in AIPs and DIPs so that it is available in the Duracloud API response header.

Describe alternatives you've considered.

We are currently getting content from Duracloud and identifying MIME types using a tool built into Linux.
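For reference, a minimal sketch of what that workaround could look like in Python. It assumes the tool in question is something like the file utility, and the function name sniff_mime_type is hypothetical. It falls back to an extension-based guess when file is not installed.

```python
import mimetypes
import shutil
import subprocess


def sniff_mime_type(path):
    """Return a best-guess MIME type for the file at `path` (hypothetical helper)."""
    # Prefer content-based detection with the `file` utility when it is installed.
    if shutil.which("file"):
        result = subprocess.run(
            ["file", "--brief", "--mime-type", path],
            capture_output=True,
            text=True,
            check=False,
        )
        guess = result.stdout.strip()
        # A MIME type looks like "type/subtype" with no spaces; anything else
        # is treated as an error message and ignored.
        if result.returncode == 0 and "/" in guess and " " not in guess:
            return guess
    # Otherwise fall back to a guess based on the file extension.
    guess, _encoding = mimetypes.guess_type(path)
    return guess or "application/octet-stream"
```

Something like sniff_mime_type("/tmp/downloaded-object") would then be called on each file after it has been fetched from Duracloud.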

Additional context



kimpham54 commented 5 years ago

Let me know how I can help! If you want to set up the development environment, this is the right place to start: https://github.com/artefactual-labs/am/tree/master/compose. We also have a guide for contributors: https://github.com/artefactual/archivematica/blob/stable/1.8.x/CONTRIBUTING.md. My colleague Ross wrote this new walk-through that you may find useful: https://gist.github.com/ross-spencer/f7df0ad9e327555691485648002872e6.

The upload that you're looking at happens in Storage Service, specifically in duracloud.py, which implements a Space. When the AIP is compressed, I guess you'd want to identify the MIME type of the archive. IIRC, that would be easy since we know the compression algorithm in advance and it's stored. When the AIP is uncompressed, your best chance is to look up the MIME types of individual files inside the METS file of the AIP. My understanding is that you are interested in the latter use case. We use mets-reader-writer (https://github.com/artefactual-labs/mets-reader-writer) to read the METS; it makes it very easy to extract the metadata of the files, including everything related to PREMIS.
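To illustrate the compressed case, here is a rough sketch of mapping a stored compression algorithm to a MIME type. The labels in the table and the function name are hypothetical; the values Storage Service actually stores may well be spelled differently, so treat this as an outline rather than the real code.

```python
# Hypothetical compression labels mapped to commonly used MIME types.
# The labels Storage Service really stores may differ from these.
COMPRESSION_MIME_TYPES = {
    "7z": "application/x-7z-compressed",
    "tar": "application/x-tar",
    "tar.gz": "application/gzip",
    "tar.bz2": "application/x-bzip2",
}

DEFAULT_MIME_TYPE = "application/octet-stream"


def mime_type_for_package(compression=None):
    """Return a MIME type for a compressed package, or None if uncompressed."""
    if compression is None:
        # An uncompressed AIP is a directory, so there is no single MIME type;
        # its files would have to be looked up individually (e.g. via the METS).
        return None
    # Unknown algorithms keep today's default instead of raising an error.
    return COMPRESSION_MIME_TYPES.get(compression.lower(), DEFAULT_MIME_TYPE)
```

Falling back to application/octet-stream for unknown algorithms keeps the current behaviour for anything the table doesn't cover.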

ross-spencer commented 5 years ago

Hi @kimpham54 (cc @sallain and @sromkey). In case it's useful for reference, I've done a little work in this PR to teach the storage service how to use file, which helps identify AIPs without pointer files.

Specifically, file is used in these lines.

The new module is in the common set of modules belonging to the storage service so that might provide some guidance where you could add new capability.

The way we extract compression types from pointer files is also moved in this PR to the new module.

So that might be a good start for getting some of the information you require. You'd still need to get the information into the response header, but that might not be too much work.
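As a very rough sketch of that last step, assuming the Content-Type sent with the store request is what ends up in the response header later on: the helper below builds the header from a known MIME type if there is one, then an extension-based guess, then the current default. The function name upload_headers is hypothetical and this is not how duracloud.py is written today.

```python
import mimetypes

DEFAULT_MIME_TYPE = "application/octet-stream"


def upload_headers(filename, known_mime_type=None):
    """Build request headers for storing `filename` (hypothetical helper)."""
    # Prefer a MIME type we already know (for example from the METS or from
    # the compression algorithm), then an extension-based guess, then the
    # default that is sent today.
    guessed, _encoding = mimetypes.guess_type(filename)
    return {"Content-Type": known_mime_type or guessed or DEFAULT_MIME_TYPE}
```

The resulting dictionary could then be passed along with whatever HTTP call performs the upload.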

Did I read correctly that you wanted to get information about the other objects in the packages too?? That might be another fascinating problem!! But interested to see how a solution evolves, and echoing the response above, let me know if I can help at all.