Bulk extension lookup for format index data

digipres / workbench

DigiPres Workbench

https://digipres.org/workbench/


Bulk extension lookup for format index data #6

Closed: anjackson closed this issue 4 months ago

anjackson commented 7 months ago

E.g. an institution has a lot of 'format unknown' entries in their repository, based on e.g. PRONOM. Can we use file extension lookups to perform a similar kind of format analysis to the one we do when comparing registries? This would help us understand what the benefits of integrating a different format identification tool might be. YUL are going to look at generating an extension list for this purpose.
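As a rough illustration of the idea, the sketch below takes per-extension file counts from a repository and checks them against an extension index built from one or more registries. The registry names, the sample index entries, and the function name `lookup_extensions` are all hypothetical placeholders, not the Workbench's actual data or implementation.

```python
from collections import defaultdict

# Hypothetical extension index: extension -> {registry name: [format identifiers]}.
# The entries are illustrative placeholders, not real registry records.
EXTENSION_INDEX = {
    "pdf": {"registry_a": ["a/1"], "registry_b": ["b/pdf"]},
    "dat": {"registry_b": ["b/generic-data"]},
}


def normalise(ext):
    """Lower-case an extension and strip whitespace and any leading dot."""
    return ext.strip().lower().lstrip(".")


def lookup_extensions(counts, index):
    """Summarise, per registry, how many files have an extension it knows about."""
    total = sum(counts.values())
    covered = defaultdict(int)
    unmatched = {}
    for raw_ext, n in counts.items():
        ext = normalise(raw_ext)
        registries = index.get(ext, {})
        if not registries:
            # No registry lists this extension at all.
            unmatched[ext] = unmatched.get(ext, 0) + n
        for name in registries:
            covered[name] += n
    return {"total": total, "covered": dict(covered), "unmatched": unmatched}


# Made-up counts standing in for an institution's extension list.
repo_counts = {".PDF": 120, ".dat": 30, ".xyz": 5}
summary = lookup_extensions(repo_counts, EXTENSION_INDEX)
print(summary)
```

Comparing the `covered` totals across registries gives a first estimate of how much a different identification tool might help, bearing in mind that an extension match is only a hint and not a real identification.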

Going through old notes, I also found that some time ago (2014) @bitsgalore made this kind of information available here as part of this comment on this also very relevant blog post.

bitsgalore commented 7 months ago

For some added context, the list of file extensions from the KB e-depot was part of some work I presented in a 2015 blog post that was originally posted on the KB Research blog. Since that blog has sadly disappeared into the digital void (long story), here's a link to the post on my personal blog:

https://www.bitsgalore.org/2015/04/29/top-50-file-formats-in-the-kb-e-depot

anjackson commented 7 months ago

On a related note, @caylinssmith referred me to this post from @johngostick - https://digitalpreservation-blog.lib.cam.ac.uk/identification-and-analysis-of-our-research-repository-file-formats-using-droid-fbb0d7d86222

Highlights...

DROID was unable to identify 41% of the files it scanned in our repository storage

There were 14471 unique file extensions!

I think they would be willing to make content profiles available, and the service those files are from is open access, which makes further investigation easier. This also relates to the idea of making shareable format collection profiles. digipres/registries-of-practice-project#24
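A shareable profile of this kind could be as simple as a table of extension counts. The sketch below tallies extensions from a list of file names. The function name `extension_profile` and the sample file names are made up for illustration, and a real profile would more likely be derived from a DROID export than from bare paths.

```python
from collections import Counter
from pathlib import PurePosixPath


def extension_profile(paths):
    """Count files per lower-cased extension; '' means no extension."""
    return Counter(PurePosixPath(p).suffix.lower().lstrip(".") for p in paths)


# Made-up file names standing in for a repository listing.
sample = ["run1/output.dat", "run1/log.OUT", "thesis.pdf", "README", "run2/output.dat"]
profile = extension_profile(sample)

# Print one tab-separated line per extension, most common first.
for ext, n in profile.most_common():
    print(f"{ext or '(none)'}\t{n}")
```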

Not that there's likely much to be done with .dat and .out! Ah, brings back memories of my days writing physics simulations!

@caylinssmith also said that Leontien may be able to share DROID profiles and similar from the older collections she works with (e.g. from old media formats like floppy disks).

bitsgalore commented 7 months ago

Other thing that just might be useful here, is Jason Scott's Discmaster website, which allows you to browse/search vintage computer files from archive.org (including search by extension!):

https://discmaster.textfiles.com/search

ross-spencer commented 7 months ago

> Other thing that just might be useful here, is Jason Scott's Discmaster website

If this is ever credited, then "unnamed developer + volunteers" might be more accurate. It'd be good to see credited names though, whoever they may be.

anjackson commented 4 months ago

I'm going to move the DISCMASTER idea over to a new issue, as it doesn't quite fit here in the current design. I can't really run bulk extension lookups, because that would hammer the service, so it's probably best if I smooth the path so people can easily find their way there.

anjackson commented 4 months ago

The remainder of this has been implemented. Further polishing remains, but this issue can be closed.