anjackson closed this issue 4 months ago
For some added context, the list of file extensions from the KB e-depot was part of some work I presented in a 2015 blog post that was originally posted on the KB Research blog. Since that blog has sadly disappeared into the digital void (long story), here's a link to the post on my personal blog:
https://www.bitsgalore.org/2015/04/29/top-50-file-formats-in-the-kb-e-depot
On a related note, @caylinssmith referred me to this post from @johngostick - https://digitalpreservation-blog.lib.cam.ac.uk/identification-and-analysis-of-our-research-repository-file-formats-using-droid-fbb0d7d86222
Highlights...
- DROID was unable to identify 41% of the files it scanned in our repository storage
- There were 14471 unique file extensions!
I think they would be willing to make content profiles available, and the service those files are from is open access, which makes further investigation easier. This also relates to the idea of making shareable format collection profiles (digipres/registries-of-practice-project#24).
Not that there's likely much to be done with .dat and .out! Ah, brings back memories of my days writing physics simulations!
@caylinssmith also said that Leontien may be able to share DROID profiles and similar from the older collections she works with (e.g. from old media formats like floppy disks).
Other thing that just might be useful here, is Jason Scott's Discmaster website, which allows you to browse/search vintage computer files from archive.org (including search by extension!):
Other thing that just might be useful here, is Jason Scott's Discmaster website
If this is ever credited, then "unnamed developer + volunteers" might be more accurate. It'd be good to see credited names though, whoever they may be.
I'm going to move the DISCMASTER idea over to a new issue, as it doesn't quite fit here in the current design. I can't really run bulk extension lookups, because that would hammer the service, so it's probably best if I smooth the path so people can easily find their way there.
The remainder of this has been implemented. Further polishing remains, but this issue can be closed.
e.g. an institution has a lot of 'format unknown' in their repository, based on e.g. PRONOM. Can we use file extension lookups to perform a similar kind of format analysis as when comparing registries? This would help understand what the benefits might be of integrating a different format identification tool. YUL are going to look at generating an extension list for this purpose.
Going through old notes, I also found that some time ago (2014) @bitsgalore made this kind of information available here, as part of this comment on this blog post, which is also very relevant.
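To make the extension-list idea above a bit more concrete, here is a minimal sketch of how an institution might tally the extensions of its 'format unknown' files from a DROID-style CSV export. The sample data is made up, and the column names (NAME, TYPE, PUID) are an assumption about what the export contains, so check them against a real export before relying on this.

```python
import csv
import io
from collections import Counter
from pathlib import PurePath

# Made-up sample resembling a DROID CSV export. The column names
# (NAME, TYPE, PUID) are assumptions and may differ in a real export.
SAMPLE_CSV = """NAME,TYPE,PUID
report.pdf,File,fmt/18
run01.dat,File,
run02.DAT,File,
results.out,File,
notes.txt,File,x-fmt/111
README,File,
data,Folder,
"""


def unidentified_extensions(csv_text):
    """Count the extensions of file rows that have no PUID."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Skip folders and anything that was successfully identified.
        if row["TYPE"] != "File" or row["PUID"].strip():
            continue
        # Normalise case so .DAT and .dat are counted together.
        ext = PurePath(row["NAME"]).suffix.lower()
        counts[ext or "(none)"] += 1
    return counts


counts = unidentified_extensions(SAMPLE_CSV)
for ext, n in counts.most_common():
    print(ext, n)
```

The resulting extension counts are the kind of list that could then be looked up against the different registries, one extension at a time, without needing to share the files themselves.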