archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: if file format identification fails, there is no way to retry using a different method #1190

Open peterVG opened 4 years ago

peterVG commented 4 years ago

Please describe the problem you'd like to be solved

Assuming Siegfried is set as the enabled Identification tool, if it fails to identify a file's format, then processing will continue without having reliable information on which to base characterization, extraction, normalization, transcription, and validation microservices. The same could happen if Fido or "Identifying by File Extension" were the enabled Identification option.

Describe the solution you'd like to see implemented

One possible solution: if the enabled Identification tool fails, retry Identification with the second prioritized option; if that also fails, try the third prioritized option.

However, running another tool isn't necessarily a good solution because it could gloss over an issue with the files, if the second tool is more permissive than the first.

Another possible solution is running Siegfried or FIDO against other registries besides PRONOM.

Other possible approaches should be considered.

Describe alternatives you've considered

Additional context

Related to https://github.com/archivematica/Issues/issues/501
Related to https://github.com/archivematica/Issues/issues/584
Related to https://github.com/archivematica/Issues/issues/860



mjaddis commented 4 years ago

I've done something similar to this in Archivematica, so some thoughts/comments based on that. I wrote a wrapper around Siegfried and FIDO so it runs Siegfried first and then FIDO if Siegfried doesn't identify a given file format. It also has some nasty 'desperate measures' fallbacks in case neither Siegfried nor FIDO works, e.g. using the file extension whereby '.wp4' is assumed to be 'fmt/949' (yes, dangerous approach I know, but some old word processing files don't match at all to PRONOM).

There are a few nuances that might be interesting:

Firstly, I run Siegfried before FIDO because we've found that Siegfried is a lot faster on big TIFF files. FIDO can take ages. So running Siegfried first speeds up file format identification for datasets with lots of big images, e.g. digitised books. But Siegfried isn't always great with MS Office formats (Word, ppt etc.). It doesn't fail but instead sometimes returns fmt/111 (OLE2 Compound Document Format). This doesn't help that much in the rest of the Archivematica workflow because it's too generic. But it's not a failure either. Therefore, what I do is check whether Siegfried reports fmt/111 and, if it does, run FIDO afterwards. FIDO often comes up with a more specific PRONOM ID, e.g. a particular MS Word version. The net result is that we get the speed of Siegfried for large images and the precision of FIDO for Office formats (which is important in our case because we also do normalisation of office formats through other FPR additions we run).

So my suggestion is not only to look at whether a file format identification tool fails or not, but also the result it gives. On that basis the next tool in the chain can be tried to see if it does a better job. Likewise, the order of tools can make a difference to performance and making this configurable would be cool.
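For anyone wanting to experiment, the chain described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual wrapper and not Archivematica code: `identify`, `TOO_GENERIC` and `EXTENSION_FALLBACK` are hypothetical names, and each tool is passed in as a plain function that takes a path and returns a PUID string or `None`, so the sketch does not depend on how Siegfried or FIDO is actually invoked.

```python
import os
from typing import Callable, Optional, Sequence, Tuple

# Assumption: results treated as "not a failure, but too generic to be useful".
TOO_GENERIC = {"fmt/111"}  # OLE2 Compound Document Format

# Assumption: last-resort extension map. The '.wp4' entry comes from the
# comment above; anything else added here would be site-specific guesswork.
EXTENSION_FALLBACK = {".wp4": "fmt/949"}

# A tool is any callable mapping a file path to a PUID, or None on no match.
Tool = Callable[[str], Optional[str]]


def identify(
    path: str,
    tools: Sequence[Tuple[str, Tool]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, tool) pair in order and return (puid, source).

    A specific result is returned immediately. A too-generic result is
    remembered and only used if no later tool does better. The extension
    map is consulted only when every tool returns nothing at all.
    """
    generic_hit = None
    for name, tool in tools:
        puid = tool(path)
        if puid is None:
            continue  # this tool failed; try the next one in the chain
        if puid not in TOO_GENERIC:
            return puid, name
        if generic_hit is None:
            generic_hit = (puid, name)
    if generic_hit is not None:
        return generic_hit
    ext = os.path.splitext(path)[1].lower()
    if ext in EXTENSION_FALLBACK:
        return EXTENSION_FALLBACK[ext], "extension"
    return None, None
```

Because the order of `tools` is just the order of the list, putting the faster tool first for image-heavy datasets is a configuration choice rather than a code change. Returning the name of the tool that produced the result also keeps a record of when a fallback was used, which matters given the concern that a more permissive second tool could hide a real problem with a file.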

Hope this helps.

sromkey commented 4 years ago

@peterVG (or anyone interested) this issue might be phrased better as a problem statement, rather than a solution, so that a variety of use cases could be addressed. Interesting idea to discuss!

peterVG commented 4 years ago

@sromkey good suggestion, just implemented.

richardlehane commented 3 years ago

Joining this discussion late, but I'd recommend taking that second approach and swapping registries rather than tools in this scenario. It seems pretty dangerous to me to go tool shopping trying to get the result you are after, rather than identifying the root cause. If siegfried isn't matching (or is giving you too generic a result), this should mean that PRONOM hasn't matched: either because of some subtle issue with the file or because the PRONOM signature is too finickity. In such a case, it could make sense to see what a different registry (like freedesktop.org or wikidata) thinks, provided you are then able to handle a non-PUID ID.
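As a rough illustration of what "handling a non-PUID ID" could look like on the consuming side, here is a small sketch. It assumes the identification result for a file is available as a dict with a `matches` list whose entries carry `ns` (the registry namespace) and `id` keys, and that an unidentified file is reported with the id `UNKNOWN`; check those assumptions against the JSON your siegfried version actually emits. The function name `pick_match` and the namespace labels in `preferred` are hypothetical and depend on how the signature file was built.

```python
from typing import Optional, Sequence, Tuple


def pick_match(
    file_entry: dict,
    preferred: Sequence[str] = ("pronom", "freedesktop.org", "wikidata"),
    skip: frozenset = frozenset({"UNKNOWN"}),
) -> Optional[Tuple[str, str]]:
    """Return (namespace, id) from the highest-priority registry that matched.

    Ids listed in `skip` are treated as no match, so a caller can also add
    ids it considers too generic, e.g. skip={"UNKNOWN", "fmt/111"}.
    """
    by_ns = {m["ns"]: m["id"] for m in file_entry.get("matches", [])}
    for ns in preferred:
        ident = by_ns.get(ns)
        if ident and ident not in skip:
            return ns, ident
    return None
```

Returning the namespace alongside the id makes it explicit downstream that the value may be a MIME type or a Wikidata identifier rather than a PUID, so format policy rules keyed on PUIDs are not applied to it by accident.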

In terms of assessing siegfried's accuracy, I'd recommend looking at the benchmarks page. On that page you can see not only how the different tools compare on speed but also the extent to which they agree/disagree on identifications. You can filter those pages to see how they do on different classes of formats (e.g. @mjaddis, filter the govdocs table for .ppt or .doc to see how they do on MS Office formats).