digital-preservation / droid

DROID (Digital Record and Object Identification)
BSD 3-Clause "New" or "Revised" License

Read bandwidth for a ZIP file depends on combination of versions binary signature file / container file #906

Open mikaelmechoulam opened 1 year ago

mikaelmechoulam commented 1 year ago

Hello, when identifying a ZIP file with matchBinarySignatures (DROID 6.6), a profiling tool shows that the ZIP file appears to be read in full one or several times (the ratio depends on the version of the chosen signature file). When the profiler is attached to the DroidUI tool instead (with analysis of archive contents disabled, of course), it shows very low read bandwidth. The max bytes to scan is set to the same value (the maximum) in both cases. For example, for a 4 GB file, all 4 GB are read when using the code, but only a few kilobytes when using DroidUI. We tried the same comparison with an MP4 file and there is no such discrepancy. Do you have an explanation? Thanks,

steve-daly commented 1 year ago

Hi @mikaelmechoulam can you confirm which mechanism you're using when you see the slow performance? Is this the new Java API or one of the other command-line mechanisms? Could you give a command or code snippet that represents what you're using so we can try to reproduce it?

mikaelmechoulam commented 1 year ago

@steve-daly We reproduced it both by calling matchBinarySignatures directly and with the new Java API, using the following code:

    final DroidAPI droidAPI = DroidAPI.getInstance(
            Paths.get("../droid/DROID_SignatureFile_V90.xml"),
            Paths.get("../container-signature-20220704.xml"));
    final List<ApiResult> results = droidAPI.submit(Paths.get("../hugezip.zip"));

steve-daly commented 1 year ago

Thanks @mikaelmechoulam That Java API has only been added for internal testing at this stage so we wouldn't recommend using it in any production applications yet. We'll look into this performance issue though.

Note that our main version is still 6.5.2 currently and although 6.6.0 is visible on GitHub we'll hopefully be releasing 6.6.1 shortly to fix some issues noticed, and that version will be accompanied by some additional information about all the new features since 6.5.2.

mikaelmechoulam commented 1 year ago

@steve-daly Thanks, but please note that we also reproduce the issue in earlier versions (6.5.2, for example) and without using the API (by calling matchBinarySignatures).

steve-daly commented 1 year ago

Thanks, we'll take a look. Just to mention that you're using an unmatched pair of signature definition files there. V90 is from 2017 and should match container 20170330, whereas the container sig used (20220704) should match binary v107. That's not going to explain your issue here, but it may cause other undefined identification behaviour.

mikaelmechoulam commented 1 year ago

@steve-daly We have more details (we're still investigating to help you). Indeed, the main factor seems to be the combination of container and binary signature files. It has nothing to do with a difference between DroidUI and a call to the DROID libraries. Surprising as it may seem, if we use a "latest" combination, for example container-signature-20221102 with binary v110, we see this huge bandwidth (the bytes read are roughly the size of the file!). It is even worse with the combination of container 20220704 and binary v106 (the file is read around 4 times!). We would have expected the opposite. Strangely, some other combinations show very low bandwidth (only a few KB) and still identify the file successfully. But how do we know what a "proper" combination of container and binary files is (only the time of year)? Where is the link between the two files? And why do some combinations have this effect on bandwidth?

steve-daly commented 1 year ago

Thanks @mikaelmechoulam it's interesting to see different behaviour with different signatures. If you look in the header of the binary signature XML file you should see a creation date which should point to the appropriate container signature that accompanies it. That said, I think I can see a discrepancy with the current signature set, which I'll look into now.

ZIP files specifically will trigger the container signature logic, where it tries to look for other files which use the ZIP format under the hood (such as Word Documents) and this means many more comparisons, but maybe there's something unexpectedly varying and causing this behaviour, so we'll look into this.

mikaelmechoulam commented 1 year ago

@steve-daly Here is an example comparison of read bandwidths for a 4 GB file.

[image: comparison of read bandwidths per signature file combination]

This puts us in an awkward position, because today we have to recommend that our customer either use a newer container file with their older signature file, or modify that signature file manually (which is not ideal for future upgrades). I have changed the title of the ticket to be more accurate.

ross-spencer commented 1 year ago

@mikaelmechoulam @steve-daly really interesting issue. I was wondering if the same behavior is seen in Nanite which is a third party tool by Andy Jackson at the British Library that uses DROID and possibly implements something close to what you're doing Mike? https://github.com/openpreserve/nanite (Not sure how easy it is for you to measure the same effect?). If the problem doesn't occur then maybe Nanite also demonstrates an implementation pattern you can adopt to avoid this?

RE: The combinations table above: clearly there are more file formats in both sets of signature file, i.e. the latest container signature and non-container signature. Perhaps the number of permutations of potential matches in the non-container signature is increasing with v111 and 2023-03-07 and amplifying the behavior?

RE: Behavior in the UI: you may also need to check that the maximum bytes to scan is set to -1 to reliably recreate the issue in the first place. The image below shows the default of 65535:

[image: maximum bytes to scan setting at its default of 65535]

steve-daly commented 1 year ago

Hi @mikaelmechoulam I suspect that the fourth item in your comparison grid, where you have unmatched Binary and Container signature files, is probably just causing DROID to not run detection fully, rather than somehow working perfectly but faster.

It's interesting to see the correct pairing for v90 is also showing the behaviour (although not as bad as with v106) so it's not a recent introduction in the signatures, which I was going to check for.

Could you try a binary-only match, for example by using 6.5.2 and running 'no-profile mode' on the command line with the -Nr switch (followed by the file/folder to check), passing the binary signature file location with the -Ns switch? That would avoid all the checking that happens following a match with the ZIP format.

anjackson commented 1 year ago

I might be misunderstanding how DROID works, but I think scanning the whole thing is the expected behaviour for ZIP files, and has been for some time. I wrote up my reasoning in this blog post rather than take up space here. But briefly: there are binary signatures that use an unbounded wildcard after the ZIP signature (e.g. 504B0304*4D45...), and as no byte limit is set, I'd expect a full file scan to happen for all ZIP files, multiple times (once for each binary signature of this form).

If that's right, then I think @mikaelmechoulam may be seeing a cache exhaustion effect, where for some reason the system DROID is running on no longer has enough free memory to effectively cache large files.
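To illustrate the point about unbounded wildcards, here is a minimal Java sketch. It is not DROID's actual matching code: it assumes a simplified signature made of a prefix anchored at BOF, an unbounded `*`, and a second fragment that may appear anywhere later, and it counts how many bytes a naive left-to-right scan has to examine. The class and method names are hypothetical.

```java
import java.util.Arrays;

/** Toy model of a "PREFIX*FRAGMENT" signature; not DROID's real matcher. */
final class UnboundedWildcardScan {

    /** Outcome of one scan: match flag plus the number of bytes examined. */
    static final class ScanResult {
        final boolean matched;
        final long bytesExamined;

        ScanResult(boolean matched, long bytesExamined) {
            this.matched = matched;
            this.bytesExamined = bytesExamined;
        }
    }

    static ScanResult scan(byte[] data, byte[] prefix, byte[] fragment) {
        // The prefix is anchored at the beginning of the file (BOF).
        if (data.length < prefix.length
                || !Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length)) {
            return new ScanResult(false, Math.min(data.length, prefix.length));
        }
        // The wildcard has no upper bound, so the fragment may start anywhere
        // after the prefix and every remaining position has to be tried.
        for (int i = prefix.length; i + fragment.length <= data.length; i++) {
            if (Arrays.equals(data, i, i + fragment.length, fragment, 0, fragment.length)) {
                return new ScanResult(true, (long) i + fragment.length);
            }
        }
        return new ScanResult(false, data.length); // no match: the whole input was read
    }

    public static void main(String[] args) {
        byte[] zipMagic = {0x50, 0x4B, 0x03, 0x04}; // 504B0304
        byte[] fragment = {0x4D, 0x45};             // 4D45, as in 504B0304*4D45...
        byte[] plainZip = new byte[1_000_000];      // stand-in for a large ZIP file
        System.arraycopy(zipMagic, 0, plainZip, 0, zipMagic.length);

        ScanResult result = scan(plainZip, zipMagic, fragment);
        System.out.println("matched=" + result.matched
                + ", bytesExamined=" + result.bytesExamined);
    }
}
```

In this toy model a non-ZIP input is rejected after the first few bytes, while a ZIP that does not contain the second fragment is examined to the end, once per signature of this shape.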

EDIT 2023-03-22: I don't want to take any more space up in this thread, but just wanted to thank everyone for a very helpful discussion which has improved my understanding of how DROID works. There's some details here if anyone's really that interested!

tnafrancesca commented 1 year ago

@mikaelmechoulam having read the blog that @anjackson wrote we made a draft modification to the binary signature, that if it works well we can hopefully add to the next official PRONOM release. It seems to speed up DROID internally for us.

Does using the signature in the zip file, alongside the latest container signature speed up DROID processing? Just to confirm the binary signature in this file is just for testing purposes and not an official release.

The difference in this signature is that we have completely removed the binary signature for fmt/161 and x-fmt/412 as they already have container signatures. If it works well then we will have to do some tests to see if the identification is affected by not having the binary signature, which it doesn't seem to be so far.

The plan is to do a bit more research into how many file formats are identified by both container and binary signature, and whether this is required in each case. That applies especially to those with the * wildcard which, as @anjackson correctly points out, definitely slow down processing.

DROID_SignatureFile_draft_V109.zip

tnafrancesca commented 1 year ago

Here is a rough mapping of each binary signature and the corresponding container signature:

Binary Signature | Container Signature
-- | --
V46-59 | 04 February 2011
V60-61 | 11 June 2012
V62-65 | 28 August 2012
V66 | 18 December 2012
V67 | 26 February 2013
V68-71 | 01 May 2013
V72 | 12 November 2013
V73-74 | 27 February 2014
V75-77 | 17 July 2014
V78-80 | 23 September 2014
V81 | 18 February 2015
V82 | 27 March 2015
V83 | 17 December 2015
V84 | 21 January 2016
V85 | 29 June 2016
V86-87 | 27 July 2016
V88 | 27 September 2016
V89 | 08 March 2017
V90 | 30 March 2017
V92 | 20 September 2017
V93 | 30 November 2017
V94-95 | 20 September 2018
V96 | 21 January 2020
V97 | 01 October 2020
V98-99 | 27 October 2021
V100-102 | 16 December 2021
V103-105 | 11 March 2022
V106-107 | 04 July 2022
V108 | 05 September 2022
V109 | 02 November 2022
V111 | 16 March 2023
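For anyone who wants to check a pairing programmatically, the mapping above can be turned into a small lookup. The following is only a sketch covering the rows from V88 onwards. The class and method names are hypothetical, and it assumes every container file follows the `container-signature-YYYYMMDD.xml` naming seen for the three files mentioned in this thread.

```java
import java.util.Optional;

/** Lookup built from an excerpt (V88 onwards) of the rough mapping above. */
final class SignaturePairing {

    // {first binary version, last binary version, container date as yyyyMMdd}
    private static final int[][] PAIRS = {
        {88, 88, 20160927},
        {89, 89, 20170308},
        {90, 90, 20170330},
        {92, 92, 20170920},
        {93, 93, 20171130},
        {94, 95, 20180920},
        {96, 96, 20200121},
        {97, 97, 20201001},
        {98, 99, 20211027},
        {100, 102, 20211216},
        {103, 105, 20220311},
        {106, 107, 20220704},
        {108, 108, 20220905},
        {109, 109, 20221102},
        {111, 111, 20230316},
    };

    /** Returns the container file name listed for a binary version, if any. */
    static Optional<String> containerFor(int binaryVersion) {
        for (int[] pair : PAIRS) {
            if (binaryVersion >= pair[0] && binaryVersion <= pair[1]) {
                // Assumed naming pattern, see the note above.
                return Optional.of("container-signature-" + pair[2] + ".xml");
            }
        }
        // Versions absent from the table (e.g. V91, V110) are deliberately
        // not guessed from their neighbours.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(containerFor(90));
        System.out.println(containerFor(106));
        System.out.println(containerFor(110));
    }
}
```

Returning an empty result for unlisted versions, rather than falling back to the nearest row, keeps the lookup from inventing pairings the table does not state.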

mikaelmechoulam commented 1 year ago

First thanks @tnafrancesca, @anjackson @steve-daly and @ross-spencer for your help and suggestions. Here are our current conclusions.

Our investigation revealed that there was no deletion of signatures in the relevant files over the versions, but there was a mechanism that allowed for the removal of signatures previously stored in memory and related to containers during the reading and initialization of containers (Class ContainerIdentifierInit, line 79: droidCore.removeSignatureForPuid). 

Upon verification, this method appears to be called for the PUIDs x-fmt/412 and fmt/161, and we obtained similar performances with an official V109 signature file and the one you provided to us (the whole file is read).  

The search for the signature with ID 1986 (AGS 4 Data Format, fmt/1649) appears to be the most costly (several seconds) with the DROID_SignatureFile_V109.xml and container-signature-20221102.xml files. Removing this signature gives a read bandwidth of 1036 Kb instead of 1298 MB (which is near the size of the file).  

To improve performance for our client, we removed some signatures based on the ZIP format in the V90 signature file that were not identified by container via the container-signature-20170330.xml file, which were costly and not necessary to identify for this client. The relevant formats are now identified as simple ZIPs (fmt/139, fmt/296, fmt/297, and fmt/161 like you, but which was not in the container file at the time).

Perhaps this approach of customizing the signature file for our client is the best option, and poor performance is inherent to the strategy of a shared signature file? The article https://martin.hoppenheit.info/blog/2017/minimizing-the-droid-signature-file appears to suggest this.

Dclipsham commented 1 year ago

Interesting.

AGS 4 is a rare signature type where the sequence is listed as fully variable, so if the file is say 20GB it'll scan the whole thing looking for a match.

Given this is a format from 1991, it could be that instances of this format may be in the KB size and possibly low MB range, although I note that the spec continues to get regular revisions, so maybe larger.

As such even though the signature sequence itself may be necessarily variable in nature, it might be more efficient to bound it to a BOF with a sizable (but therefore not-infinite) offset range.

@tnafrancesca and rest of PRONOM team if the samples you have are sharable I'd be happy to have a look and offer advice.

David

Dclipsham commented 1 year ago

In fact, from the samples found here: http://ags-archive.forumcourt.myzen.co.uk/datatransferv4/example.php and the description in the spec https://www.ags.org.uk/content/uploads/2022/02/AGS4-v-4.1.1-2022.pdf "At the top of the hierarchy is the PROJ Group, with the majority of other Groups below this." - section 3.1, page 5

...it sounds like the signature can be anchored near the BOF without a particularly big range, and doesn't seem to require full variability at all.

Dclipsham commented 1 year ago

I've just had a quick check. At least as far as v109 (I've yet to add the latest release to my working data, though that should happen this week), of the 90 variable signature sequences, only fmt/1649 is NOT accompanied by an additional BOF or EOF sequence that anchors the signature and therefore prevents full file scanning. Changing that to something like BOF with a max offset of 64 or so (a wider sample pool is needed to confirm that 64 is sufficient, but it seems reasonable from the samples I've harvested), plus removing binary signatures where formats already have container signatures, should have a noticeably positive impact on processing time.

Do note that, where binary signatures already existed at the time of container signature creation, these were intentionally retained to ensure backwards compatibility for those formats with DROID 5. So there was a rational reason for keeping them, but it has been about 12 years, so one would hope that all users are at least on some version of DROID 6.x by now.

Of course there's nothing to stop anybody implementing custom signature files, and we do it in my organisation too for some of our more novel use-cases, but for the core registry it would be preferable if these kinds of issues could be tackled for the benefit of all.

Dclipsham commented 1 year ago

Here is a complete list of formats as of v109 that have both a binary and a container signature present:

PUID | Format
-- | --
fmt/40 | Microsoft Word Document 97-2003
fmt/61 | Microsoft Excel 97 Workbook (xls) 8
fmt/125 | Microsoft Powerpoint Presentation 95
fmt/126 | Microsoft Powerpoint Presentation 97-2003
fmt/136 | OpenDocument Text 1.0
fmt/137 | OpenDocument Spreadsheet 1.0
fmt/138 | OpenDocument Presentation 1.0
fmt/139 | OpenDocument Graphics 1.0
fmt/140 | OpenDocument Database Format 1.0
fmt/161 | SIARD (Software-Independent Archiving of Relational Databases) 1.0
fmt/290 | OpenDocument Text 1.1
fmt/291 | OpenDocument Text 1.2
fmt/292 | OpenDocument Presentation 1.1
fmt/293 | OpenDocument Presentation 1.2
fmt/294 | OpenDocument Spreadsheet 1.1
fmt/295 | OpenDocument Spreadsheet 1.2
fmt/296 | OpenDocument Graphics 1.1
fmt/297 | OpenDocument Graphics 1.2
fmt/482 | Apple iBook format
fmt/483 | ePub format
x-fmt/88 | Microsoft Powerpoint Presentation 4.x
x-fmt/412 | Java Archive Format
x-fmt/430 | Microsoft Outlook Email Message 97-2003

tnafrancesca commented 1 year ago

@Dclipsham thank you for the list. We also had fmt/39 in ours but apart from that same results?

Just to provide some reassurance, there is no plan to delete all the replicated binary signatures without full consideration. We were discussing support for other versions of DROID while looking into this. We would also look at the history of each signature, test it ourselves, and test with the community before doing this work. If it substantially sped up DROID, it could be worth it, provided it didn't impact identification. Our survey has also just closed, so I'm hoping now is a good time to take a look at how far back we should be supporting DROID; it should give us an insight into how many people are using the different versions. I think some of the previous versions without container signatures are already tricky to run on modern laptops.

I will see if the file samples for AGS4 can be shared on here- or if not we can work with you/ take another look at the signature internally and change it for the next release.

Dclipsham commented 1 year ago

Cheers Francesca,

Oh I'm all for removing them. I would be extremely surprised to learn if there's any need to support DROID 5, but obvs that's just my personal opinion.

It would for sure be useful to measure any impact to identification outcome though - I would hope this would be minimal.

Yep fmt-39 was on my list too I just didn't copy far enough up! so that's: fmt/39 - Microsoft Word Document 6.0/95

One further observation is that the following two formats share a binary signature: fmt/61 - Microsoft Excel 97 Workbook (xls) 8 and fmt/62 - Microsoft Excel 2000-2003 Workbook (xls) 8X.

But only fmt/61 has a Container signature. This one certainly warrants closer attention.

Hope this is of use!

kathaurielle commented 1 year ago

Hi all, inc @Dclipsham,

I looked at the AGS samples we have and the offsets do vary wildly. These are they:

Sample file | Offset
-- | --
HE Geodata file 1448.ags | 23
HE Geodata file 1767.ags | 9
HE Geodata file 500.ags | 11
HE Geodata file 503.ags | 7
no-has star-HE Geodata file 1511.ags | 9
no-HE Geodata file 1684.ags | 1,393,308
no-HE Geodata file 509.ags | 240,474

I can only think of giving it a massive offset, and responding to anyone who says their file isn't IDing and then widening it further, but am I missing a more sophisticated solution?

All samples are sharable and were from here: https://webapps.bgs.ac.uk/services/ngdc/accessions/index.html and maybe elsewhere in public domain, but defo all public.

kathaurielle commented 1 year ago

Actually, as Steve's pointed out, the first byte is 22, so a 22 at the BOF would help. These are the first bytes of each. I'm not convinced adding "GROUP" | "PROJ" would capture them all. Going to try to find some more files.

    22 47 52 4F 55 50 22 2C 22 50 52 4F 4A 22 0D 0A
    22 47 52 4F 55 50 22 2C 22 50 52 4F 4A 22 0D 0A
    22 47 52 4F 55 50 22 2C 22 43 44 49 41 22 0D 0A
    22 47 52 4F 55 50 22 2C 22 43 44 49 41 22 0D 0A
    22 2A 2A 50 52 4F 4A 22 0D 0A 22 2A 50 52 4F 4A
    22 2A 2A 50 52 4F 4A 22 0D 0A 22 2A 50 52 4F 4A
    22 2A 2A 50 52 4F 4A 22 0D 0A 22 2A 50 52 4F 4A

As text, the three distinct openings are:

    "GROUP","PROJ"
    "**PROJ" "*PROJ
    "GROUP","CDIA"
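For readers who want to check the hex against the text, a few lines of Java are enough to decode the dumps. This is just a convenience sketch with a hypothetical class name; it assumes the bytes are plain ASCII, which holds for the values listed above.

```java
import java.nio.charset.StandardCharsets;

/** Decodes a space-separated hex dump into its ASCII text. */
final class HexDump {

    static String toAscii(String spacedHex) {
        String[] parts = spacedHex.trim().split("\\s+");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            // Each token is one byte written as two hex digits.
            bytes[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String first = toAscii("22 47 52 4F 55 50 22 2C 22 50 52 4F 4A 22 0D 0A");
        String last = toAscii("22 2A 2A 50 52 4F 4A 22 0D 0A 22 2A 50 52 4F 4A");
        // Show CR and LF visibly instead of breaking the output line.
        System.out.println(first.replace("\r", "\\r").replace("\n", "\\n"));
        System.out.println(last.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```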

Dclipsham commented 1 year ago

Thanks Kathryn, that's really useful.

I very much agree that anchoring to 0x22 would be a useful start, and there's strong potential in something like "(GROUP|**PROJ)", or '22(47524F5550|2A2A50524F4A)22' if expressed as a sig sequence. Even if the variability proves to be too much and the starter group codes are many things other than just GROUP and PROJ, something like "{0-2097152}PROJ_ID",", or signature sequence '22{0-2097152}50524F4A5F4944222C22'

...should be far more efficient than the current variable sequence. This would basically be looking for the PROJ_ID within the first 2MB of a file, but only if the opening character is a double-quote. Again, it might not catch everything tho, and it's possibly not ideal if the format allows additional comments before the opening group code, but it's wide enough to deal with the furthest offset you've observed so far, which was around 1.3MB.
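As a rough illustration of why the bounded form is cheaper, here is a minimal Java sketch of how a sequence like '22{0-2097152}50524F4A5F4944222C22' could be evaluated. It is not DROID's implementation: the class name, method name, and the way the gap is interpreted (0 to maxGap bytes between the opening quote and the marker) are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy evaluation of 22{0-N}50524F4A5F4944222C22; not DROID's real matcher. */
final class BoundedAgsMatch {

    static final int MAX_GAP = 2_097_152; // 2 MB, as in the proposed sequence

    // 50524F4A5F4944222C22 is the ASCII text PROJ_ID","
    static final byte[] MARKER = "PROJ_ID\",\"".getBytes(StandardCharsets.US_ASCII);

    /** Returns the offset of the marker, or -1 if the bounded search fails. */
    static int find(byte[] data, int maxGap) {
        // Anchor: the file must open with a double quote (0x22).
        if (data.length == 0 || data[0] != 0x22) {
            return -1; // rejected after a single byte
        }
        // The marker may start at most maxGap bytes after the opening quote,
        // so the search stops there no matter how large the file is.
        int lastStart = (int) Math.min(1L + maxGap, (long) data.length - MARKER.length);
        for (int i = 1; i <= lastStart; i++) {
            if (Arrays.equals(data, i, i + MARKER.length, MARKER, 0, MARKER.length)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Illustrative content only, not taken from a real sample file.
        byte[] sample = "\"GROUP\",\"PROJ\"\r\n\"HEADING\",\"PROJ_ID\",\"PROJ_NAME\"\r\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(find(sample, MAX_GAP));
    }
}
```

With this shape, a file that does not start with a double quote costs one byte to reject, and any other file costs at most a little over 2 MB, instead of its full length.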

I'm conscious that this has derailed from the initial issue though so perhaps this one needs its own, separate to the ZIP/Container handling?

kathaurielle commented 1 year ago

Thanks David, I've strengthened and tested the sig, and it now has a BOF sequence, which should help with the speed issue. Offset 0-4 from beginning of file, magic bytes:

    "(GROUP|**PROJ)""{0-1}PROJ_ID",{0-1}("PROJ_NAME","|"PROJ_AGS",")"{0-1}ABBR_HDNG","{0-1}ABBR_CODE","

Position type: Absolute from BOF
Offset: 0
Maximum Offset: 4
Byte order:
Value:

    22(47524F5550|2A2A50524F4A)22
    22{0-1}50524F4A5F4944222C22{0-1}(50524F4A5F4E414D45222C22|50524F4A5F41475322)*22{0-1}414242525F48444E47222C22{0-1}414242525F434F4445222C22

Dclipsham commented 1 year ago

Thanks Kathryn, I'll run that through my test set tomorrow. David