Closed J08nY closed 2 years ago
@adamjanovsky This should make the tests fail, yet it doesn't; apparently we do not test full extraction from a real certificate and compare the full output anywhere.
@J08nY there's also a stale issue #14. Could you please discuss with @petrs and possibly address this as well? At first sight it should be trivial, just add a few more regexes (that Petr already defined).
I am worried about performance though. We can't really just go and add regexes willy-nilly. Each regex costs something, and there seem to be 100+ of the APDU ones; I am not sure they are worth it (@petrs wdyt?). I am currently trying to optimize our regex matching a bit, so that compiled regexes are kept and not recompiled for each and every document. Without that, these 100 APDU regexes might add something like 30 minutes of processing to the full run.
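The caching idea above can be sketched roughly like this (a minimal sketch, not the project's actual code; `compiled` and `count_matches` are hypothetical helper names): compile each rule string once and reuse the compiled pattern across all documents, instead of letting `re` recompile it per document.

```python
import re
from functools import lru_cache


# Hypothetical helper: each distinct rule string is compiled exactly once
# and the compiled pattern is reused for every subsequent document.
@lru_cache(maxsize=None)
def compiled(rule: str) -> re.Pattern:
    return re.compile(rule)


def count_matches(rules: list[str], text: str) -> dict[str, int]:
    """Count how many times each rule matches in the given document text."""
    counts = {}
    for rule in rules:
        # finditer (rather than findall) so that capture groups inside the
        # rule, e.g. "AES-?(128|192|256)", do not change what gets counted.
        n = sum(1 for _ in compiled(rule).finditer(text))
        if n:
            counts[rule] = n
    return counts
```

With ~100 extra APDU rules the win comes from `compiled("...")` returning the same cached `re.Pattern` object on every call after the first.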
@adamjanovsky This should now be ready.
Btw. I think we should refactor the way we do rule matching a bit. What we have currently is a list of named groups, where each group is a list of regex rules. The issue with that approach is that if you want to do data analysis at a more granular level than the groups, you have to dig into the result matches and pick out individual regex results:
"rules_symmetric_crypto": {
"AES": 2,
"AES-128": 3,
"AES-256": 4
}
This way we actually lose the information that all of these are matches of a single regex, and that that regex covers AES variants. Our current rules structure already loses this info, as we do not really have a policy on grouping things into one regex (only into rule groups). For example:
"rules_symmetric_crypto" = [
"AES",
"AES-?(128|192|256)",
...
]
vs something new like
"rules_symmetric_crypto" = {
"AES": [
"AES",
"AES-?(128|192|256)"
],
...
}
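Matching against the proposed nested structure could look roughly like this (a sketch under my own assumptions; `match_rules` is a hypothetical function, not existing code): counts are still kept per individual regex, but they stay grouped under the human-readable rule name, so the link "all of these matches describe AES variants" is preserved in the output.

```python
import re

# Proposed nested rule structure: group -> rule name -> list of regexes.
RULES = {
    "rules_symmetric_crypto": {
        "AES": [
            "AES",
            "AES-?(128|192|256)",
        ],
    },
}


def match_rules(rules: dict, text: str) -> dict:
    """Match nested rules against text, keeping per-regex counts
    grouped under their named rule."""
    results = {}
    for group, named_rules in rules.items():
        group_results = {}
        for name, patterns in named_rules.items():
            per_regex = {}
            for pattern in patterns:
                n = sum(1 for _ in re.finditer(pattern, text))
                if n:
                    per_regex[pattern] = n
            if per_regex:
                group_results[name] = per_regex
        if group_results:
            results[group] = group_results
    return results
```

The analysis code can then aggregate at whichever level it needs: per group, per named rule, or per individual regex, without reverse-engineering which regex produced which match.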
Ok, closing #14 as wont-fix.
Ok, creating new issue #227 out of this. Could you please take a look at that? I don't see an easy solution atm. Alternatively we can discuss next week.
This moves towards #223.