adamjanovsky closed this 1 year ago
I applaud this effort! When I get more time hopefully I can help as well.
In the meantime, let me add a few pointers/issues I hit when working with the FIPS code:
* It is unclear whether fields such as `clean_cert_ids` or `algorithms` in the various certificate subobjects (PdfData and WebData) are processed or raw, what sort of cleaning was done, and what method needs to be called to do it. Specifically, this cleaning that is done on FIPS certs is split between the dataset class and the certificate class in a nasty way; this should be unified and moved, maybe even to a separate class, similar to how CC certificate ID canonicalization is separate.
* `parse_html_main`, the logic there just is not good.
* Data from the FIPS algorithm dataset is not utilized and mined fully. We can follow the links to the algorithm page and get more data that will help us.
> Data from the FIPS algorithm dataset is not utilized and mined fully. We can follow the links to the algorithm page and get more data that will help us.
Thanks for the input. This I'll probably leave for future work, this PR is going to be giant anyway. Could you please create a separate issue to address this?
Did this in #276.
I'll re-implement the folder structure of FIPS Dataset. Outline:

```
dset
├── FIPS Dataset.json
├── auxillary_datasets
│   └── algorithms
│       └── fips_algorithms.json
├── certs
│   ├── modules
│   └── policies
│       ├── pdf
│       └── txt
└── web
    ├── fips_modules_active.html
    ├── fips_modules_historical.html
    └── fips_modules_revoked.html
```
@J08nY looking at the code here and there, I planned the following: introduce `artifact_download_methods`, see https://github.com/crocs-muni/sec-certs/blob/9433658b55c6025567fe4101dae017afc9b4f8a0/sec_certs/dataset/fips.py#L57-L58

This comes with some advantages: currently, there are `_download_policies()` and `_download_modules()` (https://github.com/crocs-muni/sec-certs/blob/9433658b55c6025567fe4101dae017afc9b4f8a0/sec_certs/dataset/fips.py#L94-L108, https://github.com/crocs-muni/sec-certs/blob/9433658b55c6025567fe4101dae017afc9b4f8a0/sec_certs/dataset/fips.py#L78-L92) that are very similar, with just one variable changing in multiple places plus some logging; the same problem applies to the methods executed on the individual samples.

Some disadvantages that cross my mind: first, it relies on `getattr("lame string of some variable")`. Second, if we break the pattern, we would need to introduce a new design... Now's the time to speak up if you have comments/tips/whatever :).
Interesting approach but I am worried about a few things:
I guess I am really not sure what problem you are trying to solve by passing the callables. I will have to look at the code.
Looking at the code a bit. I think now I would just match the FIPS Dataset API to the CC Dataset API manually, even at the cost of some code duplication. I don't feel like it is a good time to rework the handling like this during this FIPS unification.
Ah yes, let's do this: https://refactoring.guru/design-patterns/template-method and not pass callables around. Just call the methods directly; even if there might be some duplication, I think it makes sense for future flexibility.
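The template-method shape for the two near-identical download methods could look roughly like this. This is only a sketch; the class and method names below (other than the pattern itself) are illustrative, not the actual sec-certs API:

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Dataset(ABC):
    """Base class fixing the download skeleton (the 'template method')."""

    @abstractmethod
    def _download_artifact(self, cert: str) -> str:
        """Subclass hook: fetch a single artifact for one certificate."""

    def download_all_artifacts(self, certs: list[str]) -> list[str]:
        # The invariant part (looping, and in real code logging/retries)
        # lives here once, instead of being copy-pasted into
        # _download_policies() / _download_modules().
        return [self._download_artifact(cert) for cert in certs]


class PolicyDataset(Dataset):
    def _download_artifact(self, cert: str) -> str:
        return f"policy for {cert}"  # real code would hit the FIPS site


print(PolicyDataset().download_all_artifacts(["cert-1", "cert-2"]))
# -> ['policy for cert-1', 'policy for cert-2']
```

This avoids both the `getattr("...")` string dispatch and passing callables around: each subclass overrides one well-named hook, and the shared skeleton stays in one place.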
I was thinking about the logic of the download testing. At the moment, the situation is as follows: the download test calls `pytest.xfail` if the download did not succeed, so a broken download only shows up as an `xfail`. The problem with the tested functionality is discovered only on repeated failures.
@J08nY Some resolution of the problem that you've pointed to is in https://github.com/crocs-muni/sec-certs/pull/275/commits/90521384dc416b8dfb22f12b27e22e2f14ad50d6
@J08nY Do you have any idea how the cert_id normalization in FIPS works? I basically finished the refactoring, apart from:
IMO, the ID of the certificate itself is fully determined by the contents from the module html page. What needs to get cleaned are various cert_ids detected as keywords in various sources: caveat text, pdf of the policy, etc. These point to other certificates, but are noisy and may point to Algorithms, or may just represent arbitrary numbers, right?
So the goal, prior to constructing the graph of references, would be to prune the foreign cert_ids that are detected in `FIPSCertificate`. Does this make sense? Formally, there's no cert_id normalization or anything like that, you just delete detected foreign cert_ids that you consider weak in a sense.
Also, I'll probably try to convince @GeorgeFI to take a look at building a graph of references, and also the transitive vulnerabilities. He did this for CC, so it makes sense to let him do this also here.
What do you think?
> IMO, the ID of the certificate itself is fully determined by the contents from the module html page. What needs to get cleaned are various cert_ids detected as keywords in various sources: caveat text, pdf of the policy, etc. These point to other certificates, but are noisy and may point to Algorithms, or may just represent arbitrary numbers, right?
Yes, this is what was done in the original code in a way. The list of "certlike matches" was cleaned by removing the algorithm numbers the certificate references, either from the webpage or from the policy in some table (that has "algo" in the header :))
> So the goal, prior to constructing the graph of references, would be to prune the foreign cert_ids that are detected in `FIPSCertificate`. Does this make sense? Formally, there's no cert_id normalization or anything like that, you just delete detected foreign cert_ids that you consider weak in a sense.
Yes this is what was done and is I think important to keep it that way, otherwise the reference graph has way too many false positives.
> Also, I'll probably try to convince @GeorgeFI to take a look at building a graph of references, and also the transitive vulnerabilities. He did this for CC, so it makes sense to let him do this also here.
> What do you think?
I don't think this is necessary, you can basically use `ReferenceFinder` the same way as it is (and as it is in CC), I actually sort of unified this part already, you just need to provide it with a source of cert_id references for a given graph. Originally there were two sources (and I think it makes sense to keep them), one from the page and one from the security policy. W.r.t. the transitive vulns computation, that should also be doable just with some renames to point to the correct clean cert_ids.
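The pruning step discussed above might be sketched as follows. The helper below is hypothetical (it is not the actual `ReferenceFinder` API); it only illustrates dropping cert-id-like matches that collide with algorithm numbers, mirroring what the original code did with the "algo" tables:

```python
def prune_foreign_cert_ids(detected_ids: set, algorithm_ids: set) -> set:
    # Keep only cert-id-like matches that do not collide with algorithm
    # numbers referenced by the certificate (from the module web page or
    # from tables in the security policy). This reduces false edges in
    # the reference graph.
    return {cert_id for cert_id in detected_ids if cert_id not in algorithm_ids}


detected = {"3928", "1234", "2057"}   # cert-id-like keyword matches
algorithms = {"1234"}                 # e.g. an AES algorithm certificate number
print(sorted(prune_foreign_cert_ids(detected, algorithms)))
# -> ['2057', '3928']
```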
@J08nY Ok, this is ready for review I suppose. Once we merge this, I intend to do #287, #231 and then release.
I checked that both CC and FIPS pipeline pass, tests pass, and notebooks are all executable and are producing "not-completely-off" plots. :D
Enjoy the review and don't be too harsh 🤞
The pyupgrade was done partially, or rather, some files are missing the `from __future__ import annotations` which enables the Python 3.9-style annotations on Python >= 3.7, which we need as we target 3.8. See the pyupgrade readme and PEP 585.
I see, this is because `pyupgrade` only works on files that have `from __future__ import annotations` at the top. I've added a flake8 plugin that checks for its absence so that this doesn't repeat in the future.
I've added these import statements and upgraded the rest of the codebase with https://github.com/crocs-muni/sec-certs/pull/275/commits/6ce7007ab5b5f721abf6d8659ef4bd54e4f61590
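For illustration, the `__future__` import makes the PEP 585 built-in generics usable in annotations even on Python 3.7/3.8, because PEP 563 keeps annotations as unevaluated strings (the function below is just a toy example):

```python
from __future__ import annotations


def count_algorithms(ids: list[str]) -> dict[str, int]:
    # Without the future import, list[str] / dict[str, int] in the
    # signature would raise TypeError at import time on Python 3.8;
    # with it, the annotations are stored as strings and never evaluated.
    counts: dict[str, int] = {}
    for algorithm_id in ids:
        counts[algorithm_id] = counts.get(algorithm_id, 0) + 1
    return counts


print(count_algorithms(["AES", "SHA", "AES"]))
# -> {'AES': 2, 'SHA': 1}
```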
It seems the whole PP dataset was included in the test data, was this intended? It shows up at almost 77k lines. I believe just downloading it from our site is a reasonable alternative. Although as it is already in git, removing it now is pointless (without a rebase and history editing the repository will have it forever).
Oh this hurts. It was added by accident. I'll probably rewrite the history...
Connected with the root_dir handling and serialization is the DUMMY_PATH issue. I see the current solution as an acceptable one, although I think there is room for improvement, we just need to identify the core issue in the API design and assumptions to see what needs to change. I think the current situation arises because our assumptions in different places in the code are not consistent or clear, specifically assumptions about whether a dataset object is backed by its JSON (directory structure) at different points in time as it is constructed from different sources.
You're perfectly right about the cause.
One possible way to resolve this is to say that the dataset object is always backed by its directory structure/JSON and explicitly require the path in all constructors, even the web ones. Another possible way is the opposite: not to assume the dataset is always backed, and then disable the serialization features unless a path is set.
I was actually opting for the second, which is kind of the current state: the serialization will fail unless a path is set. It's just implemented with that `DUMMY_NONEXISTING_PATH` variable...
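A sketch of what the explicit guard could look like without the placeholder path. The names are illustrative, not the actual sec-certs implementation:

```python
from __future__ import annotations

import json
from pathlib import Path


class Dataset:
    """Variant two: a dataset may be in-memory only, serialization is gated."""

    def __init__(self, root_dir: Path | None = None) -> None:
        # None means "not backed by a directory", replacing the
        # DUMMY_NONEXISTING_PATH sentinel with an explicit state.
        self._root_dir = root_dir

    def to_json(self) -> None:
        if self._root_dir is None:
            raise ValueError("Dataset is not backed by a directory; set a path first.")
        (self._root_dir / "dataset.json").write_text(json.dumps({}))
```

The advantage over a dummy path is that the failure mode is an explicit, documented exception rather than an attempt to write to a path that does not exist.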
I dislike that the root_dir property setter on dataset classes actually serializes the dataset. I think this is highly unintuitive and unexpected. I propose a more explicit solution in a comment in the review. The TL;DR; is to create a different method for moving/copying the dataset and make it the only method to change the root_dir.
I agree this can be counterintuitive. I'll take a look at that, thanks for spotting this.
High-level plan
* `pytest`: separate download tests from others, refactor test structure (split into cc/fips/others) (#76, #79)
Tests

* `CPEDataset` and `CVEDataset`
* Or forbid it and demand explicit paths (that will not get serialized omg). Either way, act consistently.
* Test download on resources that we trust (e.g. our server)
* Forbid download on non-download tests by an auto-use fixture that raises an exception on `requests.get`/`requests.post`
FIPS
Processing steps progress:
* `get_certs_from_web()`
* `process_auxillary_datasets()`
* `download_all_artifacts()`
* `convert_all_pdfs()`
* `analyze_certificates()`
* `_compute_references()`
* `_compute_transitive_vulnerabilities()`
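Put together, the steps above run in sequence; the driver below only illustrates the ordering (a dummy recorder stands in for the real `FIPSDataset`, and it is assumed, per the method names, that `analyze_certificates()` internally invokes `_compute_references()` and `_compute_transitive_vulnerabilities()`):

```python
def run_pipeline(dset) -> None:
    # Sequencing of the processing steps listed above.
    dset.get_certs_from_web()
    dset.process_auxillary_datasets()
    dset.download_all_artifacts()
    dset.convert_all_pdfs()
    dset.analyze_certificates()


class RecorderDataset:
    """Dummy stand-in that records the order in which steps were called."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        return lambda: self.calls.append(name)


dset = RecorderDataset()
run_pipeline(dset)
print(dset.calls)
# -> ['get_certs_from_web', 'process_auxillary_datasets',
#     'download_all_artifacts', 'convert_all_pdfs', 'analyze_certificates']
```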
Random notes to do there:

* Rename `redo` -> `fresh`, implement re-attempting computations on failed samples
* `FIPSCertificate` and to `FIPSDataset`
Done outside of the processing steps:

- [ ] Add examples of plot graph into FIPS reference notebook
- [ ] Verify if FIPS webpages are loading on scroll (historical modules). If yes, figure out how to force full load with BeautifulSoup
- [ ] `FIPSDataset` contents
- [ ] `FIPSAlgorithmDataset` should reside in own json in auxillary datasets

Misc
- [ ] `fresh` on places where it does not make sense
- [ ] Sanity check
Still part of refactoring, but to-be-addressed by separate PRs:

- [ ] `from __future__ import annotations`