crocs-muni / sec-certs

Tool for analysis of security certificates and their security targets (Common Criteria, NIST FIPS 140-2, ...).
https://sec-certs.org
MIT License

CCportal URL change #420

Closed J08nY closed 2 months ago

J08nY commented 3 months ago

Common Criteria portal recently moved from artifact urls of the form:

https://www.commoncriteriaportal.org/files/epfiles/2019-44-INF-3227.pdf

to the form

https://www.commoncriteriaportal.org/nfs/ccpfiles/files/epfiles/2019-44-INF-3227.pdf

Subsequently, this broke our certificate digest, as it was computed as a hash of some metadata that included the cert_link. My fix in https://github.com/crocs-muni/sec-certs/commit/b333fd24dcbd1e4aa08584343cfb5b9784364a03 was to strip the new part of the URL, as the old links still seemed to be valid. This fixed the problem without wreaking havoc on the server database, which would have meant losing the certificates' history.

However, it turns out that new certificates have links of the new form, and the old-form links do not work for their artifacts. For example:

(works) https://www.commoncriteriaportal.org/nfs/ccpfiles/files/epfiles/NSCIB-CC-2300182-01-CR2.pdf
(doesn't work) https://www.commoncriteriaportal.org/files/epfiles/NSCIB-CC-2300182-01-CR2.pdf

I do not have time to fix this at the moment, nor do I have a good idea for a fix.

Maybe we could keep the URLs as-is in the metadata but convert them to the old format for the digest computation. I do not think there would be conflicts between the new and old formats.
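A minimal sketch of that idea in Python (the helper and field names here are hypothetical; the real sec-certs digest code is structured differently): the stored metadata keeps whatever URL the portal serves, and only the copy fed into the hash is normalized to the old form.

```python
import hashlib
from typing import Optional

# Hypothetical prefixes, based on the example URLs above.
NEW_PREFIX = "https://www.commoncriteriaportal.org/nfs/ccpfiles/files/"
OLD_PREFIX = "https://www.commoncriteriaportal.org/files/"


def canonicalize_for_digest(url: Optional[str]) -> Optional[str]:
    """Map a new-style portal URL back to the old form, for hashing only."""
    if url is None:
        return None
    return url.replace(NEW_PREFIX, OLD_PREFIX)


def compute_digest(metadata: dict) -> str:
    """Hash the metadata with cert_link normalized; stored data is untouched."""
    canonical = dict(metadata, cert_link=canonicalize_for_digest(metadata.get("cert_link")))
    blob = repr(sorted(canonical.items())).encode()
    return hashlib.sha256(blob).hexdigest()[:16]
```

With something like this, a certificate whose link silently moves between the two forms keeps its digest, while genuinely different links still produce different digests.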

adamjanovsky commented 3 months ago

Hm, in retrospect, it might not have been the greatest idea to include a volatile attribute in the digest...

Your proposal would work for this situation, but not for a generic link change that we might face in the future. What would ditching the link from the digest involve? Could we retrospectively compute the digests with a new method on the old data? Also, we would likely lose some functionality for the sec-certs.org URLs with digests hardcoded in papers, unless we introduce some backward compatibility...

J08nY commented 3 months ago

> Hm, in retrospect, it might not have been the greatest idea to include a volatile attribute in the digest...

I mean, we did not know it to be volatile right?

> Your proposal would work for this situation, but not for a generic link change that we might face in the future. What would ditching the link from the digest involve? Could we retrospectively compute the digests with a new method on the old data? Also, we would likely lose some functionality for the sec-certs.org URLs with digests hardcoded in papers, unless we introduce some backward compatibility...

We could ditch the link from the digest and do some dataset magic on the web. In fact, there is already a translation layer for old FIPS dgsts, as those changed at some point. But I wonder whether it would bring collisions; I think the cert link is there for a reason.
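The translation layer mentioned here could look roughly like this (a sketch with made-up digest values; the actual FIPS layer in sec-certs may be implemented differently):

```python
# Mapping from retired digests to their current counterparts, consulted
# whenever an old-style dgst arrives (e.g. from a URL hardcoded in a paper).
# The values below are made-up examples, not real sec-certs digests.
OLD_TO_NEW = {
    "8a5edab282632443": "f3c1b2a9d4e5f6a7",
}


def resolve_digest(dgst: str) -> str:
    """Return the current digest for a possibly-retired one."""
    return OLD_TO_NEW.get(dgst, dgst)
```

A web handler could then redirect requests carrying an old dgst to the current entry instead of returning a 404.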

Also, I am thinking about clearing the certificate history on the web anyway. It is polluted with entries due to changes in our methodology or issues with the CCweb and has very little actual content. What do you think? This would mean there would just be a fresh run on the web with new dgsts and a translation layer for the old ones.

Replying from the Alpe-Adria bike path :)

adamjanovsky commented 3 months ago

> But I wonder whether it would bring collisions; I think the cert link is there for a reason.

I think there were indeed some collisions, due to recertifications, etc.

> Also, I am thinking about clearing the certificate history on the web anyway. It is polluted with entries due to changes in our methodology or issues with the CCweb and has very little actual content. What do you think? This would mean there would just be a fresh run on the web with new dgsts and a translation layer for the old ones.

I'm not against it. However, shouldn't we check for any juicy data before we do so? Or at least back up the list of changes.

Replying from a cottage office in the highlands :). Enjoy your stay šŸš².


J08nY commented 3 months ago

> But I wonder whether it would bring collisions; I think the cert link is there for a reason.

> I think there were indeed some collisions, due to recertifications, etc.

Is there anything else that we would want to include in the digest then?

> Also, I am thinking about clearing the certificate history on the web anyway. It is polluted with entries due to changes in our methodology or issues with the CCweb and has very little actual content. What do you think? This would mean there would just be a fresh run on the web with new dgsts and a translation layer for the old ones.

> I'm not against it. However, shouldn't we check for any juicy data before we do so? Or at least back up the list of changes.

Hmm, we can definitely back it up. That is not a problem: it can be loaded into MongoDB, and we can come back to it any time we choose. Going through it, though, would be quite hard, as there is too much noise from our tool changes and few actual changes. I mean, when does the data ever change? We may be able to look at the artifact hashes and only analyze the changes that modified those (but not when they went missing or came back), as those actually mean something changed in the source data. We do not have such a hash for the CSV data though.
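The filtering idea could be sketched like this (the record shape is assumed; the real change entries on the web surely look different): keep only entries where an artifact hash changed between consecutive snapshots, ignoring hashes that merely disappeared or reappeared.

```python
def meaningful_changes(history: list) -> list:
    """Keep entries whose report hash differs from the previous snapshot's.

    Entries where the hash went missing or came back (falsy on either
    side) are treated as noise, per the discussion above.
    """
    out = []
    for prev, curr in zip(history, history[1:]):
        old_h, new_h = prev.get("report_hash"), curr.get("report_hash")
        if old_h and new_h and old_h != new_h:
            out.append(curr)
    return out
```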

So I propose: let's figure out a new digest, back up the old change data, do a fresh run on the web, and have a translation layer for old dgsts on the web. That way we can analyze the backed-up change data as I propose above, but the web is clean.

> Replying from a cottage office in the highlands :). Enjoy your stay šŸš².

Nice!

adamjanovsky commented 2 months ago

> So I propose: let's figure out a new digest, back up the old change data, do a fresh run on the web, and have a translation layer for old dgsts on the web. That way we can analyze the backed-up change data as I propose above, but the web is clean.

This makes sense. Regarding the analysis of changes, I think it makes sense to look at changed PDF files, which we can detect via hashes.

About the digest contents, it's hard for me to tell right now. Ideally, we would build the digest out of the PDF documents, as they ultimately define the identity of the certified product. The problem is that we cannot defer digest construction to the point when we obtain the PDFs. I think I attempted to approximate this by including the ST download link in the digest.

šŸ¤·ā€ā™‚ļø

J08nY commented 2 months ago

> About the digest contents, it's hard for me to tell right now. Ideally, we would build the digest out of the PDF documents, as they ultimately define the identity of the certified product. The problem is that we cannot defer digest construction to the point when we obtain the PDFs. I think I attempted to approximate this by including the ST download link in the digest.
>
> šŸ¤·ā€ā™‚ļø

Could we include just the filename parts of the cert, report, and ST links? Those should be somewhat robust.
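A sketch of that filename-only digest (function and parameter names are mine, not the sec-certs API): hashing just the final path component of each artifact link would make the digest survive host or directory moves on the CC portal.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse


def link_basename(url):
    """Extract the final path component of a URL, or "" when absent."""
    if not url:
        return ""
    return PurePosixPath(urlparse(url).path).name


def digest_from_links(cert_link, report_link, st_link):
    """Digest built only from the filename parts of the artifact links."""
    parts = "|".join(link_basename(u) for u in (cert_link, report_link, st_link))
    return hashlib.sha256(parts.encode()).hexdigest()[:16]
```

Both URL forms from above then hash identically, since only `epfiles/<name>.pdf`'s final component enters the digest.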

adamjanovsky commented 2 months ago

Yeah, that could help šŸ‘