Some schemes publish more information on their pages than we currently collect:
[x] Australia: We can follow the link for products in evaluation and extract additional fields there.
[x] France: We can follow the link and extract additional fields. We can also download the documents from there and compare them against those from the CC portal.
[x] Germany: We can follow the link and extract additional fields. We can also download the documents from there and compare them against those from the CC portal.
[x] Japan: We can follow the link and extract additional fields. We can also download the documents from there and compare them against those from the CC portal. This can also be done for the in-evaluation list.
[x] Norway: We can follow the link and extract additional fields. We can also download the documents from there and compare them against those from the CC portal. This can also be done for the archived list.
[x] Korea: We can follow the link and extract additional fields. We can also download the documents from there and compare them against those from the CC portal. This can also be done for the archived list.
[x] Sweden: We can follow the link and extract additional fields. We can also download the documents from there and compare them against those from the CC portal. This can also be done for the archived and in-evaluation lists.
[x] USA: We can follow the link and extract additional fields. We can also download the documents from there and compare them against those from the CC portal. The product pages contain additional documents with details of the evaluation process; ideally these should have their own processing pipeline.
[x] Netherlands
[x] Singapore
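Comparing a document downloaded from a scheme site against the CC portal copy could be done by hashing the raw bytes. This is a minimal sketch, assuming both files have already been downloaded into memory; the function names are hypothetical and not taken from the existing codebase.

```python
import hashlib


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def documents_match(scheme_doc: bytes, portal_doc: bytes) -> bool:
    """True if the scheme copy and the CC portal copy are byte-identical."""
    return sha256_of_bytes(scheme_doc) == sha256_of_bytes(portal_doc)
```

A byte-level hash only detects identical files. Two PDFs with the same content but different metadata would compare as different, so a mismatch here is a hint to look closer, not proof that the documents differ in substance.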
Also, we should create a mapping between these collected entries and the main CC portal dataset entries. The matching could be done as follows:
Try to match on canonicalized cert_ids if the scheme page has them. We may also be able to extract them from the documents, if the scheme provides them, or at least from the filenames.
Maybe try to match on document hashes if the scheme has them.
Match on product name + category + vendor?
Give up matching?
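The fallback order above could be sketched roughly as follows. Everything here is an assumption made for illustration: the `Entry` fields, the `canonicalize_cert_id` rule (uppercase, alphanumerics only) and the exact-match comparison on name, category and vendor are placeholders, not the project's actual data model or canonicalization.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    # Hypothetical record shape shared by scheme and CC portal entries.
    name: str
    vendor: str
    category: str
    cert_id: Optional[str] = None
    doc_hash: Optional[str] = None


def canonicalize_cert_id(cert_id: str) -> str:
    # Illustrative rule only: uppercase and keep just letters and digits.
    return re.sub(r"[^A-Z0-9]", "", cert_id.upper())


def _norm(text: str) -> str:
    # Lowercase and collapse whitespace before comparing free text.
    return " ".join(text.lower().split())


def match_entry(
    scheme_entry: Entry, portal_entries: List[Entry]
) -> Tuple[Optional[Entry], str]:
    """Return (portal entry, stage that matched) or (None, "unmatched")."""
    # Stage 1: canonicalized cert_id.
    if scheme_entry.cert_id:
        key = canonicalize_cert_id(scheme_entry.cert_id)
        for p in portal_entries:
            if p.cert_id and canonicalize_cert_id(p.cert_id) == key:
                return p, "cert_id"
    # Stage 2: document hash.
    if scheme_entry.doc_hash:
        for p in portal_entries:
            if p.doc_hash == scheme_entry.doc_hash:
                return p, "doc_hash"
    # Stage 3: product name + category + vendor.
    triple = (_norm(scheme_entry.name), _norm(scheme_entry.category),
              _norm(scheme_entry.vendor))
    for p in portal_entries:
        if (_norm(p.name), _norm(p.category), _norm(p.vendor)) == triple:
            return p, "name"
    # Stage 4: give up.
    return None, "unmatched"
```

Returning the stage alongside the match makes it possible to see later which stage produced which matches, which is useful when judging how trustworthy each stage is.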
After we have this we should evaluate how good the matching is.
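One possible way to evaluate the matching is to hand-label a small sample and compute precision and recall against it. This is only a sketch under that assumption; the function name and the dictionary shape (scheme entry id mapped to portal entry id, or `None` for no match) are hypothetical.

```python
from typing import Dict, Optional


def evaluate_matching(
    predicted: Dict[str, Optional[str]],
    labeled: Dict[str, Optional[str]],
) -> Dict[str, float]:
    """Compare predicted matches with a hand-labeled sample.

    Both dicts map a scheme entry id to a portal entry id, or None
    when there is no match. Only ids present in `labeled` are scored.
    """
    predicted_matches = 0  # entries where the matcher proposed a match
    true_matches = 0       # entries that really have a portal counterpart
    correct = 0            # proposed matches that agree with the label
    for scheme_id, truth in labeled.items():
        guess = predicted.get(scheme_id)
        if guess is not None:
            predicted_matches += 1
        if truth is not None:
            true_matches += 1
        if guess is not None and guess == truth:
            correct += 1
    precision = correct / predicted_matches if predicted_matches else 0.0
    recall = correct / true_matches if true_matches else 0.0
    return {"precision": precision, "recall": recall}
```

Running this separately for each scheme would show where the matching is weakest.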