Closed (J08nY closed this 1 year ago)
Can we search for the cert ID in the filenames of the PDF documents? I think we change their filenames, but we should keep track of the original ones. This was done previously by petrs. I guess he was doing it for a reason; maybe it could identify some missing cert IDs. Besides, in the methodology we claim that we do this :D
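To make the idea concrete, here is a minimal sketch of pulling a cert ID candidate out of the original PDF filename. It is not the project's implementation: the two regexes, the helper names `original_filename` and `cert_id_candidate`, and the example URL are all assumptions made for illustration, and the real patterns would need to cover every scheme.

```python
import re
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import unquote, urlparse

# Hypothetical, illustrative patterns only; the real matching rules may differ.
FILENAME_PATTERNS = [
    re.compile(r"BSI-DSZ-CC-\d{4}(?:-V\d+)?-\d{4}", re.IGNORECASE),
    re.compile(r"ANSSI-CC-\d{4}[-_]\d{2,3}", re.IGNORECASE),
]


def original_filename(url: str) -> str:
    """Return the last path component of the download URL, percent-decoded."""
    return PurePosixPath(unquote(urlparse(url).path)).name


def cert_id_candidate(filename: str) -> Optional[str]:
    """Return the first pattern match in the filename, upper-cased, or None."""
    for pattern in FILENAME_PATTERNS:
        match = pattern.search(filename)
        if match:
            return match.group(0).upper()
    return None
```

The point of `original_filename` is that the candidate has to be extracted before the file is renamed on disk, so the original name (or the URL it came from) needs to be stored alongside the certificate.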
PR #258 did most of the work here (the part that made sense), check it out for more details.
Currently, we have a bunch of cert_id duplicates and a bunch of certificates without a matched cert_id.
Duplicates
Current list: https://gist.github.com/J08nY/d03714234198d16a41aa931e956ee647
MongoDB command:
db.cc.aggregate([{$group: {_id: "$heuristics.cert_id", count: {$sum: 1}}}, {$sort: {count: -1}}])
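For anyone without database access, the grouping above can be approximated in memory. This is only a sketch under assumptions: it expects the certificates as plain dicts shaped like the MongoDB documents, and `duplicate_cert_ids` is a hypothetical helper name. Unlike the aggregation, it drops the `None` group and IDs that occur only once, so only actual duplicates remain.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def duplicate_cert_ids(docs: Iterable[dict]) -> List[Tuple[str, int]]:
    """Count certificates per heuristics.cert_id and keep IDs seen more than once."""
    counts: Counter = Counter()
    for doc in docs:
        # A missing "heuristics" or "cert_id" key is treated like a null cert_id.
        cert_id: Optional[str] = (doc.get("heuristics") or {}).get("cert_id")
        counts[cert_id] += 1
    # most_common() sorts by count in descending order, like {$sort: {count: -1}}.
    return [(cid, n) for cid, n in counts.most_common() if cid is not None and n > 1]
```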
scheme != FR
).valid_from. This needs some verification that this will not remove valid cert_ids.
Without matches
Current list: https://gist.github.com/J08nY/9162e89a60f7381a381d1c1c4e1341c4
MongoDB command:
db.cc.find({"heuristics.cert_id": null}, {_id: 1})
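Note that in MongoDB a `{field: null}` filter matches documents where the field is null as well as documents where it is missing entirely. A rough in-memory equivalent, assuming the certificates are available as plain dicts shaped like the stored documents (`ids_without_cert_id` is a hypothetical helper name), could look like this:

```python
from typing import Iterable, List


def ids_without_cert_id(docs: Iterable[dict]) -> List[str]:
    """Return the _id of every document whose heuristics.cert_id is null or absent."""
    return [
        doc["_id"]
        for doc in docs
        if (doc.get("heuristics") or {}).get("cert_id") is None
    ]
```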