Closed Pratikrocks closed 3 years ago
@pombredanne, please look into this too.
@Pratikrocks @majurg sorry for the late reply. I am still thinking about this one because "fingerprint" feels too generic as a name.
Either use fingerprints and make it a list of checksums and fingerprints with a prefix, such as ['sha1-23462348623486', 'dcfp1-45345345345345'],
OR use fingerprints and make it a mapping of checksum and fingerprint name/value pairs, such as {'sha1': '23462348623486', 'dcfp1': '45345345345345'}.
@pombredanne we already have an attribute for the sha in VirtualCodebase, so IMO if we also fold sha into the fingerprint, it would be somewhat redundant.
@pombredanne I think the field should be a mapping of {checksum name: checksum value}. This makes it easier to store and retrieve multiple fingerprint values in a sane way. I don't think it is efficient to keep the fingerprints in a list, where we have to parse each fingerprint string for its type and then strip the type before we can use the fingerprint value. Also, the new fingerprint attribute should be renamed fingerprints to reflect that there could be multiple types of fingerprints for a file.
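To make the difference concrete, here is a minimal sketch of looking up one fingerprint in each of the two proposed shapes. The helper names get_from_list and get_from_mapping are hypothetical and not part of ScanCode or DeltaCode; the values are the sample ones used earlier in this thread.

```python
# Sample values from this thread; both shapes carry the same information.
as_list = ['sha1-23462348623486', 'dcfp1-45345345345345']
as_mapping = {'sha1': '23462348623486', 'dcfp1': '45345345345345'}


def get_from_list(fingerprints, name):
    """Return the value for `name` from a list of 'name-value' strings, or None."""
    for entry in fingerprints:
        # Split on the first '-' only, so a value containing '-' stays intact.
        kind, _, value = entry.partition('-')
        if kind == name:
            return value
    return None


def get_from_mapping(fingerprints, name):
    """Return the value for `name` from a name -> value mapping, or None."""
    return fingerprints.get(name)
```

With the list, every lookup scans and parses the entries; with the mapping, it is a single key access and no prefix has to be stripped.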
@JonoYang currently we are dealing with a single fingerprint (the one from the fingerprint plugin). When this plugin is used, only a single fingerprint is ever generated for a file.
@JonoYang I like the switch to "fingerprints" as a mapping.
@Pratikrocks we will eventually deprecate sha1/md5 and so on and move these under fingerprints. There will surely be other fingerprints too, so we will not have a single attribute there. Also, the "fingerprint" used in DeltaCode needs to be given a unique and distinctive name.
I kinda see checksums as a case of fingerprints
See https://en.wikipedia.org/wiki/Fingerprint_(computing) and https://csrc.nist.gov/glossary/term/Digital_Fingerprint
@pombredanne, the fingerprint we currently have uses the SimHash algorithm, and it is generated only as a plugin in Scancode. The sha1/md5 checksums have their own algorithms to generate the hash. And we are using the fingerprint plugin for the similarity calculations.
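As a rough illustration of why the two kinds of value serve different purposes, here is a toy SimHash next to a plain sha1. This is only a sketch under stated assumptions: it tokenizes on whitespace and derives token hashes from md5, which is not necessarily what the fingerprint plugin does, and the function names are hypothetical.

```python
import hashlib


def sha1_hex(text):
    """Exact-match checksum: any change to the text gives a different digest."""
    return hashlib.sha1(text.encode('utf-8')).hexdigest()


def simhash(text, bits=64):
    """Toy SimHash over whitespace tokens (assumes bits is a multiple of 8, <= 128)."""
    weights = [0] * bits
    for token in text.split():
        digest = hashlib.md5(token.encode('utf-8')).digest()
        token_hash = int.from_bytes(digest[:bits // 8], 'big')
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count('1')


a = 'the quick brown fox jumps over the lazy dog'
b = 'the quick brown fox jumps over the lazy cat'
print(sha1_hex(a) == sha1_hex(b))                # False: the checksums differ
print(hamming_distance(simhash(a), simhash(b)))  # bit distance used for similarity
```

A checksum only answers "are these two files identical?", while the Hamming distance between two SimHash values gives a graded similarity, which is what the similarity calculations need.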
@Pratikrocks re:
"the fingerprint which we are having currently is using SimHash Algorithm, and its generated only as a plugin in Scancode. And the sha1/md5 has its own algorithm to generate the hash. And we are using the fingerprint plugin for the similarity calculations."
I get this. I am just saying that fingerprint is too generic as a term and at the same time super-specific to a plugin, and therefore I would not want to add this as a standard resource attribute. It can be a plugin-contributed attribute alright, but that still makes it a name that is too generic.
Overall I would prefer that we change the API and store checksums as a list of name/value pairs, and that we find a good name for the deltacode "fingerprint", maybe something like a deltasim1 or something TBD.
Yes @pombredanne I get your point :)
Signed-off-by: Pratik Dey <pratikrocks.dey11@gmail.com>
Issue: #12
VirtualCodebase can now scan files that have a fingerprint attribute.
Gentle ping: @majurg @pombredanne