aboutcode-org / commoncode

A library of common functions shared in many other AboutCode projects

fingerprint attribute in VirtualCodebase #15

Closed Pratikrocks closed 3 years ago

Pratikrocks commented 3 years ago

Signed-off-by: Pratik Dey <pratikrocks.dey11@gmail.com>

Issue: #12

VirtualCodebase can now handle scanned files that carry a fingerprint attribute. Gentle ping: @majurg @pombredanne

Pratikrocks commented 3 years ago

@pombredanne , please look into this too

pombredanne commented 3 years ago

@Pratikrocks @majurg sorry for the late reply. I am still thinking about this one because "fingerprint" feels too generic as a name.

Pratikrocks commented 3 years ago

OR use fingerprints and make it a list of checksums and fingerprints with a prefix such as ['sha1-23462348623486', 'dcfp1-45345345345345']

OR use fingerprints and make it a mapping of checksums and fingerprints name/value pairs such as {'sha1': '23462348623486', 'dcfp1': '45345345345345'}

@pombredanne we already have an attribute for the sha in VirtualCodebase, so IMO folding the sha into the fingerprint as well would be redundant.

JonoYang commented 3 years ago

@pombredanne I think the field should be a mapping of {checksum name: checksum value}. This makes it easier to store and retrieve multiple fingerprint values in a sane way. I don't think it is efficient to have the fingerprints in a list where we have to parse the fingerprint string for the fingerprint type, then remove the type before we can use the fingerprint value. Also, the new fingerprint attribute should be renamed fingerprints to reflect that there could be multiple types of fingerprints for a file.

Pratikrocks commented 3 years ago

@JonoYang currently we are dealing with a single fingerprint (the one from the fingerprint plugin). Using this plugin always generates exactly one fingerprint per file.

pombredanne commented 3 years ago

@JonoYang I like the switch to "fingerprints" as a mapping.

@Pratikrocks we will eventually deprecate sha1/md5 and so on and move these under fingerprints. There will surely be other fingerprints too, so there will not be just a single attribute there. Also, the "fingerprint" used in DeltaCode needs to be given a unique and distinctive name.

I kinda see checksums as a case of fingerprints

See https://en.wikipedia.org/wiki/Fingerprint_(computing) and https://csrc.nist.gov/glossary/term/Digital_Fingerprint

Pratikrocks commented 3 years ago

@pombredanne, the fingerprint we currently have uses the SimHash algorithm, and it is generated only by a plugin in ScanCode. sha1/md5 each have their own algorithm to generate the hash. We use the fingerprint plugin for similarity calculations.
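For readers unfamiliar with SimHash, the following is a generic, minimal sketch of the technique, not the code of the ScanCode fingerprint plugin. The tokenization, the use of MD5 as the per-token hash, and the 64-bit width are all assumptions made for illustration.

```python
import hashlib


def simhash(tokens, bits=64):
    """
    Return a `bits`-wide SimHash of an iterable of string tokens.
    Assumption: `bits` is a multiple of 8 and at most 128 (the MD5 digest size).
    """
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).digest()
        token_hash = int.from_bytes(digest[:bits // 8], 'big')
        # Each token votes +1 for every bit it sets and -1 for every bit it clears.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1

    # A bit is set in the fingerprint when the votes for it are positive.
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count('1')
```

Unlike sha1/md5, where any change to the input yields an unrelated digest, two inputs that share most of their tokens tend to produce SimHash values with a small Hamming distance, which is what makes it usable for similarity calculations.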

pombredanne commented 3 years ago

@Pratikrocks re:

the fingerprint we currently have uses the SimHash algorithm, and it is generated only by a plugin in ScanCode. sha1/md5 each have their own algorithm to generate the hash. We use the fingerprint plugin for similarity calculations.

I get this. I am just saying that "fingerprint" is too generic as a term and at the same time super-specific to a plugin, and therefore I would not want to add it as a standard resource attribute. It can be a plugin-contributed attribute alright, but that still makes it a name that is too generic.

Overall I would rather we change the API and store checksums as a list of name/value pairs, and that we find a good name for the DeltaCode "fingerprint", maybe something like deltasim1 or something TBD.
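As a rough sketch of what that API change could look like, assuming a resource is a plain dict with flat sha1/md5 keys: `to_checksum_pairs` and the sample resource are hypothetical and do not reflect the actual commoncode Resource API, and "name/value pairs" is read here as a list of 2-tuples.

```python
def to_checksum_pairs(resource, names=('sha1', 'md5')):
    """
    Hypothetical helper: collect the flat checksum attributes of a
    resource dict into a list of (name, value) pairs, skipping empty ones.
    """
    return [(name, resource[name]) for name in names if resource.get(name)]


# Sample resource for illustration only; the sha1 value is the one used in this thread.
resource = {'path': 'src/a.py', 'sha1': '23462348623486', 'md5': None}
pairs = to_checksum_pairs(resource)

# A list of pairs converts to a mapping when keyed lookups are needed.
by_name = dict(pairs)
```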

Pratikrocks commented 3 years ago

Yes @pombredanne I get your point :)