Fuzzy Hashes - Githubissues

Amerlander commented 3 years ago

In most test cases we do not need a perfect md5 hash match. It mostly does not matter if there is one line more or less. Switching the hashing from md5 to a fuzzy hash such as SSDEEP or sdhash might reduce the number of false-negative test results.

Amerlander commented 3 years ago

Until now I have had no issues with the md5 hashes. All failures came from errors in the code or in the .data files. At the moment all tests are passing:

          Test ended: Wed 24 Feb 23:11:09 UTC 2021

          14 FINISHED
          14 PASSED
          0 NOT PASSED

rzbrk commented 3 years ago

Or we count the lines in both .data files (the .data file created during the test and the .data file from ./testcases/testfiles/), take the lower number (let's name this variable lines_min), cut the first lines_min lines from both files using the command head, and compare only this part with md5sum or cmp. All of these tools are available on every Linux system.
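The steps above can be sketched as a small shell function. This is only an illustration of the idea, not code from the repository: the function name compare_common_prefix is hypothetical, and it assumes md5sum from GNU coreutils and files whose last line ends with a newline (wc -l counts newline characters).

```shell
# Hypothetical helper: compare only the leading lines that both files have.
# Returns 0 if the first lines_min lines are identical, 1 otherwise.
compare_common_prefix() {
    local file_a="$1" file_b="$2"
    local lines_a lines_b lines_min sum_a sum_b

    # Count the lines in both files.
    lines_a=$(wc -l < "$file_a")
    lines_b=$(wc -l < "$file_b")

    # Take the lower of the two line counts.
    if [ "$lines_a" -lt "$lines_b" ]; then
        lines_min=$lines_a
    else
        lines_min=$lines_b
    fi

    # Hash only the first lines_min lines of each file and compare the hashes.
    sum_a=$(head -n "$lines_min" "$file_a" | md5sum)
    sum_b=$(head -n "$lines_min" "$file_b" | md5sum)
    [ "$sum_a" = "$sum_b" ]
}
```

A call such as compare_common_prefix actual.data sample.data (file names are placeholders) then succeeds whenever the shorter file is a prefix of the longer one.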

rzbrk commented 3 years ago

I implemented a bash function comp() as described above; see 4fb76aa. The first and second arguments of this function are the .data files to compare against each other. The third argument is the percentage of lines that are allowed to differ, and the files may differ only at the end. If more lines differ than this threshold allows, the function returns 1 (files differ). Without this threshold, a comparison of any .data file with an empty file would return 0 (equal).
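For readers without the commit at hand, a function with this behaviour could look roughly like the sketch below. It is not the code from 4fb76aa: the name comp_sketch is hypothetical, and measuring the percentage against the longer of the two files is an assumption. It also assumes md5sum from GNU coreutils.

```shell
# Sketch only, not the comp() from the repository.
# $1, $2: .data files to compare; $3: allowed percentage of differing lines.
# Returns 0 (equal within tolerance) or 1 (files differ).
comp_sketch() {
    local file_a="$1" file_b="$2" allow_percent="$3"
    local lines_a lines_b lines_min lines_max sum_a sum_b

    lines_a=$(wc -l < "$file_a")
    lines_b=$(wc -l < "$file_b")
    if [ "$lines_a" -lt "$lines_b" ]; then
        lines_min=$lines_a; lines_max=$lines_b
    else
        lines_min=$lines_b; lines_max=$lines_a
    fi

    # Assumption: the surplus lines at the end are measured against the
    # longer file. Integer arithmetic avoids floating point in the shell.
    if [ $(( (lines_max - lines_min) * 100 )) -gt $(( allow_percent * lines_max )) ]; then
        return 1
    fi

    # The common leading part must match exactly.
    sum_a=$(head -n "$lines_min" "$file_a" | md5sum)
    sum_b=$(head -n "$lines_min" "$file_b" | md5sum)
    [ "$sum_a" = "$sum_b" ]
}
```

With a 30-line sample file and a threshold of 10, this sketch accepts an output file that is up to three lines shorter and rejects an empty one, because the length check runs before the hash comparison.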

I use the comp() function only in testcases 07a and 07b. In these two testcases we occasionally encounter a situation where the actual output .data file and the sample output .data file differ by one line at the end. I set ${allow_lines_differ_percent} to 10 (percent), which corresponds to about three lines that are allowed to differ.

This issue can be closed if #60 is merged.

calliope-edu / CalliopEO_AstroPi

Fuzzy Hashes #55