bcgsc / btl_bloomfilter

The BTL C/C++ Common bloom filters for bioinformatics projects, as well as any APIs created for other programming languages.
GNU General Public License v3.0
18 stars 4 forks source link

Remove unnecessary variables from bloom filter headers and revise Bloom filter file format spec in Readme #14

Closed JustinChu closed 5 years ago

JustinChu commented 5 years ago

Variables dFPR, nEntry, and tEntry in various header files might not be needed. We should remove them and make sure the readme reflects the changes.

jwcodee commented 5 years ago

15

The variables are dealt with in the above pull request. I will modify the readme in a subsequent pull request

sjackman commented 5 years ago

Fields cannot be removed from the header. You need to consider what happens when an older version of code reads a file created by a newer version of code and vice versa. If the fields are unused, you can rename them to unused1, unused2, and unused3. If these fields are removed, the location of the version field in the header would move, which would break all the things.

sjackman commented 5 years ago

If you new fields are added to the header, they must be appended to the end of the header.

JustinChu commented 5 years ago

They can be removed frome the counting bloom filter at least, considering that we haven't been using this datastructure for much anyway

sjackman commented 5 years ago

I would like the standard Bloom filter and the counting Bloom filter to use the same file format for serialization. Please factor FileHeader out into a common header and include it in both BloomFilter.hpp and CountingBloomFilter.hpp.

sjackman commented 5 years ago

Please do not remove the unused fields to retain binary compatibility with previous releases.

JustinChu commented 5 years ago

The MIBloomFilter.hpp also has a header. But it differs and has the ability to add spaced seeds to the header. I don't think we enforce the same format requirements. These data structures have different requirements.

sjackman commented 5 years ago

We need to ensure that accidentally loading a non-counting Bloom filter, counting Bloom filter, or MIBF Bloom filter with the wrong code doesn't silently do crazy things.

sjackman commented 5 years ago

The easiest way to do that is to share one common file format. It's not the only way, but it's the easiest way.

JustinChu commented 5 years ago

The initial check for the file type (magic number) should indicate the procedure to handle these types.

sjackman commented 5 years ago

Ah, MIBloomFilter.hpp has a different magic number. https://github.com/bcgsc/btl_bloomfilter/blob/4938b4a0f98cef8cf41778b46d4c4e80608e2eca/MIBloomFilter.hpp#L218 That one can do whatever it likes then. Non-counting and counting Bloom filters can share a magic number and file format. A non-counting Bloom filter is simply a counting Bloom filter with bitsPerCounter=1.

JustinChu commented 5 years ago

I guess the point I'm trying to make is it doesn't make sense to me that the bloom filters and counting bloomfilters should have the same magic number. Also, the same check for file type should be added in both.

If we were to go back in time. we should have had the header as follows: magic number -> header size -> version number -> etc...

This would have given us more flexiability in header changes without doing weird things between versions.

sjackman commented 5 years ago

If we were to go back in time. we should have had the header as follows: magic number -> header size -> version number -> etc... This would have given us more flexiability in header changes without doing weird things between versions.

Yes, I agree, but we have the format we have now. Unless we bump the magic string again and make a new format.

I guess the point I'm trying to make is it doesn't make sense to me that the bloom filters and counting bloomfilters should have the same magic number. Also, the same check for file type should be added in both.

I think counting Bloom filters an non-counting Bloom filters have way more in common than differences. Like I said above, a non-counting Bloom filter is simply a counting Bloom filter with bitsPerCounter=1.