bcgsc / btl_bloomfilter

The BTL C/C++ Common bloom filters for bioinformatics projects, as well as any APIs created for other programming languages.
GNU General Public License v3.0
18 stars 4 forks source link

File header #23

Closed jwcodee closed 5 years ago

jwcodee commented 5 years ago

File Header implementation as discussed in the BTL meeting. This header utilize a header library from https://github.com/skystrife/cpptoml. Sample header:

[BTLCountingBloomFilter_v1]
    BloomFilterSize = 10737418240
    HashNum = 4
    KmerSize = 88
    BitsPerCounter = 8
[HeaderEnd]
JustinChu commented 5 years ago

Are we scraping keeping the magic string 8 bytes at the start of the file? They are typically 8 bytes in length. We could make it commented in toml format eg.

#BTL_CBF
[HeaderStart]
    BloomFilterSize = 10737418240
    HashNum = 4
    KmerSize = 88
    BitsPerCounter = 8
[HeaderEnd]

At the very least I think we shoould have a fixed length magic str.

JustinChu commented 5 years ago

Actually based on this list, I take back the thing about 8bytes. However I still strongly feel that it should be a fixed length and perhaps be a commented line in the toml format.

jwcodee commented 5 years ago

From what I can tell, the serializer does not support comments. I'll have to manually add line at the top in ASCI which isn't difficult, but I dont think a fixed length is necessary. There is a new static constexpr const char* member in the class that is "BTLCountingBloomFilter_v1". If you change that member which you will if you make a new version, the readHeader function will exit if the member magic string doesn't match the one in the bloom filter file.

vlad0x00 commented 5 years ago

Actually based on this list, I take back the thing about 8bytes. However I still strongly feel that it should be a fixed length and perhaps be a commented line in the toml format.

The starting bytes of the Bloom filter would always be "[BTLCountingBloomFilter_" so you can consider that the fixed size magic string. After that comes the version which makes the rest of the magic string unique. The reason why version is part of the section name is because the TOML library Johnathan is using does not respect the order of the fields as he inputs them, so he can't make sure that the version field is the first one. This makes sense, I think, as we now put no restriction for the order of the fields.

sjackman commented 5 years ago

How about a magic string of simply [BTLCountingBloomFilter] and then a version field within? As long as we know that the format is TOML, it shouldn't matter that the version isn't the first field.

vlad0x00 commented 5 years ago

How about a magic string of simply [BTLCountingBloomFilter] and then a version field within? As long as we know that the format is TOML, it shouldn't matter that the version isn't the first field.

The issue with that is if we change the format in the future. If version is not the first field you read, you can't be 100% sure what to expect right after [BTLCountingBloomFilter] and how to parse it. If you read the version first, your code immediately knows whether it recognizes the format and how to proceed. Remember the issue with Bloom filters before this pull request - we couldn't remove or change fields because it would be backwards incompatible and code could fail by trying to read fields differently than intended.

Another nice thing about having version in the magic string is that it makes the start of the Bloom filter unique, depending on the version. This makes it harder for a code that reads Bloom filter header to accidentally proceed by simply seeing [BTLCountingBloomFilter] and not checking the version field.

sjackman commented 5 years ago

If you're using a header with a defined container format (like TOML) rather than a binary blob, it's easier to support future versions of code being able to support parsing multiple versions of the file format (which a nice feature to have). That's a bit harder to do if the magic version changes. Since you know the format of the header is TOML, you don't need to know the version to parse it, which is the big advantage of using a defined container format. You still have the option of adding a _v2 suffix to the magic in the future if you changed the container format in the future, which is the only reason that I can think of to need to bump the magic string. The difference with a binary blob is that you need to know the version to be able to parse the header.

vlad0x00 commented 5 years ago

If you're using a header with a defined container format (like TOML) rather than a binary blob, it's easier to support future versions of code being able to support parsing multiple versions of the file format (which a nice feature to have). That's a bit harder to do if the magic version changes. Since you know the format of the header is TOML, you don't need to know the version to parse it, which is the big advantage of using a defined container format. You still have the option of adding a _v2 suffix to the magic in the future if you changed the container format in the future, which is the only reason that I can think of to need to bump the magic string. The difference with a binary blob is that you need to know the version to be able to parse the header.

I feel like both of our approaches are justified and at this point we might be overengineering. :D The only thing that concerns me is that, despite the fact that we know the header is in TOML format, the TOML format itself might change in the future and new syntax might be added, making parsers interpret it differently. Although this is unlikely, I feel like a robust header that would permit any standard change would be pretty cool.

sjackman commented 5 years ago

Agreed. We're bike shedding at this point. https://en.wikipedia.org/wiki/Law_of_triviality However it's implemented, if the header version does increase, I'd recommend future versions of the code be able to parse both the old version and the new version.

jwcodee commented 5 years ago

So I think this is ready to ship now unless there are additional changes requested. Should I change the configure.ac to 1.2.0 in this PR or another one?

jwcodee commented 5 years ago

hmm should I move catch.hpp to vendor too in this PR?

vlad0x00 commented 5 years ago

hmm should I move catch.hpp to vendor too in this PR?

Ideally, since that's not a change relevant to the file header, I would do it in a separate pull request.

JustinChu commented 5 years ago

So I think this is ready to ship now unless there are additional changes requested. Should I change the configure.ac to 1.2.0 in this PR or another one?

I don't think it really matters, so long as the version number change is in its own commit. Is this the final PR we will be releasing as this version? If so, you can change the version number.

jwcodee commented 5 years ago

I'll add a credits header at the top of CountingBloomFilter.hpp and that is all there is for me with regards to this PR. I didn't push it yet because I dont want to use up our CI minutes with just that change if we are also going to do a few more things on this PR