It might be good style to rework the .bfm file format. Problems I see:
Under `!bonds`, the monomer IDs are stored with the smallest being 1, but internally the smallest ID is 0, which results in this unpretty line: `this->getDestination().modifyMolecules().connect(a-1,b-1);`
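As long as the file stays 1-based, the offset could at least be named and checked in one place instead of appearing as a bare `-1`. A minimal sketch, assuming 32-bit IDs; `toInternalId` and `toFileId` are hypothetical helpers, not existing functions of the code base:

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: converts a 1-based monomer ID as stored under !bonds
// into the 0-based index used internally. Rejects 0, which is not a valid
// ID in a 1-based file.
inline std::uint32_t toInternalId(std::uint32_t fileId) {
    if (fileId == 0) {
        throw std::invalid_argument("monomer IDs in the file start at 1");
    }
    return fileId - 1;
}

// Hypothetical inverse for the writer side: 0-based index to 1-based file ID.
inline std::uint32_t toFileId(std::uint32_t internalId) {
    return internalId + 1;
}
```

The call would then read `connect(toInternalId(a), toInternalId(b))`. Storing 0-based IDs in a new format version would remove the need for the conversion altogether.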
Under `!mcs`, the lines begin with ASCII-encoded, space-separated numbers followed by binary-encoded data. It might be more consistent to use either only ASCII encoding or only binary encoding.
The binary encoding under `!mcs` encodes one number per `char`, thereby limiting the range to [-128, 127], which might become a problem in the future for larger bond sets. The highest ID in an example file is already `124` and therefore quite close to the maximum. Also, as it is used as a signed ID, half of the range is unusable.
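The remaining headroom can be checked at compile time. This is only a sketch of the arithmetic, assuming the value is stored in a `signed char`; `fitsInOneChar` and `headroom` are hypothetical names:

```cpp
#include <cstdint>
#include <limits>

// True if the value can be stored in a single signed 8-bit char.
constexpr bool fitsInOneChar(int value) {
    return value >= std::numeric_limits<signed char>::min()
        && value <= std::numeric_limits<signed char>::max();
}

// Number of larger IDs still available above the current highest ID.
constexpr int headroom(int highestId) {
    return std::numeric_limits<signed char>::max() - highestId;
}

static_assert(fitsInOneChar(124), "124 still fits into one signed char");
static_assert(!fitsInOneChar(128), "128 no longer fits");
static_assert(headroom(124) == 3, "only 125, 126 and 127 are left");

// An unsigned byte would double the non-negative range, and a 16-bit
// value would extend it to 65535.
static_assert(std::numeric_limits<std::uint8_t>::max() == 255, "uint8_t range");
static_assert(std::numeric_limits<std::uint16_t>::max() == 65535, "uint16_t range");
```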
For files containing edge attributes, these attributes take up 93% of the file size in an example with roughly 500k monomers, measured with `echo $(( $( 'sed' -nr '/!attributes/,/!mcs/p' -- "$fname" | 'wc' -c ) * 100 / $( cat -- "$fname" | 'wc' -c ) ))%`. The current format is `515172-515172:2`, which including the newline takes up 16 bytes per entry. Depending on the int type used for the monomer IDs (32 bit here) and for the attribute (1 byte here), this might be reduced to 9 bytes per entry, or to 5 bytes when only the monomer ID is given instead of setting the attribute for a virtual self loop.
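The 16, 9 and 5 byte figures can be reproduced with a small packing sketch. The record layout (two 32-bit IDs plus one attribute byte, little-endian) is an assumption made for illustration, and `packEdgeAttribute`, `packMonomerAttribute`, `putU32`, `getU32` and `textEntryBytes` are hypothetical names:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Size of one ASCII entry on disk: the text plus the trailing newline.
inline std::size_t textEntryBytes(const std::string& entry) {
    return entry.size() + 1;
}

// Writes a 32-bit value in little-endian byte order at out[offset..offset+3].
template <std::size_t N>
void putU32(std::array<std::uint8_t, N>& out, std::size_t offset, std::uint32_t v) {
    for (std::size_t i = 0; i < 4; ++i) {
        out[offset + i] = static_cast<std::uint8_t>((v >> (8 * i)) & 0xFFu);
    }
}

// Reads back a little-endian 32-bit value from in[offset..offset+3].
template <std::size_t N>
std::uint32_t getU32(const std::array<std::uint8_t, N>& in, std::size_t offset) {
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        v |= static_cast<std::uint32_t>(in[offset + i]) << (8 * i);
    }
    return v;
}

// Assumed edge record: ID a (4 bytes), ID b (4 bytes), attribute (1 byte).
inline std::array<std::uint8_t, 9> packEdgeAttribute(std::uint32_t a, std::uint32_t b,
                                                     std::uint8_t attribute) {
    std::array<std::uint8_t, 9> record{};
    putU32(record, 0, a);
    putU32(record, 4, b);
    record[8] = attribute;
    return record;
}

// Assumed monomer record: ID (4 bytes), attribute (1 byte), no self loop.
inline std::array<std::uint8_t, 5> packMonomerAttribute(std::uint32_t id,
                                                        std::uint8_t attribute) {
    std::array<std::uint8_t, 5> record{};
    putU32(record, 0, id);
    record[4] = attribute;
    return record;
}
```

Writing the bytes explicitly rather than dumping a struct keeps the record size independent of compiler padding and of the byte order of the machine.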
As attributes require a connection, the `!attributes` information already includes the `!bonds` information, so the two should maybe be merged. The `!attributes` data entry might then be used solely for monomer attributes / self loop attributes with the discussed 5 bytes per entry. This would reduce the example file discussed above from 8MB to roughly 3MB. This is still quite large, considering that it only adds a 1-byte attribute tag for each monomer, while the whole position data of all monomers takes up only 500kB. The position data is that compact because no monomer ID needs to be written down; it is implied instead.
Each line after `!mcs` denotes a linear chain and triggers connections being made between the monomers of that chain. The `!bonds` already look like they contain all bonds, so it might not be obvious that additional data from `!mcs` is needed to get the full connectome, although it does reduce the file size.
When using full binary encoding for the file, it won't be human-readable / checkable anymore, but it would be very easy to read e.g. the header by defining a struct. This would be similar to how the Bitmap file header can be read: `fread(&bitmapFileHeader, sizeof(BITMAPFILEHEADER),1,filePtr);`. Custom header lines are problematic to implement in binary, as the name of the command is needed anyway.
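A struct-based header read in the style of the Bitmap example could look roughly like the following. The struct `BfmBinaryHeader`, all of its fields and the magic bytes `"BFM2"` are invented for illustration and are not part of the existing .bfm format:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <type_traits>

// Hypothetical fixed-size binary header. Only fixed-width fields are used,
// so the layout has no padding on common platforms (checked below).
struct BfmBinaryHeader {
    char          magic[4];      // assumed file signature, e.g. "BFM2"
    std::uint32_t version;       // assumed format version
    std::uint32_t monomerCount;  // number of monomers in the file
    std::uint32_t boxX, boxY, boxZ;
};

static_assert(std::is_trivially_copyable<BfmBinaryHeader>::value,
              "header must be safe to read with fread");
static_assert(sizeof(BfmBinaryHeader) == 24, "unexpected padding in header");

// Reads the header in one call and checks the assumed signature.
inline bool readHeader(std::FILE* file, BfmBinaryHeader& header) {
    if (std::fread(&header, sizeof(BfmBinaryHeader), 1, file) != 1) {
        return false;
    }
    return std::memcmp(header.magic, "BFM2", 4) == 0;
}

// Counterpart for the writer side.
inline bool writeHeader(std::FILE* file, const BfmBinaryHeader& header) {
    return std::fwrite(&header, sizeof(BfmBinaryHeader), 1, file) == 1;
}
```

Dumping a struct like this ties the file to the byte order of the writing machine, so a real format would have to fix the endianness or convert on read.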
Some of these suggestions violate the idea of a human-readable, yet still compressed, file format. For version 2.2 we might cherry-pick some suggestions and omit the remaining ones.