Kmer-File-Format / kff-cpp-api

A C++ API to read and write kff files
GNU Affero General Public License v3.0
need an example for storing counts above 255 #5

Closed rchikhi closed 3 years ago

rchikhi commented 3 years ago

Can you please show a small working example for storing large count values? I tried the following:

        // some questions about endianness here
        void u8from32 (uint8_t b[4], uint32_t u32)
        {
            b[0] = (uint8_t)u32;
            b[1] = (uint8_t)(u32>>=8);
            b[1] = 1; // just a test to see if counts above 255 will be correctly stored
            b[2] = (uint8_t)(u32>>=8);
            b[3] = (uint8_t)(u32>>=8);
        }

        encode_sequence(sequence.c_str(), encoded);
        u8from32(counts, sum);
        sr->write_compacted_sequence(encoded, _kmerSize, counts);

and yet, outstr only shows counts <= 255.
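For reference, a minimal sketch of a byte-order-explicit version of the helper above, without the test override on `b[1]`. The names `store_u32_le` and `load_u32_le` are hypothetical and are not part of kff-cpp-api. The sketch assumes a least-significant-byte-first layout, which is only one of the two possible choices; the thread does not say which order KFF expects.

```cpp
#include <cstdint>

// Hypothetical helper (not part of kff-cpp-api): store a 32-bit value as
// 4 bytes, least significant byte first, whatever the host byte order is.
void store_u32_le(uint8_t out[4], uint32_t value) {
    for (int i = 0; i < 4; ++i) {
        out[i] = static_cast<uint8_t>(value >> (8 * i));
    }
}

// Hypothetical inverse: rebuild the 32-bit value from the same 4 bytes.
uint32_t load_u32_le(const uint8_t in[4]) {
    uint32_t value = 0;
    for (int i = 0; i < 4; ++i) {
        value |= static_cast<uint32_t>(in[i]) << (8 * i);
    }
    return value;
}
```

With this layout, a count of 0x129 is stored as the bytes 29 01 00 00, and loading those four bytes gives 0x129 back.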

rchikhi commented 3 years ago

Furthermore, my kff file looks like:


$ hexdump 0-0.kff  |head -n 10
0000000 0001 001e 0000 7600 0003 0000 0000 0000
0000010 006b 001f 0000 0000 0000 616d 0078 ca00
0000020 3b9a 0000 0000 6164 6174 735f 7a69 0065
0000030 0004 0000 0000 0000 d072 3ef7 0100 0000
0000040 0000 0000 0000 0000 **2900 0001** 0100 0000
0000050 0000 0000 0000 0000 9301 0001 0100 0000

I added **'s to highlight where the first count (0x129) is supposed to be encoded; yet it is printed as 0x29.
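When reading the dump above, note that `hexdump` without options prints 16-bit words in host byte order, so on a little-endian machine each printed word shows its two bytes swapped relative to their order in the file. The sketch below reproduces that grouping. The function name `words_le` is hypothetical, and the little-endian host is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: format bytes the way default `hexdump` displays them
// on a little-endian host. Each pair of file bytes becomes one 16-bit word,
// printed with the second byte of the pair first.
// A trailing odd byte is ignored in this sketch.
std::string words_le(const std::vector<uint8_t>& bytes) {
    std::string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        unsigned word = bytes[i] | (static_cast<unsigned>(bytes[i + 1]) << 8);
        char buf[8];
        std::snprintf(buf, sizeof buf, "%04x", word);
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Under that assumption, the highlighted words `2900 0001` correspond to the file bytes 00 29 01 00, so the byte 0x29 is immediately followed by 0x01. That is what a least-significant-byte-first encoding of 0x129 would look like, which would suggest the file holds the full count and the truncation happens when the value is read back.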

rchikhi commented 3 years ago

btw my global variables are

        Section_GV sgv(_outfile);
        sgv.write_var("k", _kmerSize);
        sgv.write_var("max", 1000000000L); //  ¯\_(ツ)_/¯
        sgv.write_var("data_size", 4); // DSK counts are stored as uint32_t

yoann-dufresne commented 3 years ago

I think the problem is due to outstr: there is a bug in how it reads and outputs the data.

There is also another problem: the endianness is undefined. I am still asking myself how to fix it: either impose one of the two byte orders, or add it as a variable in the global variable section.
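Either option comes down to converting values between the host byte order and the order chosen for the file. A minimal sketch of that conversion follows. All function names are hypothetical and none of this is part of kff-cpp-api; the `file_is_little_endian` flag stands in for either a fixed convention or a value read from the global variable section.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: detect the host byte order at run time by looking at
// the first byte in memory of the integer 1.
bool host_is_little_endian() {
    const uint32_t probe = 1;
    uint8_t first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Reverse the order of the 4 bytes of a 32-bit value.
uint32_t byteswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Return a value whose in-memory bytes follow the byte order chosen for the
// file, so the 4 bytes can be written out as they are.
uint32_t to_file_order(uint32_t v, bool file_is_little_endian) {
    return host_is_little_endian() == file_is_little_endian ? v : byteswap32(v);
}
```

Copying the result of `to_file_order(0x129, true)` into a byte buffer gives 29 01 00 00 on any host, and passing `false` gives 00 00 01 29.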

A possible optimization for your code: you set max to a huge number. This means that every block can hold up to that many kmers. It also implies that the integer storing the number of kmers in a block has to be 4 bytes long (for each block). As far as I know, you store one kmer per block for now, so you should write max=1. That saves the 4 bytes of block size for each block.
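The arithmetic behind those figures can be sketched as follows. This assumes the per-block kmer counter is sized as ceil(log2(max)) bits rounded up to whole bytes. That rule is an assumption chosen because it reproduces the 4 bytes and 0 bytes mentioned above; the KFF specification is the authority on the exact rule. The function name is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: number of bytes needed for the per-block kmer counter,
// assuming it is sized as ceil(log2(max)) bits rounded up to whole bytes.
unsigned bytes_for_block_counter(uint64_t max) {
    unsigned bits = 0;
    // Find the smallest bit count such that 2^bits >= max.
    while (bits < 64 && (uint64_t{1} << bits) < max) {
        ++bits;
    }
    return (bits + 7) / 8;
}
```

Under this assumption, max=1000000000 needs 30 bits, hence 4 bytes per block, while max=1 needs 0 bits, hence no counter bytes at all.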

yoann-dufresne commented 3 years ago

The outstr error should now be fixed. It was caused by a hardcoded variable value.