Our current design accepts several sets of sorted kmers as input and produces a sorted union of those kmer sets. I suggested to Alex that the colex re-sort in cosmo is not necessary for these kmers, because the data structure could be mirrored to operate on the 5' end instead of the 3' end. We would still need to sort the reverse complements of the kmers, but we should be able to halve the amount of data we sort.
Alternatively, the order of the kmers in the union is the same as the colex order of their reverse complements if you assume a different order for the DNA alphabet. For example, the nibblet '00' is currently interpreted as the nucleotide 'A', but we could interpret that bit pattern as 'T'. If you define the alphabet order as ['T', 'G', 'C', 'A'] instead of ['A', 'C', 'G', 'T'] and treat the least significant bits as the 5' end instead of the 3' end, then the existing data is already the colex-sorted reverse complement. In that case it would still be necessary to sort the sense-strand version of the data.
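The ordering claim can be checked on a small case. The following is a minimal sketch, assuming kmers are plain strings sorted lexicographically under A < C < G < T; the helper names are hypothetical and are not taken from cosmo.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Assumed mirrored alphabet order: T < G < C < A.
MIRRORED_RANK = {"T": 0, "G": 1, "C": 2, "A": 3}


def reverse_complement(kmer):
    """Complement every base and reverse the string."""
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def mirrored_colex_key(kmer):
    """Colex key: compare from the last base backwards, ranking T < G < C < A."""
    return [MIRRORED_RANK[base] for base in reversed(kmer)]


# All 3-mers in ordinary lexicographic order (A < C < G < T).
kmers = sorted("".join(p) for p in product("ACGT", repeat=3))

# Reverse-complement each kmer without changing the order of the list.
revcomps_in_kmer_order = [reverse_complement(k) for k in kmers]

# Sorting the reverse complements in mirrored colex order changes nothing,
# so the list was already in that order.
revcomps_resorted = sorted(revcomps_in_kmer_order, key=mirrored_colex_key)
assert revcomps_in_kmer_order == revcomps_resorted
```

The reason this holds is that colex comparison of a reverse complement visits the complements of the original bases in their original 5' to 3' order, and complementing a base flips its rank exactly as the mirrored alphabet does.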
This alternative might be the easiest to implement: we could reverse the nibblets and change the code that deals with nibblet encoding and alphabet symbol order.
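As a rough illustration of reversing the nibblets and reinterpreting the encoding, here is a sketch that assumes each kmer is packed into an integer at 2 bits per base, with the first base in the most significant bits and codes 00, 01, 10, 11 for A, C, G, T. The function names and the packing layout are assumptions for illustration, not cosmo's actual representation.

```python
STANDARD = "ACGT"  # code 0b00 = A, 0b01 = C, 0b10 = G, 0b11 = T
MIRRORED = "TGCA"  # the same codes read as the complementary bases


def pack(kmer):
    """Pack a kmer into an int, first base in the most significant bits."""
    value = 0
    for base in kmer:
        value = (value << 2) | STANDARD.index(base)
    return value


def unpack(value, k, alphabet):
    """Decode k 2-bit symbols, most significant first, using the given alphabet."""
    return "".join(alphabet[(value >> (2 * i)) & 0b11] for i in reversed(range(k)))


def reverse_nibblets(value, k):
    """Reverse the order of the k 2-bit symbols without changing their bits."""
    out = 0
    for _ in range(k):
        out = (out << 2) | (value & 0b11)
        value >>= 2
    return out


packed = pack("ACG")
# Reversing the 2-bit symbols and decoding with the mirrored alphabet
# gives the reverse complement; no bits need to be complemented.
assert unpack(reverse_nibblets(packed, 3), 3, MIRRORED) == "CGT"
```

Under these assumptions the complement step costs nothing, because it is absorbed into how the bit patterns are decoded rather than applied to the data.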