amnh / PCG

𝙋𝙝𝙮𝙡𝙤𝙜𝙚𝙣𝙚𝙩𝙞𝙘 𝘾𝙤𝙢𝙥𝙤𝙣𝙚𝙣𝙩 𝙂𝙧𝙖𝙥𝙝 ⸺ Haskell program and libraries for general phylogenetic graph search
28 stars 1 forks source link

getSubsetIndex bug #165

Closed Boarders closed 4 years ago

Boarders commented 4 years ago

Currently on master several of the tests to do with dynamic character decoding fail (with some throwing exceptions). I eventually tracked down that this is an error in the unsorted branch of the getSubsetIndex code. In particular I added the following test:

subsetIndex :: TestTree
subsetIndex =
  testGroup "Subset Index Tests:"
      [ HU.testCase "getSortedLookup agrees for sorted and unsorted alphabets" sortedLookup
      ]
  where
    alphabet :: Alphabet String
    alphabet = fromSymbols ["0","1","2"]

    sortedAlphabet :: Alphabet String
    sortedAlphabet = alphabet {isSorted = True}

    sortedLookup :: Assertion
    sortedLookup =
     (getSubsetIndex alphabet (Set.singleton "1"))
      @?=
     (getSubsetIndex sortedAlphabet (Set.singleton "1"))

This gives the following:

    Subset Index Tests:
      getSortedLookup agrees for sorted and unsorted alphabets:                      FAIL
        lib/alphabet/test/Data/Alphabet/Test.hs:158:
        expected: 2
         but got: 0

For now I am going to fix the error in the unsorted function but it is probably worth moving towards having an invariant that all alphabets are always sorted and deleting code that doesn't assume otherwise.

I should note that using only sorted alphabets fixes all the tests from core:data-structures to pass.

recursion-ninja commented 4 years ago

Can't sort all alphabets.

Alphabets of "additive characters" must remain in their input order (likely unsorted). Otherwise the additive TCMs would be transformed from a useful structure to a rather useless one, and the character would then have to be treated as a "metric character."

Boarders commented 4 years ago

Ok, I have a fix for the unsorted code so I will merge that.

Boarders commented 4 years ago

This is hopefully now fixed here

I don't know if this is most efficient but there is at least some pretty extensive testing in place here to ensure that any optimisations remain correct.