Existing TCR Levenshtein code raises an exception when trying to parse TRAV40*01
This is indeed a functional TRAV allele, as confirmed by current IMGT data.
However, IMGT data shows that this allele has a trivial CDR2 sequence of length 0.
This means that the IMGT database entry for TRAV40*01 is missing the CDR2 field.
Because of this, the current code raises an exception when parsing TRAV40*01, saying that a CDR2 sequence could not be found.
This is now handled by returning an empty string for any V gene that is confirmed functional but is missing certain fields (TRAV40*01 is the only one I know of for now, though).
This new edge case is now covered by the unit tests.
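A minimal sketch of the fallback described above, not the actual code in this PR. The record layout (`name`, `functional`, `cdr2` keys), the `get_cdr2` helper, and the `TRAVX*01` entries are all hypothetical, and the sequence shown is a placeholder rather than real IMGT data. It also assumes that a missing field on a non-functional gene still raises, which the description above does not state.

```python
def get_cdr2(record: dict) -> str:
    """Return the CDR2 sequence for a V gene record (hypothetical layout)."""
    cdr2 = record.get("cdr2")
    if cdr2 is not None:
        return cdr2
    if record.get("functional", False):
        # Functional gene whose entry lacks a CDR2 field (e.g. TRAV40*01):
        # treat the CDR2 as a zero-length sequence instead of failing.
        return ""
    # Assumption: anything else with a missing field is still an error.
    raise ValueError(f"CDR2 sequence could not be found for {record.get('name')}")


# Placeholder records, not real IMGT entries.
with_cdr2 = {"name": "TRAVX*01", "functional": True, "cdr2": "AAAAA"}
without_cdr2 = {"name": "TRAV40*01", "functional": True}

print(repr(get_cdr2(with_cdr2)))
print(repr(get_cdr2(without_cdr2)))
```

A unit test for the edge case would then only need to assert that the TRAV40*01-style record yields `""` rather than an exception.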
Wrapper around RapidFuzz Levenshtein API has been revised to optimise for speed whenever possible
RapidFuzz provides an option to use multiple cores when available; this is now used.
Previously, my calls to the RapidFuzz C++ API were fully wrapped in a Python callable, which prevented RapidFuzz from using its C++ back end to release the Python GIL during cdist calculations. This is unfortunately still the case if you want custom weights for insertions, deletions and substitutions, but if they are all kept at the default (1), the calls go directly to the C++ API, which makes everything blazingly fast 😎
@jhenderson0 you were saying the new implementation was slow, so this is for you 🎉
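A rough sketch of the dispatch described above, not the wrapper as it exists in this PR. `uses_native_scorer` and `levenshtein_cdist` are hypothetical names, and wrapping the scorer with `functools.partial` is just one assumed way of producing the Python-level callable used for custom weights. The RapidFuzz calls used are `process.cdist` (with its `scorer` and `workers` parameters) and `Levenshtein.distance` (with its `weights` parameter).

```python
from functools import partial

DEFAULT_WEIGHTS = (1, 1, 1)  # (insertion, deletion, substitution)


def uses_native_scorer(weights=DEFAULT_WEIGHTS) -> bool:
    """Return True when the scorer can be handed to RapidFuzz unwrapped."""
    return tuple(weights) == DEFAULT_WEIGHTS


def levenshtein_cdist(queries, choices, weights=DEFAULT_WEIGHTS, workers=-1):
    """Pairwise Levenshtein distances between two lists of strings."""
    # Imported lazily so the helper above works without RapidFuzz installed.
    from rapidfuzz import process
    from rapidfuzz.distance import Levenshtein

    if uses_native_scorer(weights):
        # Default weights: pass the library's own scorer object directly,
        # with no Python wrapper around it.
        scorer = Levenshtein.distance
    else:
        # Assumption: custom weights go through a Python-level callable,
        # matching the slower path described in this PR.
        scorer = partial(Levenshtein.distance, weights=tuple(weights))

    # workers=-1 asks RapidFuzz to use all available cores.
    return process.cdist(queries, choices, scorer=scorer, workers=workers)
```

With RapidFuzz and NumPy installed, `levenshtein_cdist(["CASS"], ["CASS", "CAST"])` would take the direct path, while passing e.g. `weights=(1, 1, 2)` would take the wrapped one.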