Open conchoecia opened 1 year ago
One risk with this: if the duplicates are real, the reciprocal best hit approach may end up using the wrong copy of the protein. This will likely need to be handled differently later, probably with an orthogroup-type approach.
See also #49
Use cd-hit to find proteins in the input protein files that are duplicates or substrings of other proteins.
Use the result for the "best" filtering option.
Needs these files: cd-hit, cdhit.c++, cdhit-common.h, cdhit-common.o, cdhit.o, cdhit-utility.o, Makefile, license.txt
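A rough sketch of how the cd-hit output could feed the "best" filter, not a final implementation. It assumes cd-hit was run on the protein file with a 100% identity threshold (something like `cd-hit -i proteins.fasta -o proteins.nr -c 1.0`, where the file names are placeholders) and that it wrote the usual `.clstr` file next to the output. The function name `parse_clstr` and the returned dict layout are hypothetical, made up for illustration. Note that sequence names in the `.clstr` file may be truncated depending on cd-hit's settings.

```python
import re

# Member lines in a protein .clstr file look like:
#   "0\t300aa, >protA... *"            (cluster representative)
#   "1\t120aa, >protB... at 100.00%"   (redundant member)
MEMBER_RE = re.compile(r"^\d+\s+(\d+)aa,\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(lines):
    """Map each cluster representative to the proteins clustered under it.

    Hypothetical helper: only clusters with at least one redundant
    member are returned, since singletons need no filtering.
    """
    clusters = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Start a new cluster record.
            clusters.append({"rep": None, "dups": []})
            continue
        match = MEMBER_RE.match(line)
        if match is None or not clusters:
            continue  # skip anything that is not a member line
        _length, name, status = match.groups()
        if status == "*":
            clusters[-1]["rep"] = name
        else:
            clusters[-1]["dups"].append(name)
    return {c["rep"]: c["dups"] for c in clusters if c["rep"] and c["dups"]}


if __name__ == "__main__":
    example = [
        ">Cluster 0",
        "0\t300aa, >protA... *",
        "1\t120aa, >protB... at 100.00%",
        ">Cluster 1",
        "0\t250aa, >protC... *",
    ]
    print(parse_clstr(example))
```

The values of the returned dict are the proteins that the "best" option could drop, keeping only the representative. This still has the problem mentioned above: if the copies are true duplicates, keeping only the representative may not be the right choice.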