find duplicate protein substrings with cd-hit

conchoecia / odp

oxford dot plots

GNU General Public License v3.0


find duplicate protein substrings with cd-hit #45

Open conchoecia opened 1 year ago

conchoecia commented 1 year ago

Use cd-hit to find duplicate protein substrings in the input protein files.

Use this for the "best" filtering option.

Needs these files: cd-hit, cdhit.c++, cdhit-common.h, cdhit-common.o, cdhit.o, cdhit-utility.o, Makefile, license.txt
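
To make "duplicate protein substrings" concrete, here is a minimal pure-Python sketch of the check. It is a stand-in for illustration only, not cd-hit and not odp code: the function name `find_contained_proteins` and the toy sequences are hypothetical, and it only catches exact containment. It loosely mirrors cd-hit's longest-first greedy approach, so the longest copy becomes the representative.

```python
def find_contained_proteins(proteins):
    """Map each redundant protein ID to the ID of a kept protein that contains it.

    proteins: dict of protein ID -> amino acid sequence (hypothetical input shape).
    Only exact substring or exact duplicate matches are reported.
    """
    # Longest first, so a possible container is always seen before what it contains.
    # Ties are broken by ID to keep the result deterministic.
    ordered = sorted(proteins.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    kept = []        # (id, sequence) pairs retained as representatives
    contained = {}   # redundant ID -> representative ID
    for pid, seq in ordered:
        for rep_id, rep_seq in kept:
            if seq in rep_seq:
                contained[pid] = rep_id
                break
        else:
            # Not found inside any representative, so keep it as a new one.
            kept.append((pid, seq))
    return contained


# Toy data, invented for this example.
toy_proteins = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protB": "AYIAKQRQ",                # fragment of protA
    "protC": "MKTAYIAKQRQISFVKSHFSRQ",  # exact duplicate of protA
    "protD": "MSLLTEVETPIRNEW",         # unrelated
}
redundant = find_contained_proteins(toy_proteins)
```

The quadratic scan is fine for a toy example but would be slow on full proteomes, which is one reason to hand the job to cd-hit instead.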

conchoecia commented 7 months ago

One problem with doing this is the potential to use the wrong copy of the protein (if there truly are duplicates), given the reciprocal best hit approach. This will likely need to be addressed differently later with the use of orthogroup-type approaches.
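
The wrong-copy risk can be shown with a small sketch. It assumes hit scores are already available as dictionaries keyed by `(query, target)`; the helper names, IDs, and scores are hypothetical and this is not odp's implementation. When two copies are identical they tie, and which one ends up in the reciprocal best hit pair depends only on input order.

```python
def best_hits(scores):
    """Return query -> best-scoring target. On a tie, the first hit seen wins."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Species 1 has two identical copies (a1, a1_dup); species 2 has one protein (b1).
forward = {("a1", "b1"): 100, ("a1_dup", "b1"): 100}
reverse_dup_first = {("b1", "a1_dup"): 100, ("b1", "a1"): 100}
reverse_a1_first = {("b1", "a1"): 100, ("b1", "a1_dup"): 100}

pairs_dup_first = reciprocal_best_hits(forward, reverse_dup_first)
pairs_a1_first = reciprocal_best_hits(forward, reverse_a1_first)
```

Here `pairs_dup_first` pairs `b1` with `a1_dup` while `pairs_a1_first` pairs it with `a1`, even though the scores are the same. Collapsing duplicates beforehand removes the tie, but the surviving copy is still chosen by the clustering step rather than by any biological criterion.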
