Closed by braebigge 1 year ago
This should be resolved as part of #39. The duplication was partially due to pagination issues in query_uniprot.py and partially due to different TM-scores or e-values for shared hits across Foldseek databases in extract_foldseekhits.py. We now use the lowest e-value for the same hit across databases by default.
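For readers following along, the "lowest e-value per hit" rule can be sketched roughly as below. This is only an illustration of the idea, not the code in extract_foldseekhits.py: the function name `keep_lowest_evalue`, the dictionary keys (`target`, `database`, `evalue`), and the sample records are all assumptions made up for the example.

```python
def keep_lowest_evalue(hits):
    """Keep one record per target, choosing the one with the lowest e-value.

    `hits` is an iterable of dicts with hypothetical keys
    'target', 'database' and 'evalue'.
    """
    best = {}
    for hit in hits:
        key = hit["target"]
        # Replace the stored record only if this one has a strictly lower e-value.
        if key not in best or hit["evalue"] < best[key]["evalue"]:
            best[key] = hit
    return list(best.values())


# Illustrative records: the same target returned by two databases.
hits = [
    {"target": "P12345", "database": "afdb50", "evalue": 1e-10},
    {"target": "P12345", "database": "afdb-swissprot", "evalue": 1e-12},
    {"target": "Q67890", "database": "afdb50", "evalue": 3e-5},
]

deduped = keep_lowest_evalue(hits)
for hit in deduped:
    print(hit["target"], hit["database"], hit["evalue"])
```

With this rule each target appears once in the output, regardless of how many databases returned it.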
When starting the pipeline from an input folder containing query proteins (which are then used to search for hits), it gives results where certain proteins are duplicated many times. The duplication seems to happen late in the pipeline. There are no duplicates in the all_by_all.tsv files, the leiden_features.tsv files, the strucluster.tsv files, the list of PDB files, or the output of any of the other steps. The only files I've found that contain duplicates are the aggregated_features.tsv files. I'm not sure whether the HTML plots themselves have duplicates.
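One way to check a file such as aggregated_features.tsv for repeated proteins is sketched below, using only the standard library. The ID column name `protid` and the sample data are assumptions for illustration; substitute whatever identifier column the file actually uses.

```python
import csv
import io
from collections import Counter


def find_duplicate_ids(tsv_file, id_column="protid"):
    """Return {id: count} for every ID appearing more than once.

    `tsv_file` is an open text file (or file-like object) with a header row;
    `id_column` is a hypothetical column name.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter(row[id_column] for row in reader)
    return {pid: n for pid, n in counts.items() if n > 1}


# Small in-memory stand-in for a real TSV file.
sample = io.StringIO("protid\tfeature\nA\t1\nB\t2\nA\t1\nA\t1\n")
duplicates = find_duplicate_ids(sample)
print(duplicates)
```

For a file on disk, the same function can be called as `find_duplicate_ids(open(path, newline=""))`.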
To get past this, I pulled all the PDB files and the UniProt features file into a separate folder and ran the "from folder" version of the pipeline, which got rid of the duplicates. So this duplication only happens in the "from query" version of the pipeline.