Proteins duplicated in the results in "From Query" version of the pipeline

When the pipeline is started from an input folder of query proteins (which are then used to search for hits), certain proteins are duplicated many times in the results. The duplication seems to happen late in the pipeline: there are no duplicates in the all_by_all.tsv, leiden_features.tsv, or strucluster.tsv files, in the list of PDB files, or in the output of any of the other steps. The only files where I have found duplicates are the aggregated_features.tsv files. I'm not sure whether the HTML plots themselves contain duplicates.
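For anyone who wants to check their own outputs, here is a minimal sketch of how duplicated entries in a TSV file could be counted. It uses only the Python standard library. The function name `find_duplicate_ids` and the identifier column name `protid` are assumptions made for illustration, not something confirmed by the pipeline; adjust the column name to match the header in your file.

```python
import csv
from collections import Counter


def find_duplicate_ids(tsv_path, id_column="protid"):
    """Return {identifier: count} for identifiers that occur more than once.

    `id_column` is an assumed column name; change it to whichever column
    holds the protein identifier in the file being checked.
    """
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Fail loudly if the assumed identifier column is not in the header.
        if reader.fieldnames is None or id_column not in reader.fieldnames:
            raise KeyError(f"column {id_column!r} not found in {tsv_path}")
        counts = Counter(row[id_column] for row in reader)
    # Keep only identifiers that appear on more than one row.
    return {pid: n for pid, n in counts.items() if n > 1}
```

Calling `find_duplicate_ids` on an aggregated_features.tsv file and on one of the upstream files (for example leiden_features.tsv) should return a non-empty dictionary for the former and an empty one for the latter if the behavior described here is reproduced.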

To get past this, I pulled all the PDB files and the UniProt features file into a separate folder and ran the "from folder" version of the pipeline, which got rid of the duplicates. So the duplication appears to happen only in the "from query" version of the pipeline.

Arcadia-Science / ProteinCartography, issue #38