Arcadia-Science / ProteinCartography

a pipeline to build similarity maps of protein space
MIT License

Proteins duplicated in the results in "From Query" version of the pipeline #38

Closed: braebigge closed this issue 1 year ago

braebigge commented 1 year ago

When the pipeline starts from an input folder containing query proteins (which are then used to search for hits), the results contain certain proteins duplicated many times. The duplication seems to be introduced late in the pipeline. There are no duplicates in the all_by_all.tsv, leiden_features.tsv, or strucluster.tsv files, in the list of PDB files, or in any of the other steps. The only files I've found that have duplicates are the aggregated_features.tsv files. I'm not sure whether the HTML plots themselves have duplicates.

To work around this, I pulled all the PDB files and the UniProt features file into a separate folder and ran the "from folder" version of the pipeline, which got rid of the duplicates. So this duplication only happens in the "from query" version of the pipeline.
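For anyone who wants to check their own outputs for the same symptom, here is a minimal sketch of a duplicate check over a tab-separated file. It uses only the standard library. The ID column name `protid`, the `LeidenCluster` column, and the example rows are assumptions for illustration, so adjust them to match the actual header of your aggregated_features.tsv.

```python
import csv
import io
from collections import Counter


def find_duplicate_ids(tsv_handle, id_column="protid"):
    """Return {id: count} for every ID that appears more than once.

    `id_column` is an assumed column name; change it to match your file.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    counts = Counter(row[id_column] for row in reader)
    return {pid: n for pid, n in counts.items() if n > 1}


# Small in-memory stand-in for an aggregated features file (made-up rows).
example = io.StringIO(
    "protid\tLeidenCluster\n"
    "P12345\t1\n"
    "Q67890\t2\n"
    "P12345\t1\n"
)

duplicates = find_duplicate_ids(example)
print(duplicates)
```

To run it on a real file, open the file with `open(path, newline="")` and pass that handle instead of the `io.StringIO` object.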

mezarque commented 1 year ago

This should be resolved as part of #39. The duplication was partly due to pagination issues in query_uniprot.py and partly due to shared hits having different TM-scores or e-values across Foldseek databases in extract_foldseekhits.py. We now use the lowest e-value for the same hit across databases by default.
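To illustrate the "lowest e-value per hit" rule described above, here is a rough sketch in plain Python. It is not the actual code in extract_foldseekhits.py. The function name `keep_lowest_evalue`, the dictionary keys (`protid`, `database`, `evalue`), and the example values are all hypothetical.

```python
def keep_lowest_evalue(hits):
    """Collapse repeated hits so each protein ID is kept once.

    When the same ID is returned by several databases, the record
    with the lowest e-value wins. Keys are assumed names, not the
    pipeline's real column names.
    """
    best = {}
    for hit in hits:
        current = best.get(hit["protid"])
        if current is None or hit["evalue"] < current["evalue"]:
            best[hit["protid"]] = hit
    return list(best.values())


# Made-up hits: the same protein found in two databases with different e-values.
hits = [
    {"protid": "A0A000", "database": "afdb50", "evalue": 1e-10},
    {"protid": "A0A000", "database": "afdb-swissprot", "evalue": 1e-12},
    {"protid": "B1B111", "database": "afdb50", "evalue": 3e-5},
]

unique_hits = keep_lowest_evalue(hits)
for hit in unique_hits:
    print(hit["protid"], hit["database"], hit["evalue"])
```

Keying on the protein ID alone is what removes the duplicates. Keying on the ID together with the score would keep both rows whenever the databases disagree, which matches the symptom reported in this issue.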