Arcadia-Science / ProteinCartography

a pipeline to build similarity maps of protein space
MIT License

Create feature matrix without using pivot table. #61

Closed: neevor closed this 1 year ago

neevor commented 1 year ago

This PR updates foldseek_clustering.py so that it no longer uses the pandas pivot table feature to generate the similarity matrix.

Some notes:

Resolves Arcadia-Science/ProteinCartography#49
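
For illustration, here is a minimal sketch of building a similarity matrix without a pivot table. It is not the code from this PR: the function name `build_similarity_matrix`, the three-column tab-separated layout (query, target, score), and the fill value of 0.0 for missing pairs are all assumptions made for the example.

```python
import io

import numpy as np
import pandas as pd


def build_similarity_matrix(handle, fill_value=0.0):
    """Build a square similarity matrix from (query, target, score) rows.

    Hypothetical helper: assumes tab-separated rows with exactly three fields.
    """
    rows = [line.rstrip("\n").split("\t") for line in handle if line.strip()]

    # Collect every identifier seen as a query or a target, in sorted order.
    ids = sorted({r[0] for r in rows} | {r[1] for r in rows})
    position = {name: i for i, name in enumerate(ids)}

    # Preallocate the dense matrix, then fill one cell per input row.
    matrix = np.full((len(ids), len(ids)), fill_value, dtype=float)
    for query, target, score in rows:
        matrix[position[query], position[target]] = float(score)

    return pd.DataFrame(matrix, index=ids, columns=ids)


# Made-up input with scores written in scientific notation.
example = io.StringIO(
    "A\tA\t1.000E+00\nA\tB\t2.927E-01\nB\tA\t2.927E-01\nB\tB\t1.000E+00\n"
)
similarity = build_similarity_matrix(example)
```

Filling a preallocated array avoids the intermediate long-format DataFrame that a pivot requires. Note that this sketch converts each score to a float as it goes, which is a choice made here for simplicity.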

neevor commented 1 year ago

I noticed one difference in the output. Because the code no longer opens the data with pandas (which automatically converts data types), the output of this update contains values like 2.927E-01 instead of 0.2927. These are the original values from the input file, and pandas can load them without issue, since that is what happened in the original approach. It is a different output, though, and I can change that if it would be an issue.
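
As a quick check of the point about loading, the sketch below reads a small matrix whose scores are written in scientific notation. The file contents and the `protid` index column name are made up for the example and are not taken from the pipeline's output.

```python
import io

import pandas as pd

# Hypothetical two-row matrix written with scores in scientific notation.
text = "protid\tA\tB\nA\t1.000E+00\t2.927E-01\nB\t2.927E-01\t1.000E+00\n"

frame = pd.read_csv(io.StringIO(text), sep="\t", index_col="protid")

# pandas infers a floating-point dtype for these columns on load.
print(frame.dtypes)
print(frame.loc["A", "B"])
```

Downstream code that reads the matrix through `pandas.read_csv` should therefore see numeric values either way; only tools that treat the file as raw text would notice the different notation.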

mezarque commented 1 year ago

> I noticed one difference in the output. Because the code no longer opens the data with pandas (which automatically converts data types), the output of this update contains values like 2.927E-01 instead of 0.2927. These are the original values from the input file, and pandas can load them without issue, since that is what happened in the original approach. It is a different output, though, and I can change that if it would be an issue.

This should probably be fine! I would recommend running the demo in the README to see if this set of changes causes any issues. (In the future we should definitely set up automated testing.)

mezarque commented 1 year ago

The ordering of the lines and columns of the similarity matrix might differ but the content should be the same.

Could you take two sorted DataFrames and compare their contents to ensure all the contents are correct?

e.g. via pandas.DataFrame.diff

neevor commented 1 year ago

> The ordering of the lines and columns of the similarity matrix might differ but the content should be the same.
>
> Could you take two sorted DataFrames and compare their contents to ensure all the contents are correct?
>
> e.g. via pandas.DataFrame.diff

Yeah, I ran several different subsamples of the real data (subsampled so that the original code could run) through both versions and compared the results. In every test case, the DataFrames from the original run and the new run matched.
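
A comparison of this kind could be sketched as follows. This is not the check that was actually run; the data and the helper name `sort_both_axes` are hypothetical. Because `pandas.DataFrame.diff` computes differences between consecutive rows within a single frame, the sketch uses `pandas.testing.assert_frame_equal` and `DataFrame.compare` to compare two frames instead.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

ids = ["A", "B", "C"]
values = np.array([[1.0, 0.3, 0.1], [0.3, 1.0, 0.5], [0.1, 0.5, 1.0]])

# Hypothetical outputs: same content, different row and column order.
original = pd.DataFrame(values, index=ids, columns=ids)
shuffled = original.loc[["C", "A", "B"], ["B", "C", "A"]]


def sort_both_axes(frame):
    """Return the frame with rows and columns sorted by label."""
    return frame.sort_index(axis=0).sort_index(axis=1)


left = sort_both_axes(original)
right = sort_both_axes(shuffled)

# Raises AssertionError if labels or values differ (within float tolerance).
assert_frame_equal(left, right)

# DataFrame.compare returns only the differing cells; empty means identical.
differences = left.compare(right)
```

Sorting both axes first matters because `DataFrame.compare` requires identically labeled frames, and because the row and column order of the two outputs may differ.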

neevor commented 1 year ago

OK, I believe I have updated the code and addressed the comments. I will now focus on running the demo set to make sure everything works.