Closed neevor closed 1 year ago
I noticed one difference in the output. Because the code no longer opens the file with pandas (which automatically converts the data types), this update writes values like 2.927E-01 instead of 0.2927. These are the original values from the input file, and pandas can load them without issue, since that is what happened in the original approach. It is a different output, though, and I can change that if it would be an issue.
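Both spellings denote the same number, so any numeric parser reads them identically. The following is a minimal sketch using only the standard library; the `protid` column name, the `load_matrix` helper, and the sample values are made up for illustration and are not taken from the PR.

```python
import csv
import io

# Hypothetical excerpt of a matrix as the updated code would write it
# (values copied verbatim from the input file, in scientific notation).
new_output = "protid\tA\tB\nA\t1.000E+00\t2.927E-01\nB\t2.927E-01\t1.000E+00\n"

# The same hypothetical matrix as the earlier pandas-based code would write it.
old_output = "protid\tA\tB\nA\t1.0\t0.2927\nB\t0.2927\t1.0\n"


def load_matrix(text):
    """Parse a tab-separated matrix into {row_label: {column_label: float}}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        row["protid"]: {k: float(v) for k, v in row.items() if k != "protid"}
        for row in reader
    }


# The textual formatting differs, but the parsed numbers are identical.
assert load_matrix(new_output) == load_matrix(old_output)
```

Only consumers that read the file as raw text, rather than parsing the numbers, would see a difference.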
This should probably be fine! I would recommend running the demo in the README to see whether this set of changes causes any issues (and in the future we should definitely set up automated testing).
The ordering of the rows and columns of the similarity matrix might differ, but the content should be the same.
Could you take two sorted DataFrames and compare them to ensure all the contents are correct?
e.g. via pandas.DataFrame.compare or pandas.testing.assert_frame_equal
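One way to run such a check is sketched below. For comparing two frames, pandas offers `pandas.testing.assert_frame_equal` and `DataFrame.compare` (`DataFrame.diff`, by contrast, takes differences between consecutive rows within a single frame). The two small matrices and the `sort_both_axes` helper are hypothetical stand-ins for the outputs of the original and updated code.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical outputs: same content, different row and column order.
original = pd.DataFrame(
    [[1.0, 0.2927], [0.5, 1.0]], index=["A", "B"], columns=["A", "B"]
)
updated = pd.DataFrame(
    [[1.0, 0.5], [0.2927, 1.0]], index=["B", "A"], columns=["B", "A"]
)


def sort_both_axes(df):
    """Return a copy with rows and columns sorted by label."""
    return df.sort_index(axis=0).sort_index(axis=1)


original_sorted = sort_both_axes(original)
updated_sorted = sort_both_axes(updated)

# Raises AssertionError describing the mismatch if the frames differ.
assert_frame_equal(original_sorted, updated_sorted)

# compare() returns only the cells that differ; an empty result means a match.
differences = original_sorted.compare(updated_sorted)
assert differences.empty
```

Sorting both axes first matters because `DataFrame.compare` requires identically labeled frames, and `assert_frame_equal` treats a different label order as a mismatch by default.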
Yeah, I ran several different subsamples of the real data, small enough that the original code could still run for comparison, and in all test cases the DataFrames from the original and new runs matched.
OK, I believe I have addressed the comments. I will now focus on running the demo set to make sure that is all good.
This PR updates the code in foldseek_clustering.py to no longer use the pandas pivot table feature to generate the similarity matrix.
Some notes:
pivot_foldseek_results no longer returns the DataFrame at the end. I searched the code and didn't see anywhere that the return value was being used, so I think that is OK, but let me know if that change will interfere with anything. Not returning the DataFrame also saves a large amount of memory, because the DataFrame is large.
Resolves Arcadia-Science/ProteinCartography#49
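For readers unfamiliar with the idea, the sketch below shows one way to build a square matrix from long-format (query, target, score) rows without a pandas pivot and without returning a DataFrame. It is not the code from this PR: the `write_similarity_matrix` name, the three-column input layout, the `protid` header, and the `"0"` fill value are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def write_similarity_matrix(pairs_tsv, out, fill="0"):
    """Write a square matrix from (query, target, score) rows without pivoting.

    Scores are kept as the original strings, so "2.927E-01" is written as-is.
    Nothing is returned; the result only goes to the output stream.
    """
    scores = defaultdict(dict)
    labels = set()
    for query, target, score in csv.reader(pairs_tsv, delimiter="\t"):
        scores[query][target] = score
        labels.update((query, target))

    ordered = sorted(labels)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["protid"] + ordered)
    for row_label in ordered:
        row = scores.get(row_label, {})
        # Missing pairs get the assumed fill value.
        writer.writerow([row_label] + [row.get(col, fill) for col in ordered])


# Hypothetical input: three pairwise hits, one pair (B, B) missing.
pairs = io.StringIO("A\tB\t2.927E-01\nB\tA\t2.927E-01\nA\tA\t1.000E+00\n")
result = io.StringIO()
write_similarity_matrix(pairs, result)
print(result.getvalue())
```

Because the scores are never converted to floats, the output keeps the input's scientific notation, which matches the formatting difference discussed in the comments.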