Create feature matrix without using pivot table.

neevor commented 1 year ago

This PR updates foldseek_clustering.py so that it no longer uses the pandas pivot table feature to generate the similarity matrix.

Some notes:

I am using a static value of 100 for the chunksize when working in parallel. This value did a good job of keeping the serialization overhead down and the workers consistently busy, but it is not currently adjustable, which might be desired.
By default this uses all of the available cores on the machine it is running on. The core count is settable, but since the option is new, the rest of the pipeline does not set a specific value. I think this is OK for a Snakemake workflow, as you usually want to make maximal use of resources.
pivot_foldseek_results no longer returns the data frame at the end. I searched the code and did not see the return value being used anywhere, so I think that is OK, but let me know if that change will interfere with anything. Not returning it also avoids holding a large data frame in memory.
The ordering of the rows and columns of the similarity matrix might differ, but the content should be the same.
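
As a rough illustration of the approach described in these notes, here is a minimal sketch. It is not the code from this PR. It assumes the Foldseek results have already been reduced to (query, target, score) triples; the names `write_similarity_matrix` and `format_row`, the `query` header label, and the "0" fill value for missing pairs are all hypothetical. Passing `processes=None` to `multiprocessing.Pool` uses every available core, and `chunksize=100` mirrors the static value mentioned above.

```python
import csv
import io
from collections import defaultdict
from functools import partial
from multiprocessing import Pool


def format_row(item, columns, fill="0"):
    """Expand one query's {target: score} dict into a full, ordered matrix row."""
    query, scores = item
    return [query] + [scores.get(col, fill) for col in columns]


def write_similarity_matrix(records, out, processes=None, chunksize=100):
    """Write a wide similarity matrix from long-format (query, target, score) triples.

    Scores are kept as the strings found in the input, so no pivot table
    (and no type conversion) is involved.
    """
    by_query = defaultdict(dict)
    targets = set()
    for query, target, score in records:
        by_query[query][target] = score
        targets.add(target)

    columns = sorted(targets)  # hypothetical choice: sorted column order
    items = sorted(by_query.items())
    worker = partial(format_row, columns=columns)

    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["query"] + columns)

    if processes == 1:
        # Serial path: no worker processes, no pickling overhead.
        writer.writerows(map(worker, items))
    else:
        # processes=None lets Pool use all available cores.
        with Pool(processes) as pool:
            # imap yields rows in input order; chunksize batches the work
            # sent to each worker to reduce serialization overhead.
            writer.writerows(pool.imap(worker, items, chunksize=chunksize))


if __name__ == "__main__":
    demo = [
        ("a", "a", "1.000E+00"),
        ("a", "b", "2.927E-01"),
        ("b", "a", "2.927E-01"),
    ]
    buffer = io.StringIO()
    write_similarity_matrix(demo, buffer)
    print(buffer.getvalue())
```

Only the rows are formatted in parallel here; the grouping step stays serial because it needs to see every record before the column set is known.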

Resolves Arcadia-Science/ProteinCartography#49

neevor commented 1 year ago

I noticed one difference in the output. Because the code no longer passes the data through pandas (which auto-converts the data types), the output of this update contains values like 2.927E-01 instead of 0.2927. These are the original values from the input file, and pandas can load them without issue, since that is what happened in the original approach. It is a different output, though, and I can change that if it would be an issue.
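
To illustrate the point about scientific notation, the following small sketch loads a made-up two-by-two matrix (the contents and the `query` header are assumptions, not real pipeline output) and checks that pandas parses `2.927E-01` as an ordinary float.

```python
import io

import pandas as pd

# Hypothetical matrix text in the style of the new output.
raw = "query\ta\tb\na\t1.000E+00\t2.927E-01\nb\t2.927E-01\t1.000E+00\n"

df = pd.read_csv(io.StringIO(raw), sep="\t", index_col=0)

print(df.dtypes)          # both columns are parsed as float64
print(df.loc["a", "b"])   # the value that was written as 2.927E-01
```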

mezarque commented 1 year ago

> I noticed one difference in the output. The code is no longer opening with pandas (which will auto convert the data types) the output of this update would be values like 2.927E-01 instead of 0.2927. These are the original values from the input file and pandas can load these without issue as that's what was happening in the original approach, but it is a different output and I can change that if it would be an issue.

This should probably be fine! I would recommend running the demo in the README to see if this set of changes causes any issues (and in the future we should definitely set up automated testing).

mezarque commented 1 year ago

> The ordering of the lines and columns of the similarity matrix might differ but the content should be the same.

Could you take two sorted DataFrames and compare their contents to ensure all the contents are correct?

e.g. via pandas.DataFrame.diff

neevor commented 1 year ago

>> The ordering of the lines and columns of the similarity matrix might differ but the content should be the same.
>
> Could you take two sorted DataFrames and compare their contents to ensure all the contents are correct?
>
> e.g. via pandas.DataFrame.diff

Yeah, I ran both versions on several different subsamples of the real data, small enough that the original code could still run, and compared the results. In all test cases the data frames from the original run and the new run matched.
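
For reference, an order-insensitive comparison of this kind could look like the sketch below. It is an assumption about how such a check might be written, not the check that was actually run. Note that `pandas.DataFrame.diff` computes differences between consecutive rows of a single frame, so the sketch uses `pandas.testing.assert_frame_equal` on the sorted frames instead. The helper name `same_matrix` and the sample values are hypothetical.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def same_matrix(old, new):
    """Return True if two matrices match once rows and columns are sorted."""
    old_sorted = old.sort_index(axis=0).sort_index(axis=1)
    new_sorted = new.sort_index(axis=0).sort_index(axis=1)
    try:
        # check_exact=False tolerates tiny floating-point differences.
        assert_frame_equal(old_sorted, new_sorted, check_exact=False)
    except AssertionError:
        return False
    return True


# Hypothetical data: same content, different row and column order.
old = pd.DataFrame({"a": [1.0, 0.2927], "b": [0.2927, 1.0]}, index=["a", "b"])
new = old.loc[["b", "a"], ["b", "a"]]

print(same_matrix(old, new))
```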

neevor commented 1 year ago

OK, I believe I have updated the PR and addressed the comments. I will now focus on running the demo set to make sure everything works.

Arcadia-Science / ProteinCartography

Create feature matrix without using pivot table. #61