ContactFrequency to pandas dataframe

PeptideSimulator01 commented 4 years ago

Dear all,

I really like using ContactMap Explorer! I was wandering if there is a way to create a pandas dataframe out of a ContactFrequency calculation.

Like:

traj=mdt.load('trajectory.h5')
peptide = topology.select('chainid 1 and element != "H"')
protein = topology.select('chainid 0 and resid 77 or resid 124 or resid 130 or resid 137 and element != "H"')
contacts = ContactFrequency(traj, query=peptide, haystack=protein, cutoff=0.45)
df=pd.DataFrame(contacts)

where the query would be the columns and the haystack the rows or the other way round, doesn't matter.

Thanks for your effort! Have a nice day.

dwhswenson commented 4 years ago

First, keep in mind that the contact frequency itself isn't the contact matrix, because it holds information about both the atom-atom contact matrix and about the residue-residue contact matrix.

Assuming you want residue-residue contacts, you can (in principle) do this with:

df = contacts.residue_contacts.df

HOWEVER, pandas 1.0 (released around Feb 1) changed a bunch of stuff, including the way to read in sparse matrices (we store the contact matrices as sparse matrices). When I tried to update Contact Map Explorer to work with pandas 1.0, I found that the sparse matrix implementation in pandas had a bug: it gave completely incorrect results in some cases. I think this has since been fixed in pandas (https://github.com/pandas-dev/pandas/issues/29814), but I haven't gotten around to updating the PR here (#69).

So to convert to pandas, you need to use a version of pandas older than 1.0 (for now). I believe that pandas==0.25.3 was their last version prior to the 1.0 release.

PeptideSimulator01 commented 4 years ago

Thank you, I will downgrade my pandas and then try again. In the meantime, is there a way to get the contacts.residue_contacts as a np.array and then convert this into a pandas dataframe?

dwhswenson commented 4 years ago

I think the way to get there would be to go through a scipy.sparse matrix; something like

residue_contacts.sparse_matrix.toarray()

should do it. Be aware; I think all of this returns a square matrix that is the size of the full number of residues in the topology, but everything will be zero except the entries where query meets haystack (that is also true for the .df property, which first creates the scipy sparse matrix and then sends that to pandas as input for a dataframe.)

PeptideSimulator01 commented 4 years ago

it worked with pandas 0.25.3 and I am checking now if the values are correct. And also yes, this gives a square matrix with all the contacts, but I think this can be handled with some pandas remodelling.

sroet commented 4 years ago

Since the release of contact-map v0.5.1, this should also work with pandas > 1.0.

@PeptideSimulator01 was there anything left before closing this issue?

PeptideSimulator01 commented 4 years ago

Thanks a lot @sroet ! Actually, there is one last thing, which is how to cite your work. I didn't found a proper citation for the Contact Map Explorer. Let me know if you have citation, which suits you. Would be glad to honor your effort.

dwhswenson commented 4 years ago

Thanks @PeptideSimulator01. We're working on this! I've gotten a few questions about this lately. I think we'll set up Zenodo DOIs for releases in the short term, which technically makes the code citable. (@sroet is looking into that now.) We're planning to write a paper for peer review as we finish up a few more things that would make it more attractive for publication.

dwhswenson / contact_map

ContactFrequency to pandas dataframe #74