debbiemarkslab / EVcouplings

Evolutionary couplings from protein and RNA sequence alignments
http://evcouplings.org
Other
235 stars 76 forks source link

Possible inconsistency in CSV export of secondary struct #98

Closed kpgbrock closed 6 years ago

kpgbrock commented 6 years ago

In this function ( https://github.com/debbiemarkslab/EVcouplings/blob/master/evcouplings/visualize/pairs.py#L752 ), a column with id needs to be defined, but in sec_struct.csv the column's header is just i. Maybe inconsistent?

thomashopf commented 6 years ago

The distinction between id and i in general is intentional - the former is a residue identifier, the latter is a position in a sequence.

DSSP-extracted secondary structure is an example of the first (even though in our setting id == i since the structures are remapped into actual sequence indices using SIFTS). Secondary structure predictions are always an example of the second, which is why the CSV has i as column header.

The way the plotting function is currently implemented is so one can simply plonk the residues dataframe associated with a PDB structure object into the plotting function, even though the plotting function assumes that the identifiers are actually sequence indices.

One could argue that the function should expect a dataframe with column i instead of id to explicitly force the user to think about the conversion by renaming the column, but then that feels like a hassle to me because you have to rename every time you plot an experimental structure. Alternatively one could look for existence of i or id.

So not a bug, and I would just leave design as is for now since it doesn't have any real consequences for users.

sacdallago commented 6 years ago

Hi @thomashopf ,

fair enough, a note somewhere of this is though worth. I guess that if someone was to look at the issues an finds this, it's good enough. I'll also upload @kpgbrock 's notebook to the notebooks folder in a PR soon