a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

Insertions always removed #173

Closed: OliverT1 closed this issue 2 years ago

OliverT1 commented 2 years ago

Even when config.insertions is set to True, insertions are always removed and never included as nodes in the generated graph.

This is because the config.insertions parameter is not passed to process_dataframe() in graphs.py, so the default value of False is always used.
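The failure mode described above can be sketched in a few lines. This is a simplified, hypothetical stand-in and not graphein's actual code: `Config`, `build_nodes_buggy`, `build_nodes_fixed`, the row layout, and this toy `process_dataframe` are all assumptions made for illustration. It only shows how a keyword argument that is never forwarded silently falls back to its default.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Hypothetical stand-in for the graph config; only the field under discussion.
    insertions: bool = False


def process_dataframe(rows, insertions=False):
    """Toy version: drop rows carrying an insertion code unless insertions=True."""
    if insertions:
        return list(rows)
    return [row for row in rows if row["insertion"] == ""]


def build_nodes_buggy(rows, config):
    # Bug pattern: config.insertions is never forwarded,
    # so process_dataframe always uses its default of False.
    return process_dataframe(rows)


def build_nodes_fixed(rows, config):
    # Fix pattern: pass the config value through explicitly.
    return process_dataframe(rows, insertions=config.insertions)


# Illustrative rows: residue 111 followed by an inserted residue 111A.
rows = [
    {"residue_number": 111, "insertion": ""},
    {"residue_number": 111, "insertion": "A"},
    {"residue_number": 112, "insertion": ""},
]
cfg = Config(insertions=True)

buggy_nodes = build_nodes_buggy(rows, cfg)  # insertion dropped despite the config
fixed_nodes = build_nodes_fixed(rows, cfg)  # insertion kept
```

With `insertions=True` in the config, the buggy variant still drops the 111A row, while the fixed variant keeps all three rows.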

a-r-j commented 2 years ago

Thanks for the report and quick PR.

Out of interest, what's your use case here? I've considered adding support for an optional entry point to select insertions of interest, though they've never been something I've wanted in practice.

OliverT1 commented 2 years ago

I'm working on antibodies. A common numbering scheme (e.g. IMGT) is often used as a form of sequence alignment. Because CDR3 length is so variable, there are often many insertions, placed at position 111, as well as deletions. The deletions also cause problems in peptide bond edge creation: they leave non-consecutive residue numbers, and the peptide bond function does not connect those residues.
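To make the peptide bond problem concrete, here is a minimal sketch comparing two ways of linking residues. Everything in it is an assumption for illustration: the residue tuples, the example numbers, and both helper functions are hypothetical, and the numbering-based rule is a naive stand-in, not graphein's peptide bond function.

```python
def bonds_by_number(residues):
    """Naive rule: link neighbouring entries only if their residue numbers differ by exactly 1."""
    edges = []
    for a, b in zip(residues, residues[1:]):
        same_chain = a[0] == b[0]
        if same_chain and b[1] - a[1] == 1:
            edges.append((a, b))
    return edges


def bonds_by_order(residues):
    """Alternative rule: link entries that are adjacent in file order within the same chain."""
    edges = []
    for a, b in zip(residues, residues[1:]):
        if a[0] == b[0]:
            edges.append((a, b))
    return edges


# (chain, residue_number, insertion_code), in file order.
# Illustrative fragment: a numbering gap after 106 and an insertion 111A.
residues = [
    ("H", 105, ""),
    ("H", 106, ""),
    ("H", 109, ""),   # 107-108 absent in the numbering scheme (a "deletion")
    ("H", 110, ""),
    ("H", 111, ""),
    ("H", 111, "A"),  # insertion at position 111
    ("H", 112, ""),
]

number_edges = bonds_by_number(residues)
order_edges = bonds_by_order(residues)
```

The numbering-based rule misses the 106 to 109 link and the 111 to 111A link, while the order-based rule connects every neighbouring pair. Linking by order alone would also join residues across genuinely missing stretches of structure, so a real implementation would likely need an extra check, such as a distance cutoff between backbone atoms.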

I've worked around this in a hacky way, and I've also written some other antibody-specific functions, such as only creating nodes for surface-exposed residues in the CDR vicinity. Happy to add these if there's any interest.

a-r-j commented 2 years ago

Ah, I see; that makes sense.

That'd be great! Happy to discuss. I've just added a tutorial notebook applying E(n) GNNs to antibody developability prediction here that you might be interested in.

OliverT1 commented 2 years ago

Great, I'll get in touch when I've wrapped up this project then. Thanks for sharing, very similar to what I'm doing! I'm also using the Satorras model.