Benjamin-Lee / deep-rules

Ten Quick Tips for Deep Learning in Biology
https://benjamin-lee.github.io/deep-rules/

Biological sequence data is related in complex ways and this makes validation difficult #202

Closed by jgreener64 4 years ago

jgreener64 commented 5 years ago

This is a cool project and the draft is looking nice.

I have one thing I would add, probably to "Tip 7: Address deep neural networks' increased tendency to overfit the dataset". I'm mentioning it here for discussion, and I would be happy to add some text describing it if wanted.

When splitting a dataset of biological sequences or structures, care should be taken that there is no evolutionary relationship between sequences in the training, validation and test sets. Many people split proteins into datasets using a threshold of 30% sequence identity, i.e. no sequence in one set is 30% or more identical to any sequence in another set. However, it is known that many proteins share homology down to effectively 0% sequence identity, so an identity threshold alone does not guarantee that the sets are unrelated - see Figure 1 of Chothia and Lesk 1986 for example.
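
To make the conventional check concrete, here is a minimal Python sketch of flagging train/test pairs at or above a 30% identity threshold. It assumes the sequences have already been aligned to equal length by some external tool, and the function names, the identity definition (matches over columns that are not gaps in both sequences) and the toy sequences are all made up for illustration. Passing this check does not rule out homology, which is the point above.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where both sequences have a gap are ignored. This is only one
    of several definitions in use, picked here for illustration.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != gap)
    return 100.0 * matches / len(columns)


def pairs_over_threshold(train, test, threshold=30.0):
    """Return (train_id, test_id) pairs whose identity is >= threshold."""
    return [
        (i, j)
        for i, a in train.items()
        for j, b in test.items()
        if percent_identity(a, b) >= threshold
    ]


# Toy, hypothetical sequences (already aligned, no gaps needed).
train = {"t1": "MKVLAAGIVG", "t2": "MSTNPKPQRK"}
test = {"x1": "MKVLSAGIVG", "x2": "GHWYDDCCFF"}

flagged = pairs_over_threshold(train, test)
print(flagged)
```

Any flagged pair would need one of its members removed or moved before benchmarking. An empty list only says the sets pass the identity threshold, not that they are evolutionarily unrelated.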

Poor dataset splitting means that the method being benchmarked appears to perform better than it does, because one is partially measuring its ability to detect homologs. This problem affects protein secondary structure prediction, tertiary contact prediction and protein design studies; in fact, almost anywhere protein data is used for machine learning. One way around it is to use databases such as CATH and ECOD to split sequences based on structural and evolutionary relationships.
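
As a rough sketch of what a classification-based split could look like, the Python below assigns whole groups to train/val/test so that no group spans two sets. It assumes each sequence already carries a group label, for example a superfamily identifier looked up in CATH or ECOD. The lookup itself is not shown, and `split_by_group`, the greedy assignment rule and the labels are hypothetical.

```python
from collections import defaultdict


def split_by_group(labels, fractions=(0.8, 0.1, 0.1)):
    """Assign whole groups to train/val/test so no group spans two splits.

    labels: dict mapping a sequence ID to a group label (assumed to come
    from a structural/evolutionary classification).
    """
    groups = defaultdict(list)
    for seq_id, group in labels.items():
        groups[group].append(seq_id)

    names = ("train", "val", "test")
    splits = {name: [] for name in names}
    total = len(labels)

    # Place the largest groups first; each goes to the split that is
    # currently furthest below its target size.
    for group in sorted(groups, key=lambda g: (-len(groups[g]), g)):
        deficits = [
            frac * total - len(splits[name])
            for name, frac in zip(names, fractions)
        ]
        target = names[deficits.index(max(deficits))]
        splits[target].extend(groups[group])
    return splits


# Hypothetical labels, for illustration only.
labels = {
    "seqA": "sf1", "seqB": "sf1", "seqC": "sf1", "seqD": "sf1",
    "seqE": "sf2", "seqF": "sf2",
    "seqG": "sf3",
    "seqH": "sf4",
}
splits = split_by_group(labels)
print(splits)
```

Because groups are kept intact, the resulting set sizes only approximate the requested fractions, and a few very large superfamilies can skew them considerably.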

Tagging @Benjamin-Lee as I think he drafted this section.

agitter commented 5 years ago

@jgreener64 this should definitely be covered in one of the tips. The commentary in #203 discusses this too and gives references to other datasets where this is a problem in addition to protein sequences (e.g. gene networks). #190 is also related and links a paper describing evaluation in biochemistry.

Would you like to take a pass at drafting this text? The project has been dormant and could use new engaged contributors.

jgreener64 commented 5 years ago

Great, I'll have a go at drafting some text.

Benjamin-Lee commented 4 years ago

@jgreener64 sorry to bump this back up, but are you still interested in drafting text? If not, I can try to add in a mention of this.

jgreener64 commented 4 years ago

I don't think I'll have time to draft any text on this. Feel free to mention it however you like, of course.