EnsemblGSOC / Ensembl-Repeat-Identification

A Deep Learning repository for predicting the location and type of repeat sequence in genome.

predict a single class for all bases in a non-repeat subsequence #53

Open williamstark01 opened 1 year ago

williamstark01 commented 1 year ago

I notice that the model does a good job of predicting a repeat but struggles to replicate the surrounding sequence. Here are the first parts of the subsequences from this sample prediction:

AGAACCTATTATTTGCATGA🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑TAGAAGAAACCTGTATTTTTTTCATCA
CGAAATTTATTATTTATATA🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑TAAAAAAAATTTATATTTTTTTTATTA

I realize that we don't need the model to reproduce these bases, as we only need to know that no repeat is present in these subsequences. Would it make sense, then, to predict a single additional class for all bases in non-repeat subsequences, so that the prediction and output of the model look like this?

AGAACCTATTATTTGCATGA🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑TAGAAGAAACCTGTATTTTTTTCATCA
____________________🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑___________________________

(Or any other character to represent the absence of a repeat.)

Would that be easy to test?
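
One way to try this without changing the model itself might be to collapse the target labels during preprocessing. Below is a minimal sketch, not the repository's actual code: the `collapse_non_repeat` function, the `NON_REPEAT` placeholder and the `BASES` set are hypothetical names, and it assumes that non-repeat positions are labeled with plain base letters while repeat positions use some other symbol (🥑 in the example above).

```python
# Hypothetical placeholder class for "no repeat at this position".
NON_REPEAT = "_"

# Assumption: non-repeat positions are labeled with plain base letters.
BASES = frozenset("ACGTN")


def collapse_non_repeat(labels: str, placeholder: str = NON_REPEAT) -> str:
    """Replace every base label with one placeholder; keep repeat labels as-is."""
    return "".join(
        placeholder if symbol.upper() in BASES else symbol
        for symbol in labels
    )


target = "AGAACCTATTATTTGCATGA🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑🥑TAGAAGAAACCTGTATTTTTTTCATCA"
collapsed = collapse_non_repeat(target)

# The collapsed target has the same length as the input, so it can still be
# aligned position by position with the model output.
print(collapsed)
```

With targets collapsed this way, the output vocabulary shrinks to the repeat classes plus one non-repeat class, and the existing training loop could be rerun on the new targets to compare results.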