EnsemblGSOC / Ensembl-Repeat-Identification

A deep learning repository for predicting the location and type of repeat sequences in a genome.

Dealing with forward- and reverse-strand in datasets #19

Open yangtcai opened 2 years ago

yangtcai commented 2 years ago

Hi @williamstark01, I am implementing label normalization. Each label is a tuple (left, right), and its coordinates within the original sequence window (seqstart:seqend) should be mapped to the range from 0 to 1. So the normalized tuple should be ((left - seqstart) / (seqend - seqstart), (right - seqstart) / (seqend - seqstart)). Here we need to take the two strands into account: on the forward strand seqstart < seqend, while on the reverse strand it is the opposite. If both kinds of strand are fed into our model, it has to learn to tell them apart and produce the matching kind of output, which is an unnecessary burden. We should guarantee that the model only ever sees one kind of strand as input. A workaround is to convert every reverse strand to a forward strand, which would solve the problem. Also, to quickly prove our concept of DETR in biology, we can temporarily ignore the reverse strand. What do you think about this? 😃
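
For reference, the normalization described above could look roughly like the sketch below. It assumes a forward-strand window (seqstart < seqend) and that the label and the window use the same coordinate system. The function name `normalize_label` and its parameter names are hypothetical, not taken from the repository.

```python
def normalize_label(left, right, seq_start, seq_end):
    """Map a repeat label (left, right) into the range [0, 1].

    Hypothetical helper: assumes a forward-strand window, i.e.
    seq_start < seq_end, with the label lying inside the window.
    """
    if seq_start >= seq_end:
        raise ValueError("expected a forward-strand window (seq_start < seq_end)")
    if not (seq_start <= left <= right <= seq_end):
        raise ValueError("label must lie inside the sequence window")

    length = seq_end - seq_start
    # Both ends are measured from seq_start and scaled by the window length.
    return ((left - seq_start) / length, (right - seq_start) / length)


# A repeat at 1200..1500 inside the window 1000..2000
print(normalize_label(1200, 1500, 1000, 2000))
```

Both ends are offset by seqstart, so a label that spans the whole window maps to (0.0, 1.0).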

williamstark01 commented 2 years ago

Hey Yantong, I think you are right. Whether a sequence is on the forward or the reverse strand is a property that the model doesn't need to know about. For prototyping, the solution you propose sounds good: we can use only the forward strand initially. (Later on we can create a helper function that converts reverse strands and their coordinates as if they were forward strands, so we can include them in the training dataset and run inference on them. This can be an issue added to the TODO list (just renamed it) on the Kanban board.)
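
A minimal sketch of what such a helper might look like, under stated assumptions: the reverse-strand sequence is available as a plain string, and the label is given as 0-based, half-open offsets into that string. The name `to_forward_strand` is hypothetical, and the real dataset may well use a different coordinate convention.

```python
# Complement table for DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def to_forward_strand(sequence, left, right):
    """Convert a reverse-strand sequence and label to forward-strand form.

    Hypothetical helper: `left` and `right` are assumed to be 0-based,
    half-open offsets into `sequence`. Returns the reverse complement of
    the sequence and the label mirrored onto it.
    """
    if not (0 <= left <= right <= len(sequence)):
        raise ValueError("label must lie inside the sequence")

    forward = sequence.translate(_COMPLEMENT)[::-1]
    # Reversing the sequence mirrors every offset around its length,
    # so the old right end becomes the new left end.
    return forward, len(sequence) - right, len(sequence) - left


forward, left, right = to_forward_strand("AACCGT", 0, 2)
print(forward, left, right, forward[left:right])
```

After the conversion the label still covers the same bases, now read on the opposite strand, so the forward-strand normalization can be applied to it unchanged.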

williamstark01 commented 2 years ago

Actually, this issue itself can serve as that item; adding it now.