EnsemblGSOC / Ensembl-Repeat-Identification

A Deep Learning repository for predicting the location and type of repeat sequence in genome.

Simple data analysis about repeat sequence. #22

Closed yangtcai closed 2 years ago

yangtcai commented 2 years ago

There is a big issue with our repeat sequence datasets. I did a quick analysis on chr1 and found that it produces 165970 segments of 2k length, but only 6442 of those segments have repeat class annotations.

image

Should we drop the samples that have no repeat class annotations? Also, I think those samples are not false samples, since they will not contribute to the gradient in training. Is my understanding correct?
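A count of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the analysis actually run for this issue: the function name `count_annotated_segments` is hypothetical, the windows are taken as non-overlapping, coordinates are treated as 0-based and half-open, and the annotations are assumed to sit in a pandas DataFrame with `start` and `end` columns.

```python
import pandas as pd


def count_annotated_segments(anno_df, sequence_length, segment_length=2000):
    """Return (total_segments, annotated_segments).

    Assumptions (not taken from the repository): windows do not overlap,
    coordinates are 0-based and half-open, and a segment counts as annotated
    only when at least one repeat lies entirely inside it.
    """
    # Ceiling division, so a shorter final window is still counted.
    total_segments = -(-sequence_length // segment_length)

    # Ignore malformed annotations where start is not before end.
    valid = anno_df[anno_df["start"] < anno_df["end"]]

    # Index of the window holding the first and the last base of each repeat.
    first = valid["start"] // segment_length
    last = (valid["end"] - 1) // segment_length

    # A repeat is fully contained when both ends fall in the same window.
    contained = first[first == last]
    return int(total_segments), int(contained.nunique())


example = pd.DataFrame(
    {
        "start": [100, 1900, 4000, 4500],
        "end": [300, 2100, 6000, 4600],
    }
)
counts = count_annotated_segments(example, sequence_length=10_000)
print(counts)  # (5, 2)
```

In the toy example the repeat at 1900-2100 crosses a window boundary, so it does not mark either window as annotated.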

williamstark01 commented 2 years ago

I think you are right about the gradient not being affected by sequences that don't include the repeats we are training for. Good observation.

On a related note, how do you think it's best to handle partially included repeats? Currently we are removing them completely, but if a repeat is mostly, but not completely, included in a sequence we are using for training, we are actually providing the wrong label to the network.

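        # Keep only repeats that lie entirely inside the segment [start, end];
        # repeats that cross a segment boundary are dropped.
        repeats_in_sequence = anno_df.loc[
            (anno_df["start"] >= start)
            & (anno_df["end"] <= end)
            & (anno_df["start"] < anno_df["end"])
        ]

yangtcai commented 2 years ago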

I think it's another big issue we should consider; I did not realize it before you pointed it out. The labels are crucial for generating our training datasets, so I will fix the wrong labels as soon as possible. The solution should be to keep all the repeat sequences, as in the following picture. The black lines mark two segments and the red line marks a repeat sequence. If a repeat spans two 2k segments, we will split the original repeat into two parts, one per segment.

image

What do you think about this?
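The splitting idea could look roughly like the sketch below. It is an illustration only, not the code that went into the repository: `split_repeats_by_segment` and the `repeat_class` column are hypothetical names, windows are assumed to be non-overlapping, and coordinates are treated as 0-based and half-open.

```python
import pandas as pd


def split_repeats_by_segment(anno_df, segment_length=2000):
    """Clip every repeat to each fixed-length segment it overlaps.

    Assumes columns "start", "end" and a hypothetical "repeat_class".
    A repeat that crosses a segment boundary yields one row per segment.
    """
    rows = []
    for repeat in anno_df.itertuples(index=False):
        if repeat.start >= repeat.end:
            continue  # skip malformed annotations
        first_segment = repeat.start // segment_length
        last_segment = (repeat.end - 1) // segment_length
        for segment in range(first_segment, last_segment + 1):
            segment_start = segment * segment_length
            segment_end = segment_start + segment_length
            rows.append(
                {
                    "segment": segment,
                    # Clip the repeat to the bounds of the current segment.
                    "start": max(repeat.start, segment_start),
                    "end": min(repeat.end, segment_end),
                    "repeat_class": repeat.repeat_class,
                }
            )
    return pd.DataFrame(rows, columns=["segment", "start", "end", "repeat_class"])


annotations = pd.DataFrame(
    {
        "start": [1800, 500],
        "end": [2300, 700],
        "repeat_class": ["LINE", "SINE"],
    }
)
split = split_repeats_by_segment(annotations)
print(split)
```

Here the repeat at 1800-2300 becomes two rows, 1800-2000 in segment 0 and 2000-2300 in segment 1, each keeping the original class. Whether very short clipped fragments should still be labeled is a separate choice that this sketch does not make.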

williamstark01 commented 2 years ago

Nice description and I think that's a good solution, let's handle it this way.