EnsemblGSOC / Ensembl-Repeat-Identification

A deep learning repository for predicting the location and type of repeat sequences in a genome.

Deal with datasets format #5

Closed yangtcai closed 2 years ago

yangtcai commented 2 years ago

Hi @williamstark01, now that our datasets have been generated, their format looks like this:

sequence format: chr1: 1 - 100000
label format: chr1 start end subtype

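However, this is not well suited to a data loader: a label only names the chromosome, so it does not match a specific sequence segment. I plan to change the dataset format so that each label carries the key of its sequence. The new format could be: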

sequence format: chr1: 1 - 100000
label format: chr1:1-100000 start end subtype
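
To illustrate why the segment-keyed labels would be easier for a data loader to consume, here is a minimal sketch, not the repository's actual code. It assumes label lines are whitespace-separated, that a sequence header like `chr1: 1 - 100000` can be normalized to the key `chr1:1-100000` by dropping spaces, and that `start` and `end` are integers (whether they are absolute or segment-relative coordinates is not specified above). The names `Repeat`, `normalize_key`, `parse_segment_key`, and `group_labels_by_segment`, and the subtype values in the example, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Repeat(NamedTuple):
    """One label row (hypothetical container)."""
    start: int
    end: int
    subtype: str


def normalize_key(header: str) -> str:
    """Turn a sequence header like 'chr1: 1 - 100000' into 'chr1:1-100000'."""
    return "".join(header.split())


def parse_segment_key(key: str) -> Tuple[str, int, int]:
    """Split 'chr1:1-100000' into (chromosome, segment_start, segment_end)."""
    chromosome, _, span = key.partition(":")
    start, _, end = span.partition("-")
    return chromosome, int(start), int(end)


def group_labels_by_segment(lines: Iterable[str]) -> Dict[str, List[Repeat]]:
    """Map each segment key to the repeats annotated on that segment."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key, start, end, subtype = line.split()
        groups[key].append(Repeat(int(start), int(end), subtype))
    return dict(groups)


# Made-up example rows in the proposed label format.
example_labels = [
    "chr1:1-100000 150 420 LTR",
    "chr1:1-100000 900 1300 LINE",
    "chr1:100001-200000 100500 100900 LTR",
]
labels = group_labels_by_segment(example_labels)

# A data loader could then look up the labels for a sequence by its header.
segment_labels = labels[normalize_key("chr1: 1 - 100000")]
```

With this layout, fetching the labels for one sequence is a single dictionary lookup on the shared key, rather than a range filter over every label on the chromosome.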

What do you think about that?

williamstark01 commented 2 years ago

We'll take a look at the data and code and come back to this tomorrow.

What other task could you work on until then? (We should create a TODO list with tasks, to make sure you are never blocked from doing work because you are waiting for feedback.)

williamstark01 commented 2 years ago

Handled in #17