EnsemblGSOC / Ensembl-Repeat-Identification

A deep learning repository for predicting the location and type of repeat sequences in a genome.

customizable training dataset #36

Closed williamstark01 closed 2 years ago

williamstark01 commented 2 years ago

Being able to include multiple chromosomes in the dataset will bring us closer to training a production model on all available data.

This could be done by creating a dictionary of {chromosome names: FASTA paths} in RepeatSequenceDataset, processing all of them, and merging their repeats.

The list of chromosomes to include would be a hyperparameter in this case.
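
A minimal sketch of the idea above, not the repository's actual code: `merge_repeats`, the `load_repeats` callback, the `(start, end, repeat_type)` record layout, and the FASTA paths are all hypothetical names chosen for illustration. The chromosome list is passed in as a plain argument, the way a hyperparameter from a config file would be.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical record layout (start, end, repeat_type); the real annotations may differ.
Repeat = Tuple[int, int, str]


def merge_repeats(
    fasta_paths: Dict[str, str],
    chromosomes: List[str],
    load_repeats: Callable[[str, str], List[Repeat]],
) -> List[Tuple[str, int, int, str]]:
    """Process each selected chromosome and merge its repeats into one list."""
    missing = [name for name in chromosomes if name not in fasta_paths]
    if missing:
        raise KeyError(f"no FASTA path configured for: {missing}")

    merged = []
    for name in chromosomes:
        for start, end, repeat_type in load_repeats(name, fasta_paths[name]):
            # Tag each repeat with its chromosome so coordinates stay
            # unambiguous once the per-chromosome lists are merged.
            merged.append((name, start, end, repeat_type))
    return merged


# Stand-in loader for the example; a real one would parse the FASTA and
# annotation files for the given chromosome.
def fake_loader(name: str, path: str) -> List[Repeat]:
    annotations = {
        "chr1": [(100, 250, "LINE"), (900, 1200, "SINE")],
        "chr2": [(40, 90, "LTR")],
    }
    return annotations.get(name, [])


# Dictionary of {chromosome name: FASTA path}; the paths are placeholders.
fasta_paths = {"chr1": "data/chr1.fa", "chr2": "data/chr2.fa", "chrX": "data/chrX.fa"}

# Hyperparameter: which chromosomes to include in the training dataset.
chromosomes = ["chr1", "chr2"]

merged = merge_repeats(fasta_paths, chromosomes, fake_loader)
```

Inside `RepeatSequenceDataset`, the same loop could run in the constructor, with `__len__` and `__getitem__` indexing into the merged list.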

yangtcai commented 2 years ago

Got it! :D