EnsemblGSOC / Ensembl-Repeat-Identification

A Deep Learning repository for predicting the location and type of repeat sequence in genome.

some updates #4

Closed williamstark01 closed 2 years ago

williamstark01 commented 2 years ago

more info at the commits

williamstark01 commented 2 years ago

Just a note, we are using this style guide for Python: https://google.github.io/styleguide/pyguide.html

yangtcai commented 2 years ago

Hi @williamstark01, the updates are great! I tried tqdm() before but could not get it to work well, so your update taught me a lot, haha. BTW, did you notice that in the file generated by generate_ref_fasta.py, many chromosomes contain 'N' in addition to the bases 'A', 'T', 'C', 'G'? Should we drop those positions, or keep them and encode them as 'N'? As for the document you mentioned, I will add it today. I think we can use the cluster for the next step.

williamstark01 commented 2 years ago

Hey Yantong, it's great that you are learning new things!

The process of getting you access to the cluster has been initiated; it shouldn't take long.

It's a good question how to handle masked regions of a genome with "N". The best approach is probably to leave them as they are for now and handle them at the stage of generating subsequences for training. We can encode "N" as a distinct token, but throw away the subsequences that contain too many "N"s, as they won't contain any useful information for the network to learn. It's the same as if we had a very large high-definition photo and we were taking windows from it to generate training samples; if part of the photo was too blurry, we would throw away the windows from that part, but keep the windows that were blurry only at their edges. Does that make sense?
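
To make the idea concrete, here is a rough sketch of what the filtering could look like when generating subsequences. It is only an illustration, not code from the repository: the function name `generate_subsequences`, the `BASE_TO_INDEX` mapping, and the `window_size`, `stride`, and `max_n_fraction` parameters (including the 10% default) are all assumptions made up for this example.

```python
from typing import Iterator, List

# Hypothetical vocabulary: "N" gets its own token alongside the four bases.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def generate_subsequences(
    sequence: str,
    window_size: int,
    stride: int,
    max_n_fraction: float = 0.1,  # assumed threshold, to be tuned
) -> Iterator[List[int]]:
    """Yield integer-encoded windows, skipping those with too many "N"s."""
    sequence = sequence.upper()
    for start in range(0, len(sequence) - window_size + 1, stride):
        window = sequence[start : start + window_size]
        # Count every character that is not one of the four known bases.
        n_count = sum(base not in "ACGT" for base in window)
        if n_count / window_size > max_n_fraction:
            # The "too blurry" case: discard the whole window.
            continue
        # Any character outside the vocabulary is encoded with the "N" token.
        yield [BASE_TO_INDEX.get(base, BASE_TO_INDEX["N"]) for base in window]
```

For example, `list(generate_subsequences("ACGTNNNNACGT", window_size=4, stride=4, max_n_fraction=0.25))` returns `[[0, 1, 2, 3], [0, 1, 2, 3]]`: the all-"N" window in the middle is dropped, while a window such as "ACGN" would be kept at that threshold and encoded as `[0, 1, 2, 4]`.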