williamstark01 closed this 2 years ago
Just a note, we are using this style guide for Python: https://google.github.io/styleguide/pyguide.html
Hi @williamstark01, the updates are great! I had used tqdm() before, but it didn't work well for me; your update taught me a lot. BTW, did you notice that in the file generated by generate_ref_fasta.py, many chromosomes include 'N' as a base in addition to 'A', 'T', 'C', 'G'? I wonder whether we need to drop those positions or just encode them as 'N'. As for the document you mentioned, I will add it today. I think we can use the cluster for the next step.
Hey Yantong, it's great that you are learning new things!
The process of getting you access to the cluster has been initiated; it shouldn't take long.
It's a good question how to handle masked regions of a genome with "N". The best approach is probably to leave them as they are for now and handle them at the stage of generating subsequences for training. We can encode "N" as a distinct token, but throw away the subsequences that contain too many "N"s, as they won't contain any useful information for the network to learn. It's the same as if we had a very large high definition photo and we were taking windows from it to generate training samples; if part of the photo was too blurry, we would throw away the windows from that part, but keep the windows that were blurry only at their edges. Does that make sense?
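To make the idea concrete, here is a minimal sketch of that filtering step. It is not code from the repository: the names `TOKEN_IDS`, `generate_windows`, `encode`, and the `max_n_fraction` threshold are all hypothetical, and the 10% default is just a placeholder we would need to tune.

```python
from typing import Iterator, List

# Hypothetical token mapping: "N" gets its own distinct token.
TOKEN_IDS = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def generate_windows(
    sequence: str,
    window_size: int,
    stride: int,
    max_n_fraction: float = 0.1,  # placeholder threshold, an assumption
) -> Iterator[str]:
    """Yield fixed-size subsequences, skipping those with too many "N"s."""
    # Assumption: soft-masked (lowercase) bases are treated like uppercase.
    sequence = sequence.upper()
    for start in range(0, len(sequence) - window_size + 1, stride):
        window = sequence[start:start + window_size]
        # Discard windows dominated by "N"; keep those with only a few.
        if window.count("N") / window_size > max_n_fraction:
            continue
        yield window


def encode(window: str) -> List[int]:
    """Map each base to its token id."""
    # Assumption: any other symbol (e.g. IUPAC ambiguity codes) is
    # folded into the "N" token rather than raising an error.
    return [TOKEN_IDS.get(base, TOKEN_IDS["N"]) for base in window]
```

For example, `generate_windows("ACGTNNNNNNNNACGA", window_size=4, stride=4, max_n_fraction=0.25)` keeps the first and last windows and drops the two all-"N" windows in the middle, while a window like `"ACGN"` with a single "N" at its edge is kept and encoded with the extra token.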
More info in the commits.