Generate sequence and annotation datasets

yangtcai commented 2 years ago

@williamstark01 PTAL :D

williamstark01 commented 2 years ago

Good work!

Could you please add a description of this task? It's important to document new code, by adding docstrings in functions for example, because sometimes it might not be easy to understand what it does.

I'm also wondering whether the task can be automated more.

For example, it should be easy to download the hg38 genome directly in the Python script (using requests and chunking), and at the same time document the exact source of the FASTA file in the code.
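A rough sketch of what that download could look like, not the project's actual code. The function name, the chunk size, and the optional session argument are illustrative, and the UCSC URL is an assumption about which hg38 FASTA file is meant, so it should be checked against the file actually used:

```python
from pathlib import Path

# Assumed source of the genome FASTA (UCSC); replace with the exact file the project uses.
HG38_URL = "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz"


def download_file(url, destination, chunk_size=1024 * 1024, session=None):
    """Stream `url` to `destination` in chunks and return the number of bytes written.

    `session` is any object with a requests-style `get` method; when it is
    omitted, a `requests.Session` is created.
    """
    if session is None:
        import requests  # third-party; only needed when no session is supplied

        session = requests.Session()

    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)

    bytes_written = 0
    # stream=True keeps the response body out of memory until it is iterated
    with session.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(destination, "wb") as output_file:
            for chunk in response.iter_content(chunk_size=chunk_size):
                output_file.write(chunk)
                bytes_written += len(chunk)
    return bytes_written
```

Calling download_file(HG38_URL, "data/hg38.fa.gz") would fetch the genome one chunk at a time, and keeping the URL in a named constant records the source of the file in the code.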

Also, how about using pybedtools to run the bed generation task from inside the script?
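Something along these lines might work, as a sketch only. The bed generation step itself isn't shown in this thread, so the record fields (chromosome, start, end, name, strand) and both function names are hypothetical, and the coordinates are assumed to be 0-based with an exclusive end already:

```python
def annotations_to_bed(annotations):
    """Convert annotation records to six-column BED text.

    Each record is a dict with the hypothetical keys `chromosome`, `start`,
    `end`, `name` and `strand`; coordinates are assumed to be 0-based with
    an exclusive end, as BED expects.
    """
    lines = []
    for annotation in annotations:
        fields = (
            annotation["chromosome"],
            annotation["start"],
            annotation["end"],
            annotation["name"],
            0,  # score placeholder; BED needs this column before the strand
            annotation["strand"],
        )
        lines.append("\t".join(str(field) for field in fields))
    return "".join(f"{line}\n" for line in lines)


def write_sorted_bed(annotations, output_path):
    """Sort the records with bedtools (through pybedtools) and save them to `output_path`."""
    import pybedtools  # third-party; also needs the bedtools binary on PATH

    bed = pybedtools.BedTool(annotations_to_bed(annotations), from_string=True)
    return bed.sort().saveas(str(output_path))
```

Keeping the text conversion separate from the pybedtools call means the formatting can be checked without having bedtools installed.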

Ideally, the anno.py script will become the script that generates the dataset from start to finish, by just running a single command, maybe as simple as python generate_dataset.py.
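One possible shape for such an entry point, as a sketch under assumptions: the argument names, the default output directory, and the idea of skipping steps whose output file already exists are illustrative choices, not something taken from the existing script.

```python
import argparse
from pathlib import Path


def parse_arguments(argv=None):
    """Parse the command line options of the dataset generation script."""
    parser = argparse.ArgumentParser(
        description="Generate the sequence and annotation datasets."
    )
    parser.add_argument(
        "--output-directory",
        type=Path,
        default=Path("data"),
        help="directory where the generated files are saved",
    )
    parser.add_argument(
        "--force",
        action="store_true",
        help="regenerate files that already exist",
    )
    return parser.parse_args(argv)


def run_pipeline(steps, output_directory, force=False):
    """Run the steps in order and return the list of files that were created.

    `steps` is a list of (filename, function) pairs; each function receives
    the path of the file it should write. A step is skipped when its file
    already exists, unless `force` is set.
    """
    output_directory.mkdir(parents=True, exist_ok=True)
    created = []
    for filename, step in steps:
        target = output_directory / filename
        if target.exists() and not force:
            continue
        step(target)
        created.append(target)
    return created
```

The main function of generate_dataset.py would then call parse_arguments() and pass the real steps (download the genome, generate the bed file, and so on) to run_pipeline, so that a second run only redoes what is missing.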

I also opened an issue for exploring getting the annotations from the Dfam files instead of the API: #3

I hope that's not too many tasks, take your time with them!

yangtcai commented 2 years ago

Hi @williamstark01, I have modified the code to use pybedtools. The next step is to integrate the annotations and the sequences into one file.

williamstark01 commented 2 years ago

Awesome, everything looks like good work! Funny emojis!

Great idea using a config file for hardcoded values.

Before committing changes with git, it's a good habit to format all code with Black. It makes the code significantly more readable once you get used to it. Give it a try!

Did you discover how Poetry works? To install the pybedtools package, for example, you would run poetry add pybedtools and then commit the updated pyproject.toml and poetry.lock. Then everyone can quickly replicate the environment by simply running poetry install. (What OS are you using, by the way?)

For documentation, it's best to add it early, usually before even starting the code implementation. You just write a couple of sentences in the function / class / method docstring as soon as you create it, describing what it will do. It helps a lot with understanding the logic of the program, both for other contributors and for you in the future. It will also help with documenting the process of generating the dataset, training the model, etc., which will be useful when writing the paper later on. Try to go through all functions and add a short description of what they do.
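As a small illustration of the kind of docstring meant here (the function itself is hypothetical, not taken from the repository):

```python
def extract_sequence(genome, chromosome, start, end):
    """Return the subsequence of `chromosome` between `start` and `end`.

    `genome` maps chromosome names to sequence strings. Coordinates are
    0-based and `end` is exclusive, matching the BED convention.
    """
    return genome[chromosome][start:end]
```

The two sentences under the summary line settle the coordinate convention before any code is written, which is exactly the kind of detail that is hard to recover later.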

And I'm thinking it might be worthwhile to add a markdown document in the repo, maybe called log.md, which you would update along with the code implementation and also results of experiments in the future. What do you think?

Have a nice weekend, talk to you next week!

yangtcai commented 2 years ago

Hi @williamstark01, the code has been updated so that a single command generates the sequence and annotation datasets, PTAL ;D

williamstark01 commented 2 years ago

Great coding! One thing you can improve is remembering to commit your work frequently: make a commit as soon as a unit of work is complete. Many small commits are better than a few large ones, because they make the changes easier to track. That will help a lot and save time in reviews!

EnsemblGSOC / Ensembl-Repeat-Identification

Generate sequence and annotation datasets #2