MiraldiLab / maxATAC

Transcription Factor Binding Prediction from ATAC-seq and scATAC-seq with Deep Neural Networks
Apache License 2.0

Refactor. Work in progress #62

michael-kotliar closed 2 years ago

michael-kotliar commented 3 years ago

Questions to answer

  1. Who is going to use our software and how? What is their skill level?
    • Are we targeting the audience that is familiar with running tools from the command line?
    • Do we have any examples of the other similar software that our target audience finds easy to use?
    • How can we match our tool to the skill level of our target audience without overcomplicating the development process?
    • Do we want to split data preprocessing (mapping, etc) and the actual maxATAC functionality into two independent "subprojects", or do we want to have all these steps within one tool?

      Users should have the option to map raw reads from FASTQ files with a dedicated maxATAC command. However, if they have already run mapping elsewhere, they can provide just bigWig (BAM?) files.

  2. Where will our software be run?
    • Are we planning to use our software only on clusters with GPU support or on laptops too? Which functionality should be allowed in each of these cases (preprocess data and train models on the cluster, run prediction on laptop)?
  3. How are we going to ship our software?
    • Where are we planning to keep trained models? How are we going to provide access to them?

      If we create a Docker container for maxATAC, we can keep our models inside that container. If users don't want to run our software from the Docker container, we need to provide a way to download the required models. I see it as a separate command, something like init, which will pull the latest models from somewhere (the location will depend on the size of the model files).

    • Do we have any other data except the actual code and trained models, that should be available for users as well?

      I saw that we have custom blacklisted regions, so we can download them when running the same init command.

    • Are we planning to provide some examples of our software results in an interactive way (free R-Shiny app)? Users (and reviewers as well) should be able to see at least some results before spending time and effort on running the tool themselves. Also, it always looks more impressive if something has an interactive UI (for example, "Bias, robustness, and scalability in differential expression analysis of single-cell RNA-seq data").
    • Are we considering any future integration with existing data processing platforms? Dockstore, SciDAP, Terra, Galaxy, etc.?
    • How can we make the installation routine smooth? Docker/Singularity containers, PyPI package, Snakemake/CWL workflows, etc.?
  4. How can we make the development process fast and stable?
    • Keep the latest working version of the code in the main branch.
    • Keep all the docs in an organized manner within the GitHub repository (later we can use either readthedocs.org or a similar service for rendering our documentation/tutorials).
    • Every new feature is developed in a separate branch and is merged to the main branch through the approved pull request.
    • Every new feature should have at least a simple test so others can use it for pull request approval.
    • Every new feature should have extensive documentation on how it works and why we need it.
    • Organize and keep small input data for tests.
    • Refactor repository to exclude all unnecessary data from the commit history.
    • At least for now, include all required data files as a git submodule, so we can keep it separate from the main code.
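
The init command proposed under question 3 could look roughly like the sketch below. It is only an illustration under stated assumptions: the URLs are placeholders, and the file names, the `~/.maxatac` default directory, and the function names `init_resources`, `build_parser`, and `main` are hypothetical, not part of the current maxATAC code. The download function is passed in as a parameter so the logic can be tested without network access.

```python
import argparse
import os
import urllib.request

# Hypothetical manifest: file name -> download URL.
# These URLs are placeholders, not real maxATAC locations.
RESOURCES = {
    "models.tar.gz": "https://example.org/maxatac/models.tar.gz",
    "hg38_maxatac_blacklist.bed": "https://example.org/maxatac/hg38_maxatac_blacklist.bed",
}


def init_resources(dest_dir, resources=RESOURCES, fetch=urllib.request.urlretrieve):
    """Download every resource that is not already present in dest_dir.

    Returns the list of file paths downloaded by this call.
    """
    os.makedirs(dest_dir, exist_ok=True)
    downloaded = []
    for name, url in resources.items():
        target = os.path.join(dest_dir, name)
        if os.path.exists(target):
            continue  # already fetched by an earlier run
        fetch(url, target)
        downloaded.append(target)
    return downloaded


def build_parser():
    """Command-line parser with a single hypothetical 'init' subcommand."""
    parser = argparse.ArgumentParser(prog="maxatac")
    subparsers = parser.add_subparsers(dest="command", required=True)
    init = subparsers.add_parser("init", help="download models and reference data")
    init.add_argument("--dest", default=os.path.expanduser("~/.maxatac"))
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.command == "init":
        for path in init_resources(args.dest):
            print("downloaded", path)
```

Because files that already exist are skipped, running the command a second time downloads nothing, which keeps it cheap to call from an install script or a container build.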

Things to consider adding to the TODO list: https://github.com/MiraldiLab/maxATAC/projects/1

michael-kotliar commented 3 years ago

These things are hardcoded. What if we want to use hg19 or even something else?

# Genomic resource constants
blacklist_path = os.path.join(os.path.dirname(__file__), "../../data/hg38_maxatac_blacklist.bed")
blacklist_bigwig_path = os.path.join(os.path.dirname(__file__), "../../data/hg38_maxatac_blacklist.bw")
chrom_sizes_path = os.path.join(os.path.dirname(__file__), "../../data/hg38.chrom.sizes")

tacazares commented 3 years ago

These files are hardcoded as the default options that are referenced by the argparser if none are provided. It is probably best to have the user specify these with each run instead of defaulting to a specific reference.
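
One possible middle ground, sketched below, is to require the genome build on every run and derive the resource paths from it, while still letting the user override individual files. This is a sketch under assumptions: the `--genome`, `--data-dir`, `--blacklist`, and `--chrom-sizes` options and the helper names are hypothetical, and it assumes files for other builds (for example hg19) would follow the same naming pattern as the hg38 files above.

```python
import argparse
import os


def genome_resources(genome, data_dir):
    """Build resource paths for a genome build, following the hg38 file naming."""
    return {
        "blacklist": os.path.join(data_dir, f"{genome}_maxatac_blacklist.bed"),
        "blacklist_bigwig": os.path.join(data_dir, f"{genome}_maxatac_blacklist.bw"),
        "chrom_sizes": os.path.join(data_dir, f"{genome}.chrom.sizes"),
    }


def build_parser():
    """Hypothetical options: the genome is required, individual files are optional."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--genome", required=True, help="genome build, e.g. hg38 or hg19")
    parser.add_argument("--data-dir", required=True, help="directory with reference files")
    parser.add_argument("--blacklist", help="override the blacklist BED path")
    parser.add_argument("--chrom-sizes", help="override the chromosome sizes path")
    return parser


def resolve_resources(args):
    """Start from the genome-derived paths, then apply explicit overrides."""
    paths = genome_resources(args.genome, args.data_dir)
    if args.blacklist:
        paths["blacklist"] = args.blacklist
    if args.chrom_sizes:
        paths["chrom_sizes"] = args.chrom_sizes
    return paths
```

With this layout nothing silently defaults to hg38: a run without `--genome` fails at argument parsing instead of using the wrong reference.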

portah commented 3 years ago

Couple comments on original task

  • [ ] Repository is too big. We will need to remove all files that were added here by mistake. Removing the files in a new commit would not reduce the size, because they remain in the commit history. The repository history has to be rewritten (rebased) too!
  • [ ] If possible, add tests and set up some Continuous Integration (CI). CI means deploying it somewhere or doing something after the tests, so KISS: limit this to tests only.
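
Before rewriting history, it helps to know which files actually inflate the repository. The sketch below lists the largest blobs across all commits by combining `git rev-list --objects --all` with `git cat-file --batch-check`. The function names `parse_batch_check` and `largest_blobs` are hypothetical helpers, and the sketch only reports sizes; the removal itself would be done with a history-rewriting tool such as git filter-repo.

```python
import subprocess


def parse_batch_check(output):
    """Parse 'type sha size path' lines and return (size, path) for blobs, largest first."""
    blobs = []
    for line in output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees, and tags
        path = parts[3] if len(parts) == 4 else ""
        blobs.append((int(parts[2]), path))
    return sorted(blobs, reverse=True)


def largest_blobs(repo=".", top=10):
    """Return the `top` largest files ever committed to the repository at `repo`."""
    # Every object reachable from any ref, one "sha path" pair per line.
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Look up type and size for each object; %(rest) echoes the path back.
    sizes = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(sizes)[:top]
```

Calling `largest_blobs("path/to/maxATAC")` on a local clone returns (size in bytes, path) pairs, including files that were deleted in later commits but still live in the history.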

michael-kotliar commented 3 years ago

OK, then for now it's better to keep them in a separate repository and include it as a submodule in the main one. This will allow us to separate the code from the data and avoid creating a new commit in the main repository on each change to the data files.

FaizRizvi commented 3 years ago

We tested a few architectures earlier, but have been sticking with using a dilated convolutional neural network (/GitHub/maxATAC/maxatac/architectures). Is it ok to delete the other archs for publication? (@emiraldi, @dlab-arp)

FaizRizvi commented 3 years ago

@michael-kotliar, we have the same directory in two places: /GitHub/maxATAC/packaging/scripts and /Users/war9qi/Documents/GitHub/maxATAC/data/scripts. Which one should we delete?

FaizRizvi commented 3 years ago

@emiraldi should we remove all quantitative options from maxATAC?

emiraldi commented 3 years ago

For the public codebase, I would remove the quantitative options, but we will want to build on them in the future.

On Wed, Oct 13, 2021 at 10:47 AM FaizRizvi wrote:

@emiraldi should we remove all quantitative options from maxATAC?

-- Emily Miraldi, Ph.D. Assistant Professor Divisions of Immunobiology and Biomedical Informatics Cincinnati Children's Hospital

FaizRizvi commented 3 years ago

@michael-kotliar what is the best way to privately save this work, including the quantitative options and the other architectures, for future development?