Proposed contribution task for Outreachy applicants: Register NCBI assembly Dog10K_Boxer_Tasha

hpages commented 2 years ago

Dog10K_Boxer_Tasha is a Dog assembly available at NCBI: https://www.ncbi.nlm.nih.gov/assembly/GCF_000002285.5/

Note that Dog10K_Boxer_Tasha is the assembly that canFam6, the latest UCSC genome for Dog, is based on. See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find Dog in the UCSC species tree on the left, click on it, then make sure to select the latest Dog Assembly (canFam6). This will display additional information about the canFam6 assembly. In particular, it will indicate which NCBI assembly this genome is based on. This information is in the Accession ID field, which is usually set to a GenBank (GCA_000*.*) or RefSeq (GCF_000*.*) accession number.

Note that many NCBI assemblies are already registered in the GenomeInfoDb package (223 as of October 2022!). The registered_NCBI_assemblies() function in GenomeInfoDb returns the list of all the NCBI assemblies that are currently registered in the package. An important thing to be aware of is that getChromInfoFromNCBI() still works on an unregistered assembly, but in "degraded" mode, that is:

The name of the assembly is not recognized; only lookup by GenBank or RefSeq accession works.
The returned circularity flags are not guaranteed to be accurate. This potential inaccuracy is communicated to the user by placing NAs instead of FALSEs in the circular column of the returned data.frame.

Registering an assembly fixes that. In other words, once an NCBI assembly is registered in GenomeInfoDb, getChromInfoFromNCBI() will recognize its name and return accurate circularity flags.

See ?getChromInfoFromNCBI (after loading GenomeInfoDb) for more information.
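
As a minimal sketch of the "degraded mode" described above: one way to spot it is to look for NAs in the circular column of the returned data frame. The helper name is_degraded() is hypothetical (it is not part of GenomeInfoDb), and the two small data frames are toy stand-ins for real getChromInfoFromNCBI() output.

```r
## Hypothetical helper (not part of GenomeInfoDb): treat a result as
## "degraded" when its 'circular' column contains NAs instead of TRUE/FALSE.
is_degraded <- function(chrom_info) {
    stopifnot(is.data.frame(chrom_info),
              "circular" %in% colnames(chrom_info))
    anyNA(chrom_info$circular)
}

## Toy stand-ins for what getChromInfoFromNCBI() returns (only two of the
## columns are reproduced here, and the rows are made up).
unregistered <- data.frame(SequenceName=c("chr1", "MT"),
                           circular=c(NA, NA))
registered <- data.frame(SequenceName=c("chr1", "MT"),
                         circular=c(FALSE, TRUE))
is_degraded(unregistered)
is_degraded(registered)

## With the real package (needs network access). Guarded so the sketch
## still runs when GenomeInfoDb is not installed or R is non-interactive.
if (requireNamespace("GenomeInfoDb", quietly=TRUE) && interactive()) {
    chrom_info <- GenomeInfoDb::getChromInfoFromNCBI("GCF_000002285.5")
    is_degraded(chrom_info)
}
```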

Registering a new NCBI assembly for an organism that is already supported is only a matter of editing the corresponding file in GenomeInfoDb/inst/registered/NCBI_assemblies/.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

Simplecodez commented 2 years ago

Hi @hpages, please I would like to be assigned this task but can't find the assign button. I have completed the preliminary task.

hpages commented 2 years ago

Hi @Simplecodez,

Were you able to install Linux on your machine? Do you have any questions about the preliminary tasks? Don't hesitate to ask. You can ask me by email or in the #outreachy channel on the community-bioc Slack (please don't ask questions about the Preliminary tasks here, in this issue, so we stay on-topic).

Would you mind choosing the "Register NCBI assembly UCB_Xtro_10.0" issue instead? It's the same as this issue but for a different NCBI assembly. The reason I'm asking this is because another applicant is already working on the first group of tasks. See: https://github.com/Bioconductor/BSgenomeForge/wiki/List-of-contribution-tasks-for-the-Outreachy-application-period

Thanks, H.

Simplecodez commented 2 years ago

Good day sir @hpages, I don't mind. Could you please assign me to that task?

Simplecodez commented 2 years ago

I am also done with the preliminary tasks; I was able to install Linux on my machine.

Priceless-P commented 2 years ago

@hpages, please can you assign this project to me?

hpages commented 2 years ago

Done. There's currently very little information about how to register a new NCBI assembly, sorry. I'll need to improve this. In the meantime I expect that you'll have a lot of questions for me. I'm ready! :wink:

Priceless-P commented 2 years ago

@hpages sure! 😂.

hpages commented 2 years ago

One important link on the NCBI page for any assembly is the link to the "Full sequence report" on the right:

Screenshot from 2022-10-17 11-24-34

The "Full sequence report" is a tab-delimited file describing all the sequences in the assembly. This is the file that getChromInfoFromNCBI() downloads and returns in a data frame. Note that because Dog10K_Boxer_Tasha is not registered yet, you must pass a GenBank or RefSeq assembly accession to getChromInfoFromNCBI():

getChromInfoFromNCBI("Dog10K_Boxer_Tasha")  # does not work at the moment
getChromInfoFromNCBI("GCA_000002285.4")     # works (in degraded mode)
getChromInfoFromNCBI("GCF_000002285.5")     # works (in degraded mode)

See ?getChromInfoFromNCBI for more information.

We can register an NCBI assembly either with its GenBank or its RefSeq assembly accession, but not with both. So we need to choose. It's recommended to compare the two data frames returned by getChromInfoFromNCBI() before we choose. Normally they are identical, but sometimes they are not (this is a rare situation):

If they are identical, then choosing one or the other doesn't really matter. However, if a UCSC genome is based on this assembly (as is the case here), we should use whatever the Accession ID field says on the Genome Browser Gateway page for the UCSC genome.
If they are not identical, then it's a more complicated situation. If this happens, we'll need to identify the differences and try to understand them. Then we'll be able to decide if they matter or not, and choose based on our assessment of the situation.
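
The comparison described above could be sketched as follows. This is only an illustration: differing_columns() is a hypothetical helper, not a GenomeInfoDb function, and the toy data frames (including their sequence lengths) are made up.

```r
## Hypothetical helper: names of the columns that differ between two data
## frames with the same shape and column names. Returns character(0) when
## every column is identical.
differing_columns <- function(x, y) {
    stopifnot(is.data.frame(x), is.data.frame(y))
    if (!identical(dim(x), dim(y)) || !identical(colnames(x), colnames(y)))
        stop("the two data frames differ in shape or column names")
    same <- vapply(colnames(x),
                   function(col) identical(x[[col]], y[[col]]),
                   logical(1))
    colnames(x)[!same]
}

## Toy data (made-up values) standing in for the GenBank and RefSeq results.
genbank <- data.frame(SequenceName=c("chr1", "MT"),
                      SequenceLength=c(100L, 16L))
refseq <- genbank
differing_columns(genbank, refseq)    # character(0)
refseq$SequenceLength[2] <- 17L
differing_columns(genbank, refseq)    # "SequenceLength"

## With the real package (needs network access):
if (requireNamespace("GenomeInfoDb", quietly=TRUE) && interactive()) {
    gb <- GenomeInfoDb::getChromInfoFromNCBI("GCA_000002285.4")
    rs <- GenomeInfoDb::getChromInfoFromNCBI("GCF_000002285.5")
    if (!identical(gb, rs))
        print(differing_columns(gb, rs))
}
```

Calling identical() first answers the common case; the helper only matters in the rare situation where the two data frames disagree.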

Priceless-P commented 2 years ago

Hi @hpages, as you said,

getChromInfoFromNCBI("GCA_000002285.4")     # works (in degraded mode)
getChromInfoFromNCBI("GCF_000002285.5")     # works (in degraded mode)

But after adding these lines of code to NCBI_assemblies/Canis_lupus_familiaris.R

    list(assembly="Dog10K_Boxer_Tasha",
         date="2020/10/06",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.5",  # canFam6
         circ_seqs="chrM")

I expected getChromInfoFromNCBI("Dog10K_Boxer_Tasha") to work too, but it doesn't. What am I missing? How do I get it to download the information on the "Full sequence report" page?

P.S. I didn't find any difference between the GenBank and RefSeq assembly accessions, so I used the Accession ID.

hpages commented 2 years ago

Did you reinstall GenomeInfoDb after editing Canis_lupus_familiaris.R in GenomeInfoDb/inst/registered/NCBI_assemblies/?

Always reinstall the package and load it in a fresh R session to see the effects of your changes. In this particular case, before you even try getChromInfoFromNCBI(), you should check that the data frame returned by registered_NCBI_assemblies() has a new entry for Dog10K_Boxer_Tasha. Check all the fields in the new entry: they should reflect what you've put in Canis_lupus_familiaris.R for Dog10K_Boxer_Tasha.
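
A minimal sketch of that check, under stated assumptions: find_registered() is a hypothetical helper (not part of GenomeInfoDb), and the toy data frame reproduces only two of the columns that registered_NCBI_assemblies() returns.

```r
## Hypothetical helper: pull the row(s) for one assembly name out of the
## data frame returned by registered_NCBI_assemblies().
find_registered <- function(registered, name) {
    registered[registered$assembly %in% name, , drop=FALSE]
}

## Toy stand-in with a subset of the real columns.
registered <- data.frame(
    assembly=c("UU_Cfam_GSD_1.0", "Dog10K_Boxer_Tasha"),
    assembly_accession=c("GCA_011100685.1", "GCF_000002285.5")
)
find_registered(registered, "Dog10K_Boxer_Tasha")
nrow(find_registered(registered, "no_such_assembly"))  # 0: not registered

## With the real package, in a fresh R session after reinstalling:
if (requireNamespace("GenomeInfoDb", quietly=TRUE) && interactive()) {
    dog <- GenomeInfoDb::registered_NCBI_assemblies("Canis lupus familiaris")
    find_registered(dog, "Dog10K_Boxer_Tasha")
}
```

If the last call returns zero rows, the installed package does not contain the edit yet, so there is no point in trying getChromInfoFromNCBI() by name.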

Priceless-P commented 2 years ago

@hpages I reinstalled GenomeInfoDb after the edit and also loaded it in a new session, but registered_NCBI_assemblies() still didn't include Dog10K_Boxer_Tasha. I have done it a number of times with the same result. I also tried the GenBank assembly accession, but that didn't work either.

hpages commented 2 years ago

Where are you putting

    list(assembly="Dog10K_Boxer_Tasha",
         date="2020/10/06",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.5",  # canFam6
         circ_seqs="chrM")

exactly? This should be added to the ASSEMBLIES list in Canis_lupus_familiaris.R. Note that ASSEMBLIES is a list of lists. Currently its length is 5. After you add the new entry for Dog10K_Boxer_Tasha, it will have length 6.
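
To illustrate the "list of lists" structure, here is a small sketch in plain R. It uses a shortened ASSEMBLIES list with a single existing entry, and missing_fields() is a hypothetical helper; the field names are simply the ones used by every entry shown in this thread.

```r
## Fields used by every entry shown in this thread.
FIELDS <- c("assembly", "date", "extra_info",
            "assembly_accession", "circ_seqs")

## Hypothetical helper: names of the fields missing from one entry.
missing_fields <- function(entry) setdiff(FIELDS, names(entry))

## Shortened stand-in for the ASSEMBLIES list (one existing entry only).
ASSEMBLIES <- list(
    list(assembly="UU_Cfam_GSD_1.0",
         date="2020/03/10",
         extra_info=c(breed="German Shepherd"),
         assembly_accession="GCA_011100685.1",  # canFam4
         circ_seqs="chrM")
)

new_entry <- list(assembly="Dog10K_Boxer_Tasha",
                  date="2020/10/06",
                  extra_info=c(breed="boxer"),
                  assembly_accession="GCF_000002285.5",  # canFam6
                  circ_seqs="MT")  # the value settled on later in this thread

## Wrap the entry in list() so it is appended as ONE element; a bare
## c(ASSEMBLIES, new_entry) would splice its five fields into the outer list.
ASSEMBLIES <- c(ASSEMBLIES, list(new_entry))
length(ASSEMBLIES)                   # one more than before
lapply(ASSEMBLIES, missing_fields)   # character(0) for a complete entry
```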

Priceless-P commented 2 years ago

Yes, I added it in the Canis_lupus_familiaris.R file

Here's the full content of the file

ORGANISM <- "Canis lupus familiaris"

### List of assemblies first by breed then by date.
### Yep, different genome assemblies can have the same name! (don't ask me why)
### Lookup by genome name will pick-up the first in the list.
ASSEMBLIES <- list(
    ## breed: boxer
    list(assembly="CanFam2.0",
         date="2005/07/12",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.1",
         circ_seqs=character(0)),

    list(assembly="CanFam2.0",
         date="2005/07/12",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.2",  # canFam2
         circ_seqs="MT"),

    list(assembly="CanFam3.1",
         date="2011/11/02",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.3",  # canFam3
         circ_seqs="MT"),

    list(assembly="UMICH_Zoey_3.1",
         date="2019/05/30",
         extra_info=c(breed="Great Dane"),
         assembly_accession="GCA_005444595.1",  # canFam5
         circ_seqs="chrM"),

    list(assembly="UU_Cfam_GSD_1.0",
         date="2020/03/10",
         extra_info=c(breed="German Shepherd"),
         assembly_accession="GCA_011100685.1",  # canFam4
         circ_seqs="chrM"),

    list(assembly="Dog10K_Boxer_Tasha",
         date="2020/10/06",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.5",  # canFam6
         circ_seqs="chrM")
)

hpages commented 2 years ago

I just copied what you show above in my own Canis_lupus_familiaris.R file, reinstalled GenomeInfoDb, started a fresh R session, loaded GenomeInfoDb (with library(GenomeInfoDb)), and I get:

> registered_NCBI_assemblies("Canis lupus familiaris")
                organism           assembly       date            extra_info
1 Canis lupus familiaris          CanFam2.0 2005/07/12           breed:boxer
2 Canis lupus familiaris          CanFam2.0 2005/07/12           breed:boxer
3 Canis lupus familiaris          CanFam3.1 2011/11/02           breed:boxer
4 Canis lupus familiaris     UMICH_Zoey_3.1 2019/05/30      breed:Great Dane
5 Canis lupus familiaris    UU_Cfam_GSD_1.0 2020/03/10 breed:German Shepherd
6 Canis lupus familiaris Dog10K_Boxer_Tasha 2020/10/06           breed:boxer
  assembly_accession circ_seqs
1    GCF_000002285.1          
2    GCF_000002285.2        MT
3    GCF_000002285.3        MT
4    GCA_005444595.1      chrM
5    GCA_011100685.1      chrM
6    GCF_000002285.5      chrM

I don't understand why this doesn't work for you.

Can you commit and push your changes to your fork so I can look at this? Thanks

Priceless-P commented 2 years ago

Then it must be from my end. I will keep trying it.

Here's my fork with the changes I made. https://github.com/Priceless-P/GenomeInfoDb/tree/Dog10K_Boxer_Tasha

hpages commented 2 years ago

Your fork works fine for me. Here is a transcript of what I did (I do everything in a terminal):

hpages@spectre:~/github/Priceless-P$ git clone https://github.com/Priceless-P/GenomeInfoDb.git
Cloning into 'GenomeInfoDb'...
remote: Enumerating objects: 3301, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 3301 (delta 22), reused 20 (delta 20), pack-reused 3271
Receiving objects: 100% (3301/3301), 84.67 MiB | 15.12 MiB/s, done.
Resolving deltas: 100% (2353/2353), done.

hpages@spectre:~/github/Priceless-P$ cd GenomeInfoDb

hpages@spectre:~/github/Priceless-P/GenomeInfoDb$ git checkout Dog10K_Boxer_Tasha
Branch 'Dog10K_Boxer_Tasha' set up to track remote branch 'Dog10K_Boxer_Tasha' from 'origin'.
Switched to a new branch 'Dog10K_Boxer_Tasha'

hpages@spectre:~/github/Priceless-P/GenomeInfoDb$ tail inst/registered/NCBI_assemblies/Canis_lupus_familiaris.R 
         extra_info=c(breed="German Shepherd"),
         assembly_accession="GCA_011100685.1",  # canFam4
         circ_seqs="chrM"),

    list(assembly="Dog10K_Boxer_Tasha",
         date="2020/10/06",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.5",  # canFam6
         circ_seqs="chrM")
)

hpages@spectre:~/github/Priceless-P/GenomeInfoDb$ R CMD INSTALL .
* installing to library ‘/home/hpages/R/R-4.2.r82318/library’
* installing *source* package ‘GenomeInfoDb’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (GenomeInfoDb)

hpages@spectre:~/github/Priceless-P/GenomeInfoDb$ R

R version 4.2.0 Patched (2022-05-04 r82318) -- "Vigorous Calisthenics"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(GenomeInfoDb)
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: IRanges

> registered_NCBI_assemblies("Canis lupus familiaris")
                organism           assembly       date            extra_info
1 Canis lupus familiaris          CanFam2.0 2005/07/12           breed:boxer
2 Canis lupus familiaris          CanFam2.0 2005/07/12           breed:boxer
3 Canis lupus familiaris          CanFam3.1 2011/11/02           breed:boxer
4 Canis lupus familiaris     UMICH_Zoey_3.1 2019/05/30      breed:Great Dane
5 Canis lupus familiaris    UU_Cfam_GSD_1.0 2020/03/10 breed:German Shepherd
6 Canis lupus familiaris Dog10K_Boxer_Tasha 2020/10/06           breed:boxer
  assembly_accession circ_seqs
1    GCF_000002285.1          
2    GCF_000002285.2        MT
3    GCF_000002285.3        MT
4    GCA_005444595.1      chrM
5    GCA_011100685.1      chrM
6    GCF_000002285.5      chrM

As you can see: no problem! Can you perform those exact commands in a terminal?

Priceless-P commented 2 years ago

It works now.💃 Thanks a lot, @hpages! I missed R CMD INSTALL earlier.

I ran getChromInfoFromNCBI("Dog10K_Boxer_Tasha") and noticed that circ_seqs should not be chrM. I checked here and saw that it should be MT instead, so I corrected it. I will be opening a PR now.

hpages commented 2 years ago

I noticed that circ_seqs should not be chrM. I checked here and saw that it should be MT instead, so I corrected it.

Note that you can also see this by looking at the "Full sequence report" for Dog10K_Boxer_Tasha here. Mitochondrion is usually at the bottom of the file.
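
A quick way to catch this kind of mismatch could look like the sketch below. It assumes that circ_seqs is meant to hold names exactly as they appear in the SequenceName column; unknown_circ_seqs() is a hypothetical helper, and apart from MT the sequence names in the toy data frame are made up.

```r
## Hypothetical helper: which of the declared circular sequences do NOT
## appear in the SequenceName column of the chrom info data frame?
unknown_circ_seqs <- function(chrom_info, circ_seqs) {
    setdiff(circ_seqs, chrom_info$SequenceName)
}

## Toy stand-in for the bottom of the sequence report (names only).
chrom_info <- data.frame(SequenceName=c("chr38", "chrX", "MT"))
unknown_circ_seqs(chrom_info, "chrM")  # "chrM": not a sequence name here
unknown_circ_seqs(chrom_info, "MT")    # character(0): OK

## With the real package (needs network access), tail() shows the last
## rows, where the mitochondrion usually sits:
if (requireNamespace("GenomeInfoDb", quietly=TRUE) && interactive()) {
    real_info <- GenomeInfoDb::getChromInfoFromNCBI("GCF_000002285.5")
    print(tail(real_info))
    unknown_circ_seqs(real_info, "MT")
}
```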

Priceless-P commented 2 years ago

Okay. I have noted that. Thank you @hpages

hpages commented 2 years ago

PR #53 merged, thanks @Priceless-P !

Next task in your group is issue #45. Whenever you are ready, go there and ask me to assign you.

Bioconductor / GenomeInfoDb

Proposed contribution task for Outreachy applicants: Register NCBI assembly Dog10K_Boxer_Tasha #44