Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb

Proposed contribution task for Outreachy applicants: Register NCBI assembly Dog10K_Boxer_Tasha #44

Closed: hpages closed this issue 2 years ago

hpages commented 2 years ago

Dog10K_Boxer_Tasha is a Dog assembly available at NCBI: https://www.ncbi.nlm.nih.gov/assembly/GCF_000002285.5/

Note that Dog10K_Boxer_Tasha is the assembly that canFam6, the latest UCSC genome for Dog, is based on. See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find Dog in the UCSC species tree on the left, click on it, then make sure to select the latest Dog Assembly (canFam6). This will display a bunch of additional information about the canFam6 assembly. In particular, it will indicate what NCBI assembly this genome is based on. This information is the Accession ID field. This field is usually set to a GenBank (GCA_000*.*) or RefSeq (GCF_000*.*) accession number.

Note that many NCBI assemblies are already registered in the GenomeInfoDb package (223 as of October 2022!). The registered_NCBI_assemblies() function in GenomeInfoDb returns the list of all the NCBI assemblies that are currently registered in the package. An important thing to be aware of is that getChromInfoFromNCBI() still works on an unregistered assembly, but in "degraded" mode, that is: the assembly can only be specified by its GenBank or RefSeq accession (not by its name), and the returned circularity flags are not accurate.

Registering an assembly fixes that. In other words, once an NCBI assembly is registered in GenomeInfoDb, getChromInfoFromNCBI() will recognize its name and return accurate circularity flags.

See ?getChromInfoFromNCBI (after loading GenomeInfoDb) for more information.

Registering a new NCBI assembly for an organism that is already supported is only a matter of editing the corresponding file in GenomeInfoDb/inst/registered/NCBI_assemblies/.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Simplecodez commented 2 years ago

Hi @hpages, please I would like to be assigned this task but can't find the assign button. I have completed the preliminary task.

hpages commented 2 years ago

Hi @Simplecodez,

Were you able to install Linux on your machine? Do you have any questions about the preliminary tasks? Don't hesitate to ask. You can ask me by email or in the #outreachy channel on the community-bioc Slack (to keep this issue on-topic, please don't ask questions about the preliminary tasks here).

Would you mind choosing the "Register NCBI assembly UCB_Xtro_10.0" issue instead? It's the same as this issue but for a different NCBI assembly. The reason I'm asking this is because another applicant is already working on the first group of tasks. See: https://github.com/Bioconductor/BSgenomeForge/wiki/List-of-contribution-tasks-for-the-Outreachy-application-period

Thanks, H.

Simplecodez commented 2 years ago

Good day sir @hpages, I don't mind. Could you please assign me to that task?

Simplecodez commented 2 years ago

I am also done with the preliminary tasks, and I was able to install Linux on my machine.

Priceless-P commented 2 years ago

@hpages, please can you assign this project to me?

hpages commented 2 years ago

Done. There's currently very little information about how to register a new NCBI assembly, sorry. I'll need to improve this. In the meantime I expect that you'll have a lot of questions for me. I'm ready! :wink:

Priceless-P commented 2 years ago

@hpages sure! 😂.

hpages commented 2 years ago

One important link on the NCBI page for any assembly is the link to the "Full sequence report" on the right:

[Image: screenshot from 2022-10-17 11-24-34]

The "Full sequence report" is a tab-delimited file describing all the sequences in the assembly. This is the file that getChromInfoFromNCBI() downloads and returns in a data frame. Note that because Dog10K_Boxer_Tasha is not registered yet, you must pass a GenBank or RefSeq assembly accession to getChromInfoFromNCBI():

getChromInfoFromNCBI("Dog10K_Boxer_Tasha")  # does not work at the moment
getChromInfoFromNCBI("GCA_000002285.4")     # works (in degraded mode)
getChromInfoFromNCBI("GCF_000002285.5")     # works (in degraded mode)

See ?getChromInfoFromNCBI for more information.

We can register an NCBI assembly either with its GenBank or its RefSeq assembly accession, but not with both. So we need to choose. It's recommended to compare the two data frames returned by getChromInfoFromNCBI() before we choose. Normally they are identical, but sometimes they are not (this is a rare situation):
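A minimal sketch of such a comparison, in base R. The helper compare_chrom_info() is hypothetical (it is not part of GenomeInfoDb), and the two toy data frames use made-up values with assumed column names so that the sketch runs without the package or network access:

```r
## Hypothetical helper (not part of GenomeInfoDb): report whether two
## chrom info data frames are identical and, if not, which columns differ.
compare_chrom_info <- function(x, y) {
    stopifnot(is.data.frame(x), is.data.frame(y))
    ## Different shapes or column names: no column-wise comparison possible.
    if (!identical(dim(x), dim(y)) || !identical(colnames(x), colnames(y)))
        return(list(identical=FALSE, differing_columns=NA_character_))
    differs <- vapply(colnames(x),
                      function(col) !identical(x[[col]], y[[col]]),
                      logical(1))
    list(identical=!any(differs), differing_columns=colnames(x)[differs])
}

## Toy stand-ins for the two data frames (made-up values):
genbank <- data.frame(SequenceName=c("chr1", "MT"),
                      SequenceLength=c(1000L, 16L))
refseq <- genbank
refseq$SequenceLength[2] <- 17L

res <- compare_chrom_info(genbank, refseq)
print(res)

## With GenomeInfoDb installed and network access, the real inputs would be:
##   genbank <- getChromInfoFromNCBI("GCA_000002285.4")
##   refseq  <- getChromInfoFromNCBI("GCF_000002285.5")
```

If the two results are identical, either accession can be used for the registration; if not, the differing columns show where to look before choosing.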

Priceless-P commented 2 years ago

Hi @hpages, as you said,

getChromInfoFromNCBI("GCA_000002285.4")     # works (in degraded mode)
getChromInfoFromNCBI("GCF_000002285.5")     # works (in degraded mode)

But after adding these lines of code to NCBI_assemblies/Canis_lupus_familiaris.R

list(assembly="Dog10K_Boxer_Tasha",
     date="2020/10/06",
     extra_info=c(breed="boxer"),
     assembly_accession="GCF_000002285.5",  # canFam6
     circ_seqs="chrM")

I expected that getChromInfoFromNCBI("Dog10K_Boxer_Tasha") would also work, but it doesn't. What am I missing? How do I get it to download the information on the "Full sequence report" page?

P.S.: I didn't find any difference between the GenBank and RefSeq assembly accessions, so I used the Accession ID.

hpages commented 2 years ago

Did you reinstall GenomeInfoDb after editing Canis_lupus_familiaris.R in GenomeInfoDb/inst/registered/NCBI_assemblies/?

Always reinstall the package and load it in a fresh R session to see the effects of your changes. In this particular case, before you even try getChromInfoFromNCBI(), you should check that the data frame returned by registered_NCBI_assemblies() has a new entry for Dog10K_Boxer_Tasha. Check all the fields in the new entry: they should reflect what you've put in Canis_lupus_familiaris.R for Dog10K_Boxer_Tasha.
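As a rough sketch, that check can be written as follows. The helper is_registered() is hypothetical (not part of GenomeInfoDb), and the toy data frame only mimics a few columns of the registered_NCBI_assemblies() output, so the sketch runs without the package:

```r
## Hypothetical helper: TRUE if 'name' appears in the 'assembly' column of
## a data frame shaped like the output of registered_NCBI_assemblies().
is_registered <- function(regs, name) {
    stopifnot(is.data.frame(regs), "assembly" %in% colnames(regs))
    name %in% regs$assembly
}

## Toy stand-in so the sketch runs without GenomeInfoDb:
regs <- data.frame(
    organism="Canis lupus familiaris",
    assembly=c("CanFam3.1", "Dog10K_Boxer_Tasha"),
    assembly_accession=c("GCF_000002285.3", "GCF_000002285.5")
)

## Is the new entry there, and what do its fields look like?
print(is_registered(regs, "Dog10K_Boxer_Tasha"))
print(regs[regs$assembly == "Dog10K_Boxer_Tasha", ])

## In a fresh R session after reinstalling, the real input would be:
##   library(GenomeInfoDb)
##   regs <- registered_NCBI_assemblies("Canis lupus familiaris")
```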

Priceless-P commented 2 years ago

@hpages I reinstalled GenomeInfoDb after the edit and loaded it in a new session, but registered_NCBI_assemblies() still didn't include Dog10K_Boxer_Tasha. I have tried this a number of times with the same result. I also tried using the GenBank assembly accession, and that didn't work either.

hpages commented 2 years ago

Where are you putting

    list(assembly="Dog10K_Boxer_Tasha",
         date="2020/10/06",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.5",  # canFam6
         circ_seqs="chrM")

exactly? This should be added to the ASSEMBLIES list in Canis_lupus_familiaris.R. Note that ASSEMBLIES is a list of lists. Currently its length is 5. After you add the new entry for Dog10K_Boxer_Tasha, it will have length 6.
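One way to check the file itself, before reinstalling, is to evaluate it in a separate environment and look at ASSEMBLIES. This is only a sketch: it writes a cut-down, two-entry copy of the file to a temporary location (the entries are the ones posted in this thread) so that it runs anywhere. For the real file, reg_file would point to inst/registered/NCBI_assemblies/Canis_lupus_familiaris.R and the expected length would be 6.

```r
## Write a cut-down registration file to a temporary location. The two
## entries are copied from this thread; the temp file is only for the demo.
reg_file <- tempfile(fileext=".R")
writeLines(c(
    'ORGANISM <- "Canis lupus familiaris"',
    'ASSEMBLIES <- list(',
    '    list(assembly="UU_Cfam_GSD_1.0",',
    '         date="2020/03/10",',
    '         extra_info=c(breed="German Shepherd"),',
    '         assembly_accession="GCA_011100685.1",',
    '         circ_seqs="chrM"),',
    '',
    '    list(assembly="Dog10K_Boxer_Tasha",',
    '         date="2020/10/06",',
    '         extra_info=c(breed="boxer"),',
    '         assembly_accession="GCF_000002285.5",',
    '         circ_seqs="chrM")',
    ')'
), reg_file)

## Evaluate the file in its own environment and inspect ASSEMBLIES.
env <- new.env()
sys.source(reg_file, envir=env)
print(length(env$ASSEMBLIES))
assembly_names <- vapply(env$ASSEMBLIES, function(a) a$assembly, character(1))
print(assembly_names)
```

A parse error here (for example a missing comma between two entries) or a missing name in the output would point to a problem in the file rather than in the installation.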

Priceless-P commented 2 years ago

Yes, I added it in the Canis_lupus_familiaris.R file

Here's the full content of the file

ORGANISM <- "Canis lupus familiaris"

### List of assemblies first by breed then by date.
### Yep, different genome assemblies can have the same name! (don't ask me why)
### Lookup by genome name will pick-up the first in the list.
ASSEMBLIES <- list(
    ## breed: boxer
    list(assembly="CanFam2.0",
         date="2005/07/12",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.1",
         circ_seqs=character(0)),

    list(assembly="CanFam2.0",
         date="2005/07/12",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.2",  # canFam2
         circ_seqs="MT"),

    list(assembly="CanFam3.1",
         date="2011/11/02",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.3",  # canFam3
         circ_seqs="MT"),

    list(assembly="UMICH_Zoey_3.1",
         date="2019/05/30",
         extra_info=c(breed="Great Dane"),
         assembly_accession="GCA_005444595.1",  # canFam5
         circ_seqs="chrM"),

    list(assembly="UU_Cfam_GSD_1.0",
         date="2020/03/10",
         extra_info=c(breed="German Shepherd"),
         assembly_accession="GCA_011100685.1",  # canFam4
         circ_seqs="chrM"),

    list(assembly="Dog10K_Boxer_Tasha",
         date="2020/10/06",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.5",  # canFam6
         circ_seqs="chrM")
)

hpages commented 2 years ago

I just copied what you show above into my own Canis_lupus_familiaris.R file, reinstalled GenomeInfoDb, started a fresh R session, loaded GenomeInfoDb (with library(GenomeInfoDb)), and I get:

> registered_NCBI_assemblies("Canis lupus familiaris")
                organism           assembly       date            extra_info
1 Canis lupus familiaris          CanFam2.0 2005/07/12           breed:boxer
2 Canis lupus familiaris          CanFam2.0 2005/07/12           breed:boxer
3 Canis lupus familiaris          CanFam3.1 2011/11/02           breed:boxer
4 Canis lupus familiaris     UMICH_Zoey_3.1 2019/05/30      breed:Great Dane
5 Canis lupus familiaris    UU_Cfam_GSD_1.0 2020/03/10 breed:German Shepherd
6 Canis lupus familiaris Dog10K_Boxer_Tasha 2020/10/06           breed:boxer
  assembly_accession circ_seqs
1    GCF_000002285.1          
2    GCF_000002285.2        MT
3    GCF_000002285.3        MT
4    GCA_005444595.1      chrM
5    GCA_011100685.1      chrM
6    GCF_000002285.5      chrM

I don't understand why this doesn't work for you.

Can you commit and push your changes to your fork so I can look at this? Thanks

Priceless-P commented 2 years ago

Then it must be from my end. I will keep trying it.

Here's my fork with the changes I made. https://github.com/Priceless-P/GenomeInfoDb/tree/Dog10K_Boxer_Tasha

hpages commented 2 years ago

Your fork works fine for me. Here is a transcript of what I did (I do everything in a terminal):

hpages@spectre:~/github/Priceless-P$ git clone https://github.com/Priceless-P/GenomeInfoDb.git
Cloning into 'GenomeInfoDb'...
remote: Enumerating objects: 3301, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 3301 (delta 22), reused 20 (delta 20), pack-reused 3271
Receiving objects: 100% (3301/3301), 84.67 MiB | 15.12 MiB/s, done.
Resolving deltas: 100% (2353/2353), done.

hpages@spectre:~/github/Priceless-P$ cd GenomeInfoDb

hpages@spectre:~/github/Priceless-P/GenomeInfoDb$ git checkout Dog10K_Boxer_Tasha
Branch 'Dog10K_Boxer_Tasha' set up to track remote branch 'Dog10K_Boxer_Tasha' from 'origin'.
Switched to a new branch 'Dog10K_Boxer_Tasha'

hpages@spectre:~/github/Priceless-P/GenomeInfoDb$ tail inst/registered/NCBI_assemblies/Canis_lupus_familiaris.R 
         extra_info=c(breed="German Shepherd"),
         assembly_accession="GCA_011100685.1",  # canFam4
         circ_seqs="chrM"),

    list(assembly="Dog10K_Boxer_Tasha",
         date="2020/10/06",
         extra_info=c(breed="boxer"),
         assembly_accession="GCF_000002285.5",  # canFam6
         circ_seqs="chrM")
)

hpages@spectre:~/github/Priceless-P/GenomeInfoDb$ R CMD INSTALL .
* installing to library ‘/home/hpages/R/R-4.2.r82318/library’
* installing *source* package ‘GenomeInfoDb’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (GenomeInfoDb)

hpages@spectre:~/github/Priceless-P/GenomeInfoDb$ R

R version 4.2.0 Patched (2022-05-04 r82318) -- "Vigorous Calisthenics"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(GenomeInfoDb)
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: IRanges

> registered_NCBI_assemblies("Canis lupus familiaris")
                organism           assembly       date            extra_info
1 Canis lupus familiaris          CanFam2.0 2005/07/12           breed:boxer
2 Canis lupus familiaris          CanFam2.0 2005/07/12           breed:boxer
3 Canis lupus familiaris          CanFam3.1 2011/11/02           breed:boxer
4 Canis lupus familiaris     UMICH_Zoey_3.1 2019/05/30      breed:Great Dane
5 Canis lupus familiaris    UU_Cfam_GSD_1.0 2020/03/10 breed:German Shepherd
6 Canis lupus familiaris Dog10K_Boxer_Tasha 2020/10/06           breed:boxer
  assembly_accession circ_seqs
1    GCF_000002285.1          
2    GCF_000002285.2        MT
3    GCF_000002285.3        MT
4    GCA_005444595.1      chrM
5    GCA_011100685.1      chrM
6    GCF_000002285.5      chrM

As you can see: no problem! Can you perform those exact commands in a terminal?

Priceless-P commented 2 years ago

It works now. 💃 Thanks a lot, @hpages! I had missed the R CMD INSTALL step earlier.

I ran getChromInfoFromNCBI("Dog10K_Boxer_Tasha") and noticed that circ_seqs should not be chrM. I checked here and saw that it should be MT instead, so I corrected it. I will open a PR now.

hpages commented 2 years ago

"I noticed that circ_seqs should not be chrM so i checked here I saw it should be MT instead so I corrected it."

Note that you can also see this by looking at the "Full sequence report" for Dog10K_Boxer_Tasha here. Mitochondrion is usually at the bottom of the file.
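For illustration, here is a small sketch of how the mitochondrial sequence name could be pulled out of such a report with base R. The report lines below are made up and show only a subset of the real columns; the column names passed to read.table() are my own labels, not official ones:

```r
## Made-up, cut-down report in the NCBI assembly report layout: '#' comment
## lines, then tab-delimited rows. Values are illustrative, not real data.
report_lines <- c(
    "# Assembly name:  ExampleAssembly",
    paste(c("# Sequence-Name", "Sequence-Role", "Assigned-Molecule",
            "Assigned-Molecule-Location/Type", "Sequence-Length"),
          collapse="\t"),
    paste(c("chr1", "assembled-molecule", "1", "Chromosome", "1000"),
          collapse="\t"),
    paste(c("MT", "assembled-molecule", "MT", "Mitochondrion", "16"),
          collapse="\t")
)

## Drop comment lines, then parse the remaining rows as a tab-delimited table.
data_lines <- report_lines[!startsWith(report_lines, "#")]
report <- read.table(text=data_lines, sep="\t", header=FALSE,
                     stringsAsFactors=FALSE,
                     col.names=c("SequenceName", "SequenceRole",
                                 "AssignedMolecule", "LocationType",
                                 "SequenceLength"))

## The mitochondrial row gives the name to use in circ_seqs.
mito <- report$SequenceName[report$LocationType == "Mitochondrion"]
print(mito)
```

Whatever name the report uses for the mitochondrial sequence is the one that belongs in circ_seqs, which is why chrM had to become MT for this assembly.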

Priceless-P commented 2 years ago

Okay. I have noted that. Thank you @hpages

hpages commented 2 years ago

PR #53 merged, thanks @Priceless-P !

Next task in your group is issue #45. Whenever you are ready, go there and ask me to assign you.