Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb
31 stars 13 forks source link

Proposed contribution task for Outreachy applicants: Register NCBI assembly Felis_catus_9.0 #50

Closed hpages closed 2 years ago

hpages commented 2 years ago

Felis_catus_9.0 is a Cat assembly available at NCBI: https://www.ncbi.nlm.nih.gov/assembly/GCF_000181335.3/

Note that Felis_catus_9.0 is the assembly that felCat9, the latest UCSC genome for Cat, is based on. See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find Cat in the UCSC species tree on the left, click on it, then make sure to select the latest Cat Assembly (felCat9). This will display a bunch of additional information about the felCat9 assembly. In particular, it will indicate what NCBI assembly this genome is based on. This information is the Accession ID field. This field is usually set to a GenBank (GCA_000*.*) or RefSeq (GCF_000*.*) accession number.

Note that many NCBI assemblies are already registered in the GenomeInfoDb package (223 as of October 2022!). The registered_NCBI_assemblies() function in GenomeInfoDb returns the list of all the NCBI assemblies that are currently registered in the package. An important thing to be aware of is that getChromInfoFromNCBI() still works on an unregistered assembly, but in "degraded" mode, that is:

Registering an assembly fixes that. In other words, once an NCBI assembly is registered in GenomeInfoDb, getChromInfoFromNCBI() will recognize its name and return accurate circularity flags.

See ?getChromInfoFromNCBI (after loading GenomeInfoDb) for more information.

Registering a new NCBI assembly for an organism that is already supported is only a matter of editing the corresponding file in GenomeInfoDb/inst/registered/NCBI_assemblies/. If this is a new organism, then we need to start a new file. See the other files for the naming scheme: the name of the file must be the full scientific name of the organism, with the underscore used as separator, and with the first letter capitalized. File extension must be .R.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Iman-L commented 2 years ago

Hello, I would like to work on this task

hpages commented 2 years ago

Hi @Linjelium11, these tasks are for Outreachy applicants. Are you an Outreachy applicant? If so, please contact me using my contact information provided on the Outreachy site and introduce yourself. Looking forward to hear from you. Thank you!

Iman-L commented 2 years ago

Hello, this is Iman again. I would like to be assigned to this task.

hpages commented 2 years ago

@Iman-L This task is part of the "Cat" group. You'll be working on the "Gorilla" group. But first see the email I'm about to send you. Thanks

kakopo commented 2 years ago

Hello @hpages could I please be assigned this task?

hpages commented 2 years ago

Done. I'm ready to answer your questions, don't hesitate!

kakopo commented 2 years ago

Awesome!

kakopo commented 2 years ago

Hi @hpages. I am (tentatively) done with this task. I ran into a few issues, at least regarding unexpected output, but thankfully, I was able to refer to your conversation with @Priceless-P regarding issue #44. Turns out, I was facing the same exact problem and it helped. (Prisca has also assisted me tremendously via Slack). Everything prior to testing the package is as expected (via command line). Running R CMD build worked fine, but R CMD check GenomeInfoDb_1.33.13.tar.gz --no-tests brings about this output

* checking examples ... ERROR
Running examples in ‘GenomeInfoDb-Ex.R’ failed
The error most likely occurred in:

> ### Name: seqlevelsStyle
> ### Title: Conveniently rename the seqlevels of an object according to a
> ###   given style
> ### Aliases: seqlevelsStyle seqlevelsStyle<- seqlevelsStyle,ANY-method
> ###   seqlevelsStyle<-,ANY-method seqlevelsStyle,Seqinfo-method
> ###   seqlevelsStyle<-,Seqinfo-method seqlevelsStyle,character-method
> ###   seqlevelsStyle<-,character-method genomeStyles extractSeqlevels
> ###   extractSeqlevelsByGroup mapSeqlevels seqlevelsInGroup
> 
> ### ** Examples
> 
> ## ---------------------------------------------------------------------
> ## seqlevelsStyle() getter and setter
> ## ---------------------------------------------------------------------
> 
> ## On a character vector:
> x <- paste0("chr", 1:5)
> seqlevelsStyle(x)
[1] "UCSC"
> seqlevelsStyle(x) <- "NCBI"
> x
[1] "1" "2" "3" "4" "5"
> 
> ## On a GRanges object:
> library(GenomicRanges)
> gr <- GRanges(rep(c("chr2", "chr3", "chrM"), 2), IRanges(1:6, 10))
> 
> seqlevelsStyle(gr)
[1] "UCSC"
> seqlevelsStyle(gr) <- "NCBI"
> gr
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        2      1-10      *
  [2]        3      2-10      *
  [3]       MT      3-10      *
  [4]        2      4-10      *
  [5]        3      5-10      *
  [6]       MT      6-10      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
> 
> seqlevelsStyle(gr)
[1] "NCBI"    "Ensembl"
> seqlevelsStyle(gr) <- "dbSNP"
> gr
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]      ch2      1-10      *
  [2]      ch3      2-10      *
  [3]     chMT      3-10      *
  [4]      ch2      4-10      *
  [5]      ch3      5-10      *
  [6]     chMT      6-10      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
> 
> seqlevelsStyle(gr)
[1] "dbSNP"
> seqlevelsStyle(gr) <- "UCSC"
> gr
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2      1-10      *
  [2]     chr3      2-10      *
  [3]     chrM      3-10      *
  [4]     chr2      4-10      *
  [5]     chr3      5-10      *
  [6]     chrM      6-10      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
> 
> ## In general the seqlevelsStyle() setter doesn't know how to rename
> ## scaffolds. However, if the genome is specified, it's very likely
> ## that seqlevelsStyle() will be able to take advantage of that:
> gr <- GRanges(rep(c("2", "Y", "Hs6_111610_36"), 2), IRanges(1:6, 10))
> genome(gr) <- "NCBI36"
> seqlevelsStyle(gr) <- "UCSC"
> gr
GRanges object with 6 ranges and 0 metadata columns:
           seqnames    ranges strand
              <Rle> <IRanges>  <Rle>
  [1]          chr2      1-10      *
  [2]          chrY      2-10      *
  [3] chr6_cox_hap1      3-10      *
  [4]          chr2      4-10      *
  [5]          chrY      5-10      *
  [6] chr6_cox_hap1      6-10      *
  -------
  seqinfo: 3 sequences from hg18 genome; no seqlengths
> 
> ## On a Seqinfo object:
> si <- si0 <- Seqinfo(genome="apiMel2")
> si
Seqinfo object with 17 sequences from apiMel2 genome:
  seqnames seqlengths isCircular  genome
  Group1     25433307      FALSE apiMel2
  Group2     14233499      FALSE apiMel2
  Group3     11790680      FALSE apiMel2
  Group4     10665743      FALSE apiMel2
  Group5     13312070      FALSE apiMel2
  ...             ...        ...     ...
  Group13     8846036      FALSE apiMel2
  Group14     8017760      FALSE apiMel2
  Group15     7326077      FALSE apiMel2
  Group16     5608962      FALSE apiMel2
  GroupUn   321143811      FALSE apiMel2
> seqlevelsStyle(si) <- "NCBI"
Warning in (function (seqlevels, genome, new_style)  :
  cannot switch some of apiMel2's seqlevels from UCSC to NCBI style
> si
Seqinfo object with 17 sequences from 2 genomes (Amel_2.0, apiMel2):
  seqnames seqlengths isCircular   genome
  LG1        25433307      FALSE Amel_2.0
  LG2        14233499      FALSE Amel_2.0
  LG3        11790680      FALSE Amel_2.0
  LG4        10665743      FALSE Amel_2.0
  LG5        13312070      FALSE Amel_2.0
  ...             ...        ...      ...
  LG13        8846036      FALSE Amel_2.0
  LG14        8017760      FALSE Amel_2.0
  LG15        7326077      FALSE Amel_2.0
  LG16        5608962      FALSE Amel_2.0
  GroupUn   321143811      FALSE  apiMel2
> seqlevelsStyle(si) <- "RefSeq"
Warning in (function (seqlevels, genome, new_style)  :
  cannot switch apiMel2's seqlevels from UCSC to RefSeq style
> si
Seqinfo object with 17 sequences from 2 genomes (Amel_2.0, apiMel2):
  seqnames    seqlengths isCircular   genome
  NC_007070.1   25433307      FALSE Amel_2.0
  NC_007071.1   14233499      FALSE Amel_2.0
  NC_007072.1   11790680      FALSE Amel_2.0
  NC_007073.1   10665743      FALSE Amel_2.0
  NC_007074.1   13312070      FALSE Amel_2.0
  ...                ...        ...      ...
  NC_007082.1    8846036      FALSE Amel_2.0
  NC_007083.1    8017760      FALSE Amel_2.0
  NC_007084.1    7326077      FALSE Amel_2.0
  NC_007085.1    5608962      FALSE Amel_2.0
  GroupUn      321143811      FALSE  apiMel2
> seqlevelsStyle(si) <- "UCSC"
> stopifnot(identical(si0, si))
> 
> si <- si0 <- Seqinfo(genome="WBcel235")
> si
Seqinfo object with 7 sequences (1 circular) from WBcel235 genome:
  seqnames seqlengths isCircular   genome
  I          15072434      FALSE WBcel235
  II         15279421      FALSE WBcel235
  III        13783801      FALSE WBcel235
  IV         17493829      FALSE WBcel235
  V          20924180      FALSE WBcel235
  X          17718942      FALSE WBcel235
  MT            13794       TRUE WBcel235
> seqlevelsStyle(si) <- "UCSC"
> si
Seqinfo object with 7 sequences (1 circular) from ce11 genome:
  seqnames seqlengths isCircular genome
  chrI       15072434      FALSE   ce11
  chrII      15279421      FALSE   ce11
  chrIII     13783801      FALSE   ce11
  chrIV      17493829      FALSE   ce11
  chrV       20924180      FALSE   ce11
  chrX       17718942      FALSE   ce11
  chrM          13794       TRUE   ce11
> seqlevelsStyle(si) <- "RefSeq"
> si
Seqinfo object with 7 sequences (1 circular) from WBcel235 genome:
  seqnames     seqlengths isCircular   genome
  NC_003279.8    15072434      FALSE WBcel235
  NC_003280.10   15279421      FALSE WBcel235
  NC_003281.10   13783801      FALSE WBcel235
  NC_003282.8    17493829      FALSE WBcel235
  NC_003283.11   20924180      FALSE WBcel235
  NC_003284.9    17718942      FALSE WBcel235
  NC_001328.1       13794       TRUE WBcel235
> seqlevelsStyle(si) <- "NCBI"
> stopifnot(identical(si0, si))
> 
> si <- Seqinfo(genome="macFas5")
> si
Seqinfo object with 7601 sequences (1 circular) from macFas5 genome:
  seqnames       seqlengths isCircular  genome
  chr1            227556264      FALSE macFas5
  chr2            192460366      FALSE macFas5
  chr3            192294377      FALSE macFas5
  chr4            170955103      FALSE macFas5
  chr5            189454096      FALSE macFas5
  ...                   ...        ...     ...
  chrUn_KE147191       5581      FALSE macFas5
  chrUn_KE147192      21004      FALSE macFas5
  chrUn_KE147193       5184      FALSE macFas5
  chrUn_KE147194       8082      FALSE macFas5
  chrUn_KE147195       7181      FALSE macFas5
> seqlevelsStyle(si) <- "NCBI"
Error in download.file(url, destfile, quiet = TRUE) : 
  cannot open URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/364/345/GCA_000364345.1_Macaca_fascicularis_5.0/GCA_000364345.1_Macaca_fascicularis_5.0_assembly_report.txt'
Calls: seqlevelsStyle<- ... .fetch_assembly_report_from_url -> fetch_table_from_url
Execution halted

This is what my script currently looks like

ORGANISM <- "Felis catus"

### List of assemblies by date.
ASSEMBLIES <- list(
    list(assembly="Felis_catus_9.0",
         date="2017/11/20",
         extra_info=c(breed="Abyssinian"),
         assembly_accession="GCF_000181335.3",  # felCat9
         circ_seqs=c("MT"))
)

Thank you

hpages commented 2 years ago

@kakopo Great to hear that the discussion in issue #44 helped you, and that Prisca was able to assist you via Slack!

To me, your Felis_catus.R file looks good. I don't know why you're getting this error in the examples of the seqlevelsStyle man page. Looks like some connectivity issue with the NCBI FTP site (ftp.ncbi.nlm.nih.gov). Probably a temporary thing.

Try to run the examples in an interactive R session by loading the GenomeInfoDb package and doing:

example(seqlevelsStyle)

This will run all the examples in the seqlevelsStyle man page (?seqlevelsStyle). Hopefully that works.

Then try R CMD check --no-tests again and maybe this time you'll have better luck. Yeah, examples and unit tests that access the internet can fail sometimes, in unpredictable ways. :disappointed:

If after a couple of attempts, you still run into this problem, then ignore.

I would just avoid wrapping "MT" inside c(), because c("MT") is the same as "M" (try identical(c("MT"), "M")), so you don't need it (it's actually not considered good R programming style).

Then add the file to git, commit, push, and submit a PR.

Thanks!

kakopo commented 2 years ago

example(seqlevelsStyle) worked instantenously. Thank you. I hadn't realised I'd left the c("MT") part of the code - I blindly referenced another script and had no idea I had kept it in. Definitely a lesson in being more intentional! Otherwise, my curiosity lies in the fact that I noticed the fact that NCBI has 2 versions of the Felis catus assembly, with the more recent version containing a slightly updated GenBank assembly accession (but no RefSeq assembly accession). I used the older version, but what difference would using the newer one make, given that it seems incomplete? I've also been meaning to ask what exactly circ_seqs means within the code. I've just submitted the PR. Thanks for everything!

hpages commented 2 years ago

Hi @kakopo

example(seqlevelsStyle) worked instantenously.

Excellent!

I've also been meaning to ask what exactly circ_seqs means within the code.

circ_seqs stands for "circular sequences". All the sequences in a genome assembly are DNA sequences. Most of the time, DNA molecules are linear, but sometimes they are circular. For example, the autosomes in Human (that is, chromosomes 1-22, X, and Y) are linear DNA molecules. But the mitochondrial chromosome, which is a very small DNA molecule compared to the autosomes, is circular. Scaffolds are always considered linear.

For Cat, it's going to be the same: all the sequences in the assembly are linear, except for MT. We can check that MT is circular as follow:

So now we have confirmation that the MT sequence in the Felis_catus_9.0 assembly is circular. Note that we've also marked the mitochondrial chromosome in UCSC genome felCat9 as circular (we'd better be consistent! :wink: ), by putting the following line in felCat9.R:

CIRC_SEQS <- "chrM"

chrM in felCat9 is the same sequence as MT in Felis_catus_9.0, only the names are different.

See https://en.wikipedia.org/wiki/Mitochondrial_DNA if you are interested to learn more about Mitochondrial DNA.

The two most important things to remember are:

I noticed the fact that NCBI has 2 versions of the Felis catus assembly

I suppose you mean they have 2 assemblies named Felis_catus_9.0. Good catch!

The fact that they have several assemblies for the same organism (Felis catus in this case) is not a problem and is pretty common. The NCBI assembly page for GCF_000181335.3 has this blue box that says that there are 5 assemblies for this organism:

Screenshot from 2022-10-21 09-15-03

If you click on the link in that box, you will see the list of all NCBI assemblies for Felis catus.

But what's unusual and confusing here is that 2 of those assemblies are named Felis_catus_9.0. I honestly don't know why. Let's remember that NCBI is just a huge centralized repository where people can submit assemblies (these are usually the people working on the assemblies). I guess people are free to name the assembly as they want. However, a given submission will always have a unique GenBank or RefSeq assembly accession.

The lesson here is that assembly names are unreliable and messy. The only thing that we should rely on to identify an NCBI assembly unambiguously is a GenBank or RefSeq assembly accession.

In this particular case, we deliberately choose to register the Felis_catus_9.0 assembly associated with GCF_000181335.3, and not the other one, because this is the assembly that UCSC genome felCat9 is based on.

All is good with your PR. Just a very minor cosmetic detail. If you look at the Files changed, it shows a little red circle below the last line of the file. If you move the mouse over it, a little popup will say what the problem is. Can you please address that?

Thanks

kakopo commented 2 years ago

Quite interesting! I've just done a bit of light reading on the difference between circular and linear chromosomes and everything is definitely making much more sense. This entire field is quite complex, and quite interesting to dive into. Thank you for answering my questions, and for the resources, always. I had made changes to get rid of the red circle. I am unsure as to why it is not reflecting as such, especially since I added the new line locally and pushed once more. My most recent commit is from 2 hours ago. Is this the current file on your end?

Screenshot from 2022-10-21 21-13-29

hpages commented 2 years ago

Looks good on my end. See https://github.com/Bioconductor/GenomeInfoDb/pull/64/files

PR #64 merged, thanks @kakopo

Next task in your group is issue #51. Whenever you are ready, go there and ask to be assigned.

kakopo commented 2 years ago

Awesome, thank you as well