Proposed contribution task for Outreachy applicants: Register NCBI assembly Felis_catus_9.0

Felis_catus_9.0 is a Cat assembly available at NCBI: https://www.ncbi.nlm.nih.gov/assembly/GCF_000181335.3/

Note that Felis_catus_9.0 is the assembly that felCat9, the latest UCSC genome for Cat, is based on. See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find Cat in the UCSC species tree on the left, click on it, then make sure to select the latest Cat Assembly (felCat9). This will display a bunch of additional information about the felCat9 assembly. In particular, it will indicate what NCBI assembly this genome is based on. This information is the Accession ID field. This field is usually set to a GenBank (GCA_000*.*) or RefSeq (GCF_000*.*) accession number.

Note that many NCBI assemblies are already registered in the GenomeInfoDb package (223 as of October 2022!). The registered_NCBI_assemblies() function in GenomeInfoDb returns the list of all the NCBI assemblies that are currently registered in the package. An important thing to be aware of is that getChromInfoFromNCBI() still works on an unregistered assembly, but in "degraded" mode, that is:

The name of the assembly is not recognized, only look up by GenBank or RefSeq accession works.
The returned circularity flags are not guaranteed to be accurate. This potential inaccuracy is communicated to the user by placing NAs instead of FALSEs in the circular column of the returned data.frame.

Registering an assembly fixes that. In other words, once an NCBI assembly is registered in GenomeInfoDb, getChromInfoFromNCBI() will recognize its name and return accurate circularity flags.

See ?getChromInfoFromNCBI (after loading GenomeInfoDb) for more information.

Registering a new NCBI assembly for an organism that is already supported is only a matter of editing the corresponding file in GenomeInfoDb/inst/registered/NCBI_assemblies/. If this is a new organism, then we need to start a new file. See the other files for the naming scheme: the name of the file must be the full scientific name of the organism, with the underscore used as separator, and with the first letter capitalized. File extension must be .R.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

Hello, I would like to work on this task

Hi @Linjelium11, these tasks are for Outreachy applicants. Are you an Outreachy applicant? If so, please contact me using my contact information provided on the Outreachy site and introduce yourself. Looking forward to hear from you. Thank you!

Hello, this is Iman again. I would like to be assigned to this task.

@Iman-L This task is part of the "Cat" group. You'll be working on the "Gorilla" group. But first see the email I'm about to send you. Thanks

Hello @hpages could I please be assigned this task?

Done. I'm ready to answer your questions, don't hesitate!

Hi @hpages. I am (tentatively) done with this task. I ran into a few issues, at least regarding unexpected output, but thankfully, I was able to refer to your conversation with @Priceless-P regarding issue #44. Turns out, I was facing the same exact problem and it helped. (Prisca has also assisted me tremendously via Slack). Everything prior to testing the package is as expected (via command line). Running R CMD build worked fine, but R CMD check GenomeInfoDb_1.33.13.tar.gz --no-tests brings about this output

* checking examples ... ERROR
Running examples in ‘GenomeInfoDb-Ex.R’ failed
The error most likely occurred in:

> ### Name: seqlevelsStyle
> ### Title: Conveniently rename the seqlevels of an object according to a
> ###   given style
> ### Aliases: seqlevelsStyle seqlevelsStyle<- seqlevelsStyle,ANY-method
> ###   seqlevelsStyle<-,ANY-method seqlevelsStyle,Seqinfo-method
> ###   seqlevelsStyle<-,Seqinfo-method seqlevelsStyle,character-method
> ###   seqlevelsStyle<-,character-method genomeStyles extractSeqlevels
> ###   extractSeqlevelsByGroup mapSeqlevels seqlevelsInGroup
> 
> ### ** Examples
> 
> ## ---------------------------------------------------------------------
> ## seqlevelsStyle() getter and setter
> ## ---------------------------------------------------------------------
> 
> ## On a character vector:
> x <- paste0("chr", 1:5)
> seqlevelsStyle(x)
[1] "UCSC"
> seqlevelsStyle(x) <- "NCBI"
> x
[1] "1" "2" "3" "4" "5"
> 
> ## On a GRanges object:
> library(GenomicRanges)
> gr <- GRanges(rep(c("chr2", "chr3", "chrM"), 2), IRanges(1:6, 10))
> 
> seqlevelsStyle(gr)
[1] "UCSC"
> seqlevelsStyle(gr) <- "NCBI"
> gr
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        2      1-10      *
  [2]        3      2-10      *
  [3]       MT      3-10      *
  [4]        2      4-10      *
  [5]        3      5-10      *
  [6]       MT      6-10      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
> 
> seqlevelsStyle(gr)
[1] "NCBI"    "Ensembl"
> seqlevelsStyle(gr) <- "dbSNP"
> gr
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]      ch2      1-10      *
  [2]      ch3      2-10      *
  [3]     chMT      3-10      *
  [4]      ch2      4-10      *
  [5]      ch3      5-10      *
  [6]     chMT      6-10      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
> 
> seqlevelsStyle(gr)
[1] "dbSNP"
> seqlevelsStyle(gr) <- "UCSC"
> gr
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2      1-10      *
  [2]     chr3      2-10      *
  [3]     chrM      3-10      *
  [4]     chr2      4-10      *
  [5]     chr3      5-10      *
  [6]     chrM      6-10      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
> 
> ## In general the seqlevelsStyle() setter doesn't know how to rename
> ## scaffolds. However, if the genome is specified, it's very likely
> ## that seqlevelsStyle() will be able to take advantage of that:
> gr <- GRanges(rep(c("2", "Y", "Hs6_111610_36"), 2), IRanges(1:6, 10))
> genome(gr) <- "NCBI36"
> seqlevelsStyle(gr) <- "UCSC"
> gr
GRanges object with 6 ranges and 0 metadata columns:
           seqnames    ranges strand
              <Rle> <IRanges>  <Rle>
  [1]          chr2      1-10      *
  [2]          chrY      2-10      *
  [3] chr6_cox_hap1      3-10      *
  [4]          chr2      4-10      *
  [5]          chrY      5-10      *
  [6] chr6_cox_hap1      6-10      *
  -------
  seqinfo: 3 sequences from hg18 genome; no seqlengths
> 
> ## On a Seqinfo object:
> si <- si0 <- Seqinfo(genome="apiMel2")
> si
Seqinfo object with 17 sequences from apiMel2 genome:
  seqnames seqlengths isCircular  genome
  Group1     25433307      FALSE apiMel2
  Group2     14233499      FALSE apiMel2
  Group3     11790680      FALSE apiMel2
  Group4     10665743      FALSE apiMel2
  Group5     13312070      FALSE apiMel2
  ...             ...        ...     ...
  Group13     8846036      FALSE apiMel2
  Group14     8017760      FALSE apiMel2
  Group15     7326077      FALSE apiMel2
  Group16     5608962      FALSE apiMel2
  GroupUn   321143811      FALSE apiMel2
> seqlevelsStyle(si) <- "NCBI"
Warning in (function (seqlevels, genome, new_style)  :
  cannot switch some of apiMel2's seqlevels from UCSC to NCBI style
> si
Seqinfo object with 17 sequences from 2 genomes (Amel_2.0, apiMel2):
  seqnames seqlengths isCircular   genome
  LG1        25433307      FALSE Amel_2.0
  LG2        14233499      FALSE Amel_2.0
  LG3        11790680      FALSE Amel_2.0
  LG4        10665743      FALSE Amel_2.0
  LG5        13312070      FALSE Amel_2.0
  ...             ...        ...      ...
  LG13        8846036      FALSE Amel_2.0
  LG14        8017760      FALSE Amel_2.0
  LG15        7326077      FALSE Amel_2.0
  LG16        5608962      FALSE Amel_2.0
  GroupUn   321143811      FALSE  apiMel2
> seqlevelsStyle(si) <- "RefSeq"
Warning in (function (seqlevels, genome, new_style)  :
  cannot switch apiMel2's seqlevels from UCSC to RefSeq style
> si
Seqinfo object with 17 sequences from 2 genomes (Amel_2.0, apiMel2):
  seqnames    seqlengths isCircular   genome
  NC_007070.1   25433307      FALSE Amel_2.0
  NC_007071.1   14233499      FALSE Amel_2.0
  NC_007072.1   11790680      FALSE Amel_2.0
  NC_007073.1   10665743      FALSE Amel_2.0
  NC_007074.1   13312070      FALSE Amel_2.0
  ...                ...        ...      ...
  NC_007082.1    8846036      FALSE Amel_2.0
  NC_007083.1    8017760      FALSE Amel_2.0
  NC_007084.1    7326077      FALSE Amel_2.0
  NC_007085.1    5608962      FALSE Amel_2.0
  GroupUn      321143811      FALSE  apiMel2
> seqlevelsStyle(si) <- "UCSC"
> stopifnot(identical(si0, si))
> 
> si <- si0 <- Seqinfo(genome="WBcel235")
> si
Seqinfo object with 7 sequences (1 circular) from WBcel235 genome:
  seqnames seqlengths isCircular   genome
  I          15072434      FALSE WBcel235
  II         15279421      FALSE WBcel235
  III        13783801      FALSE WBcel235
  IV         17493829      FALSE WBcel235
  V          20924180      FALSE WBcel235
  X          17718942      FALSE WBcel235
  MT            13794       TRUE WBcel235
> seqlevelsStyle(si) <- "UCSC"
> si
Seqinfo object with 7 sequences (1 circular) from ce11 genome:
  seqnames seqlengths isCircular genome
  chrI       15072434      FALSE   ce11
  chrII      15279421      FALSE   ce11
  chrIII     13783801      FALSE   ce11
  chrIV      17493829      FALSE   ce11
  chrV       20924180      FALSE   ce11
  chrX       17718942      FALSE   ce11
  chrM          13794       TRUE   ce11
> seqlevelsStyle(si) <- "RefSeq"
> si
Seqinfo object with 7 sequences (1 circular) from WBcel235 genome:
  seqnames     seqlengths isCircular   genome
  NC_003279.8    15072434      FALSE WBcel235
  NC_003280.10   15279421      FALSE WBcel235
  NC_003281.10   13783801      FALSE WBcel235
  NC_003282.8    17493829      FALSE WBcel235
  NC_003283.11   20924180      FALSE WBcel235
  NC_003284.9    17718942      FALSE WBcel235
  NC_001328.1       13794       TRUE WBcel235
> seqlevelsStyle(si) <- "NCBI"
> stopifnot(identical(si0, si))
> 
> si <- Seqinfo(genome="macFas5")
> si
Seqinfo object with 7601 sequences (1 circular) from macFas5 genome:
  seqnames       seqlengths isCircular  genome
  chr1            227556264      FALSE macFas5
  chr2            192460366      FALSE macFas5
  chr3            192294377      FALSE macFas5
  chr4            170955103      FALSE macFas5
  chr5            189454096      FALSE macFas5
  ...                   ...        ...     ...
  chrUn_KE147191       5581      FALSE macFas5
  chrUn_KE147192      21004      FALSE macFas5
  chrUn_KE147193       5184      FALSE macFas5
  chrUn_KE147194       8082      FALSE macFas5
  chrUn_KE147195       7181      FALSE macFas5
> seqlevelsStyle(si) <- "NCBI"
Error in download.file(url, destfile, quiet = TRUE) : 
  cannot open URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/364/345/GCA_000364345.1_Macaca_fascicularis_5.0/GCA_000364345.1_Macaca_fascicularis_5.0_assembly_report.txt'
Calls: seqlevelsStyle<- ... .fetch_assembly_report_from_url -> fetch_table_from_url
Execution halted

This is what my script currently looks like

ORGANISM <- "Felis catus"

### List of assemblies by date.
ASSEMBLIES <- list(
    list(assembly="Felis_catus_9.0",
         date="2017/11/20",
         extra_info=c(breed="Abyssinian"),
         assembly_accession="GCF_000181335.3",  # felCat9
         circ_seqs=c("MT"))
)

Thank you

@kakopo Great to hear that the discussion in issue #44 helped you, and that Prisca was able to assist you via Slack!

To me, your Felis_catus.R file looks good. I don't know why you're getting this error in the examples of the seqlevelsStyle man page. Looks like some connectivity issue with the NCBI FTP site (ftp.ncbi.nlm.nih.gov). Probably a temporary thing.

Try to run the examples in an interactive R session by loading the GenomeInfoDb package and doing:

example(seqlevelsStyle)

This will run all the examples in the seqlevelsStyle man page (?seqlevelsStyle). Hopefully that works.

Then try R CMD check --no-tests again and maybe this time you'll have better luck. Yeah, examples and unit tests that access the internet can fail sometimes, in unpredictable ways. :disappointed:

If after a couple of attempts, you still run into this problem, then ignore.

I would just avoid wrapping "MT" inside c(), because c("MT") is the same as "M" (try identical(c("MT"), "M")), so you don't need it (it's actually not considered good R programming style).

Then add the file to git, commit, push, and submit a PR.

Thanks!

example(seqlevelsStyle) worked instantenously. Thank you. I hadn't realised I'd left the c("MT") part of the code - I blindly referenced another script and had no idea I had kept it in. Definitely a lesson in being more intentional! Otherwise, my curiosity lies in the fact that I noticed the fact that NCBI has 2 versions of the Felis catus assembly, with the more recent version containing a slightly updated GenBank assembly accession (but no RefSeq assembly accession). I used the older version, but what difference would using the newer one make, given that it seems incomplete? I've also been meaning to ask what exactly circ_seqs means within the code. I've just submitted the PR. Thanks for everything!

Hi @kakopo

example(seqlevelsStyle) worked instantenously.

Excellent!

I've also been meaning to ask what exactly circ_seqs means within the code.

circ_seqs stands for "circular sequences". All the sequences in a genome assembly are DNA sequences. Most of the time, DNA molecules are linear, but sometimes they are circular. For example, the autosomes in Human (that is, chromosomes 1-22, X, and Y) are linear DNA molecules. But the mitochondrial chromosome, which is a very small DNA molecule compared to the autosomes, is circular. Scaffolds are always considered linear.

For Cat, it's going to be the same: all the sequences in the assembly are linear, except for MT. We can check that MT is circular as follow:

Go to the NCBI assembly page for assembly GCF_000181335.3 and follow the link to the "Full sequence report" on the right of the page. See my comment here about what the "Full sequence report" is about.
Find the entry for MT in the report (tip: it's usually at the very bottom of the file).
Pick up the RefSeq accession number for this sequence. This is the 7th column in the file (the RefSeq-Accn column).
Use Google to lookup that number. The first result Google will return is a link to a record in the NCBI nucleotide database. Click on that link. This will take you to a page with title "Felis catus mitochondrion, complete genome".
Right under the title, check the "NCBI Reference Sequence: NC_001700.1" line. Make sure that it displays the correct RefSeq accession number. There could be more than one record with title "Felis catus mitochondrion, complete genome" in the NCBI nucleotide database, so we want to make sure that we are looking at the right one.
Let's look at the first line in the record, the LOCUS line. It says the following:
```
LOCUS       NC_001700              17009 bp    DNA     circular MAM 07-JUN-2021
```
This tells us that this sequence is a DNA sequence of length 17009 bp ("bp" stands for base pairs), and that it's a circular sequence.

At the bottom of the page, the ORIGIN section shows the sequence itself:

ORIGIN      
        1 ggactaatga ctaatcagcc catgatcaca cataactgtg gtgtcatgca tttggtattt
       61 tttattttta gggggtcgaa cttgctatga ctcagctatg acctaaaggt cctgactcag
      121 tcaaatatat tgttgctggg cttattctct atgcggggtc tccacacgca cagacagtca
    ...
    ...
    ...
    16861 cttgctcgaa cgtgtacctc ttctcgctcc gggcccattt caacgtgggg gtttctataa
    16921 cggaactata cctggcatct ggttcttact tcagggccat aaaatccttg aaaccaatcc
    16981 ttcagttctc tcaaatggga catctcgat

So now we have confirmation that the MT sequence in the Felis_catus_9.0 assembly is circular. Note that we've also marked the mitochondrial chromosome in UCSC genome felCat9 as circular (we'd better be consistent! :wink: ), by putting the following line in felCat9.R:

CIRC_SEQS <- "chrM"

chrM in felCat9 is the same sequence as MT in Felis_catus_9.0, only the names are different.

See https://en.wikipedia.org/wiki/Mitochondrial_DNA if you are interested to learn more about Mitochondrial DNA.

The two most important things to remember are:

The regular chromosomes (a.k.a. autosomes) are found in the nucleus of the cell. They are huge linear DNA sequences, (typically millions of base pairs, e.g. 242 million bp for Felis_catus_9.0!).
The mitochondrial chromosome is found in the mitochondrion, which is found outside the nucleus of the cell. It's a very small DNA sequence compared to the autosomes (typically a few thousands of base pairs only, e.g. 17009 bp for Felis_catus_9.0), and it's circular.

I noticed the fact that NCBI has 2 versions of the Felis catus assembly

I suppose you mean they have 2 assemblies named Felis_catus_9.0. Good catch!

The fact that they have several assemblies for the same organism (Felis catus in this case) is not a problem and is pretty common. The NCBI assembly page for GCF_000181335.3 has this blue box that says that there are 5 assemblies for this organism:

Screenshot from 2022-10-21 09-15-03

If you click on the link in that box, you will see the list of all NCBI assemblies for Felis catus.

But what's unusual and confusing here is that 2 of those assemblies are named Felis_catus_9.0. I honestly don't know why. Let's remember that NCBI is just a huge centralized repository where people can submit assemblies (these are usually the people working on the assemblies). I guess people are free to name the assembly as they want. However, a given submission will always have a unique GenBank or RefSeq assembly accession.

The lesson here is that assembly names are unreliable and messy. The only thing that we should rely on to identify an NCBI assembly unambiguously is a GenBank or RefSeq assembly accession.

In this particular case, we deliberately choose to register the Felis_catus_9.0 assembly associated with GCF_000181335.3, and not the other one, because this is the assembly that UCSC genome felCat9 is based on.

All is good with your PR. Just a very minor cosmetic detail. If you look at the Files changed, it shows a little red circle below the last line of the file. If you move the mouse over it, a little popup will say what the problem is. Can you please address that?

Thanks

Quite interesting! I've just done a bit of light reading on the difference between circular and linear chromosomes and everything is definitely making much more sense. This entire field is quite complex, and quite interesting to dive into. Thank you for answering my questions, and for the resources, always. I had made changes to get rid of the red circle. I am unsure as to why it is not reflecting as such, especially since I added the new line locally and pushed once more. My most recent commit is from 2 hours ago. Is this the current file on your end?

Screenshot from 2022-10-21 21-13-29

Looks good on my end. See https://github.com/Bioconductor/GenomeInfoDb/pull/64/files

PR #64 merged, thanks @kakopo

Next task in your group is issue #51. Whenever you are ready, go there and ask to be assigned.

Awesome, thank you as well

Bioconductor / GenomeInfoDb

Proposed contribution task for Outreachy applicants: Register NCBI assembly Felis_catus_9.0 #50