Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome
7 stars 9 forks source link

Proposed task for Outreachy applicants: Forge BSgenome data package for NCBI assembly Felis_catus_9.0 #43

Closed hpages closed 1 year ago

hpages commented 1 year ago

This task depends on this issue and issue #42 being completed first (i.e. PRs accepted and merged, and issues closed). Although it's not a requirement that the 3 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.

BSgenome data packages are one of the many types of annotation packages available in Bioconductor. They contain the genomic sequences, which comprise chromosome sequences and other DNA sequences, of a particular genome assembly for a given organism. For example BSgenome.Amellifera.NCBI.AmelHAv3.1 is a BSgenome data package that contains the genomic sequences of NCBI assembly Amel_HAv3.1 (RefSeq assembly accession: GCF_003254395.2). Users can easily and efficiently access the sequences, or portions of the sequences, stored in these packages, via a common API implemented in the BSgenome software package.

This task's goal is to make a new BSgenome data package for NCBI assembly Felis_catus_9.0. The process of making such package is partially documented in the "How to forge a BSgenome data package" vignette from the BSgenome software package. The landing page for the BSgenome package contains a link to this vignette.

Here's some additional information

For an NCBI assembly, the sequence data can be found in the form of compressed FASTA files located in the FTP directory for RefSeq assembly or FTP directory for GenBank assembly associated with the assembly. You'll find the links to those FTP directories on the right side of the NCBI page of the Felis_catus_9.0 assembly, below the link to the Full sequence report.

Note that NCBI uses the .fna suffix for uncompressed FASTA files (UCSC uses the .fa suffix). For compressed FASTA files, NCBI and UCSC both add the .gz suffix to the .fna or .fa suffix. As you will see, NCBI can provide several .fna.gz files in the FTP directory for RefSeq assembly or FTP directory for GenBank assembly.

The file that contains all the DNA sequences for the assembly is expected to have a name that looks like this: <RefSeq assembly accession>_<assembly name>_genomic.fna.gz in the FTP directory for RefSeq assembly, and <GenBank assembly accession>_<assembly name>_genomic.fna.gz in the FTP directory for GenBank assembly.

How you will proceed

The first thing you should do is download the appropriate FASTA file. Choose the RefSeq one if Felis_catus_9.0 was registered with its RefSeq assembly accession, or the GenBank one if it was registered with its GenBank assembly accession.

Then load it into R with the readDNAStringSet() function from the Biostrings package. The object returned by readDNAStringSet() is a DNAStringSet object. Take a look at the names on this object (call names() on the object to extract them, or call head(names()) to look at the 6 first names only). As you can see, these names are long and a little bit cryptic. They don't exactly look like the sequence names returned by getChromInfoFromNCBI("Felis_catus_9.0") :cry: We'll take care of this below.

The fasta_to_sorted_2bit.R script

As we know, the forging process only supports the two following options for the input sequence data: a single 2bit file, or a collection of FASTA files with one file per sequence. This means that we won't be able to use the <assembly accession>_<assembly name>_genomic.fna.gz file as-is. We will first need to preprocess the file as follow: (1) either convert the file to the 2bit format, or (2) split the file into a collection of smaller FASTA files where each file contains only one sequence. It is recommended to do (1).

Here are some examples of R scripts that have been used to perform this conversion for other NCBI assemblies:

Note that these scripts also take care of two important things:

  1. They rename the sequences, that is, they replace the long and cryptic names on the DNAStringSet object returned by readDNAStringSet() with the sequence names returned by getChromInfoFromNCBI().
  2. They put the sequences in the same order as in the result returned by getChromInfoFromNCBI().

Choose one script, download it or copy it to your computer, rename it fasta_to_sorted_2bit_for_Felis_catus_9.0.R, and adapt it to make it work for your <assembly accession>_<assembly name>_genomic.fna.gz file. When you work on this, it will help you if you try to execute the script line by line, and take the time to study what each line is doing. Don't hesitate to inspect each intermediate result. Name the output file produced by your script Felis_catus_9.0.sorted.2bit.

Once the script is ready, please add it to your fork of the BSgenome package, in the inst/extdata/Outreachy/ folder.

The seed file

The seed file that you're going to prepare for the forging process should point to the 2bit file that you produced with your fasta_to_sorted_2bit.R script above.

Prepare the file and use it to forge the BSgenome data package.

Test the new package like you did with BSgenome.Fcatus.UCSC.felCat9. That is: install it, start R, load the package, and play around with it. Proceed with some "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.

Once you are confident that your new BSgenome data package works as expected, please add your seed file to your fork of the BSgenome package, in the inst/extdata/Outreachy/ folder.

Finally commit, push, and submit a PR.

kakopo commented 1 year ago

Hello @hpages. Could I please be assigned to this task?

hpages commented 1 year ago

Of course!

BTW, check the landing page for the GenomeInfoDb package (pay attention to the Author field): https://bioconductor.org/packages/devel/GenomeInfoDb :tada:

kakopo commented 1 year ago

I'm honestly elated, @hpages! I actually can't stop grinning ear to ear. Thank you! :grin: :grin:

kakopo commented 1 year ago

Hi @hpages. I think I'm done with forging the package. All (tentatively) works as expected on my end, and the content within the package folders looks dine. Looking for the source URL was a minor challenge (I just pasted a modified version of the URL I used in issue #42. I hope that's fine).

Running some of the tests was also a bit tricky. Editing this section of code as seen in issue #46

library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("felCat9", tablename="ncbiRefSeq")
txdb

was a point of query? How would one adapt this to carry out the same tests for GCF_000181335.3/Felis_catus_9.0? Is it even possible in this case?

Otherwise, I'm curious where exactly in the code the sequence names in the object storing the values from readDNAStringSet("GCF_000181335.3_Felis_catus_9.0_genomic.fna.gz") are changed to match those in getChromInfoFromNCBI("Felis_catus_9.0"). I suspect it is around line 17 and I think I've kind of got it- but I would be grateful for a small explanation to help me understand better (or correct me, if my suspicion is wrong :laughing:)

I have just submitted my script and seed file for a PR, and your perusal. I hope all is satisfactory.

hpages commented 1 year ago

Hi @kakopo ,

Running some of the tests was also a bit tricky. Editing this section of code as seen in issue #46

library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("felCat9", tablename="ncbiRefSeq")
txdb

was a point of query? How would one adapt this to carry out the same tests for GCF_000181335.3/Felis_catus_9.0? Is it even possible in this case?

Please take a look at what I posted here this morning for an updated (and simplified) version of what I'd like you to try. You'll need to adapt slightly to make it work for felCat9/Felis_catus_9.0. Let me know how it goes.

Otherwise, I'm curious where exactly in the code the sequence names in the object storing the values from readDNAStringSet("GCF_000181335.3_Felis_catus_9.0_genomic.fna.gz") are changed to match those in getChromInfoFromNCBI("Felis_catus_9.0"). I suspect it is around line 17 and I think I've kind of got it- but I would be grateful for a small explanation to help me understand better (or correct me, if my suspicion is wrong :laughing:)

Short answer: match() does that.

Detailed answer:

Your fasta_to_sorted_2bit_for_Felis_catus_9.0.R script does the following:

library(Biostrings)

### Download GCF_000181335_Felis_catus_9.0_genomic.fna.gz from
### https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/181/335/GCF_000181335.3_Felis_catus_9.0/GCF_000181335.3_Felis_catus_9.0_genomic.fna.gz
dna <- readDNAStringSet("GCF_000181335.3_Felis_catus_9.0_genomic.fna.gz")
head(names(dna)) #inspect first 6 names on object

### Check seqnames.
current_RefSeqAccn <- unlist(heads(strsplit(names(dna), " ", fixed=TRUE), n=1L))
library(GenomeInfoDb)
chrominfo <- getChromInfoFromNCBI("GCF_000181335.3")
expected_RefSeqAccn <- chrominfo[ , "RefSeqAccn"]
stopifnot(setequal(expected_RefSeqAccn, current_RefSeqAccn))

### Reorder sequences.
dna <- dna[match(expected_RefSeqAccn, current_RefSeqAccn)]

### Rename sequences.
names(dna) <- chrominfo[ , "SequenceName"]

### Export as 2bit.
library(rtracklayer)
export.2bit(dna, "Felis_catus.9.0.sorted.2bit")

The key to try to understand a piece of code like the above is to execute it line by line and to inspect each intermediate result. So let's do that:

library(Biostrings)
dna <- readDNAStringSet("GCF_000181335.3_Felis_catus_9.0_genomic.fna.gz")
dna
# DNAStringSet object of length 4508:
#            width seq                                        names               
#    [1] 242100913 ATCAGGAGATCTAGATGCCT...AGCACCTTCATGTTCCCAA NC_018723.3 Felis...
#    [2]     46965 CTTTCTTTTTCTAAAATTCT...ACCAATTATATGGGACTAG NW_019365239.1 Fe...
#    [3]     58068 AAATCGTGACACATGCTACA...CCTCCTGGGCCTTCTCAGC NW_019365240.1 Fe...
#    [4]     50743 AGTTATAGTAATCTTCCTAG...CTGCCTTCCTTTTCTTTTC NW_019365241.1 Fe...
#    [5]     22574 CATGATTTAGTGAAAACGTA...TCTATTTATCACATTTGTT NW_019365242.1 Fe...
#    ...       ... ...
# [4504]     11988 AATCATTGATGAGATTAGGA...TTGCATTTACCTAATGTAC NW_019369723.1 Fe...
# [4505]      6988 AGTATAAGCCTGGACCGAGT...AAGACAAAACAAACAAAAC NW_019369724.1 Fe...
# [4506]      6987 CATATCGTGGAGTGGGGGCA...TCACCTTGGGATAAAGGCG NW_019369725.1 Fe...
# [4507]     10984 ATCGTGCAGAAGTAGGTCTG...TATCTTTAAAACAGGAAAC NW_019369726.1 Fe...
# [4508]     17009 GGACTAATGACTAATCAGCC...TCAAATGGGACATCTCGAT NC_001700.1 Felis...

head(names(dna))
# [1] "NC_018723.3 Felis catus isolate Cinnamon breed Abyssinian chromosome A1, Felis_catus_9.0, whole genome shotgun sequence"                                                    
# [2] "NW_019365239.1 Felis catus isolate Cinnamon breed Abyssinian chromosome A1 unlocalized genomic scaffold, Felis_catus_9.0 chrA1_random_ctg382, whole genome shotgun sequence"
# [3] "NW_019365240.1 Felis catus isolate Cinnamon breed Abyssinian chromosome A1 unlocalized genomic scaffold, Felis_catus_9.0 chrA1_random_ctg280, whole genome shotgun sequence"
# [4] "NW_019365241.1 Felis catus isolate Cinnamon breed Abyssinian chromosome A1 unlocalized genomic scaffold, Felis_catus_9.0 chrA1_random_ctg334, whole genome shotgun sequence"
# [5] "NW_019365242.1 Felis catus isolate Cinnamon breed Abyssinian chromosome A1 unlocalized genomic scaffold, Felis_catus_9.0 chrA1_random_ctg671, whole genome shotgun sequence"
# [6] "NW_019365243.1 Felis catus isolate Cinnamon breed Abyssinian chromosome A1 unlocalized genomic scaffold, Felis_catus_9.0 chrA1_random_ctg330, whole genome shotgun sequence"

We see long and cryptic names on dna. The identifiers at the beginning of the names are the RefSeq identifiers of the sequences. We've seen them before, in the data frame returned by getChromInfoNCBI("Felis_catus_9.0"):

chrominfo <- getChromInfoFromNCBI("Felis_catus_9.0")
expected_RefSeqAccn <- chrominfo$RefSeqAccn
expected_RefSeqAccn[1:40]
#  [1] "NC_018723.3"    "NC_018724.3"    "NC_018725.3"    "NC_018726.3"   
#  [5] "NC_018727.3"    "NC_018728.3"    "NC_018729.3"    "NC_018730.3"   
#  [9] "NC_018731.3"    "NC_018732.3"    "NC_018733.3"    "NC_018734.3"   
# [13] "NC_018735.3"    "NC_018736.3"    "NC_018737.3"    "NC_018738.3"   
# [17] "NC_018739.3"    "NC_018740.3"    "NC_018741.3"    "NC_001700.1"   
# [21] "NW_019365248.1" "NW_019365240.1" "NW_019365243.1" "NW_019365239.1"
# [25] "NW_019365247.1" "NW_019365246.1" "NW_019365242.1" "NW_019365245.1"
# [29] "NW_019365241.1" "NW_019365244.1" "NW_019365269.1" "NW_019365271.1"
# [33] "NW_019365268.1" "NW_019365250.1" "NW_019365261.1" "NW_019365270.1"
# [37] "NW_019365265.1" "NW_019365272.1" "NW_019365254.1" "NW_019365262.1"

But as we can see, the sequences in dna are not in the same order as in the chrominfo data frame.

Let's extract the RefSeq identifiers from the names on dna:

current_RefSeqAccn <- unlist(heads(strsplit(names(dna), " ", fixed=TRUE), n=1L))

Here it's important to take the time to study how this extraction works. There are 3 steps involved: strsplit(), heads(), and then unlist(), in that order. Make sure to look at each intermediate result i.e. what's returned by strsplit() and then what's returned by heads(). After the unlist() step, we end up with a character vector (current_RefSeqAccn) that is parallel to dna, that is, dna and current_RefSeqAccn have the same length and the i-th element in the former corresponds to the i-th element in the latter. Let's check this on the 6 first and last elements of each object:

head(dna)
# DNAStringSet object of length 6:
#         width seq                                           names               
# [1] 242100913 ATCAGGAGATCTAGATGCCTG...GAAGCACCTTCATGTTCCCAA NC_018723.3 Felis...
# [2]     46965 CTTTCTTTTTCTAAAATTCTC...ACACCAATTATATGGGACTAG NW_019365239.1 Fe...
# [3]     58068 AAATCGTGACACATGCTACAT...GGCCTCCTGGGCCTTCTCAGC NW_019365240.1 Fe...
# [4]     50743 AGTTATAGTAATCTTCCTAGG...TCCTGCCTTCCTTTTCTTTTC NW_019365241.1 Fe...
# [5]     22574 CATGATTTAGTGAAAACGTAA...TTTCTATTTATCACATTTGTT NW_019365242.1 Fe...
# [6]     50951 TACTGAATCCTGTGATGTTGG...ATCCCAAACCACCTGCACATT NW_019365243.1 Fe...

head(current_RefSeqAccn)
# [1] "NC_018723.3"    "NW_019365239.1" "NW_019365240.1" "NW_019365241.1"
# [5] "NW_019365242.1" "NW_019365243.1"

tail(dna)
# DNAStringSet object of length 6:
#     width seq                                               names               
# [1] 10244 ACACAGTATCAGTCATGTGCAAT...GTTATATTATCAGGAGCTTGGTC NW_019369722.1 Fe...
# [2] 11988 AATCATTGATGAGATTAGGAAAT...TTAATTGCATTTACCTAATGTAC NW_019369723.1 Fe...
# [3]  6988 AGTATAAGCCTGGACCGAGTGCA...GAACAAGACAAAACAAACAAAAC NW_019369724.1 Fe...
# [4]  6987 CATATCGTGGAGTGGGGGCATTT...AAAGTCACCTTGGGATAAAGGCG NW_019369725.1 Fe...
# [5] 10984 ATCGTGCAGAAGTAGGTCTGTAG...ACTTTATCTTTAAAACAGGAAAC NW_019369726.1 Fe...
# [6] 17009 GGACTAATGACTAATCAGCCCAT...TCTCTCAAATGGGACATCTCGAT NC_001700.1 Felis...

tail(current_RefSeqAccn)
# [1] "NW_019369722.1" "NW_019369723.1" "NW_019369724.1" "NW_019369725.1"
# [5] "NW_019369726.1" "NC_001700.1"   

In other words, the RefSeq identifiers in current_RefSeqAccn are in the same order as in dna.

Now the RefSeq identifiers in current_RefSeqAccn are expected to be the same as those in expected_RefSeqAccn but not necessarily in the same order. This is confirmed by our sanity check:

stopifnot(setequal(expected_RefSeqAccn, current_RefSeqAccn))

It's important to note that we use setequal() instead of identical() in this sanity check, because the two vectors don't necessarily have their elements in the same order.

But we can use match() to match each element in expected_RefSeqAccn to its corresponding element in current_RefSeqAccn:

oo <- match(expected_RefSeqAccn, current_RefSeqAccn)

match() is a very powerful function in R. See ?match in the base package for more information.

The object returned by our call to match() above is an integer vector parallel to match's first argument:

head(oo)
# [1]  1 12 37 56 68 97

For each element in expected_RefSeqAccn, it tells us where to find that element in current_RefSeqAccn (i.e. at what position).

So for example, the 1st element in expected_RefSeqAccn is also the 1st element in current_RefSeqAccn, the 2nd element in expected_RefSeqAccn is the 12th element in current_RefSeqAccn, the 3rd element in expected_RefSeqAccn is the 37th element in current_RefSeqAccn, etc... We can easily check this:

expected_RefSeqAccn[1]
# [1] "NC_018723.3"
current_RefSeqAccn[1]
# [1] "NC_018723.3"

expected_RefSeqAccn[2]
# [1] "NC_018724.3"
current_RefSeqAccn[12]
# [1] "NC_018724.3"

expected_RefSeqAccn[3]
# [1] "NC_018725.3"
current_RefSeqAccn[37]
# [1] "NC_018725.3"

Here's an important property of the oo vector: if you use it to subset match's second argument, it will return a vector identical to match's first argument. Let's check that:

identical(current_RefSeqAccn[oo], expected_RefSeqAccn)
# [1] TRUE

This means that we can also use oo to reorder dna (because dna and current_RefSeqAccn are parallel). More precisely, if we use oo to subset dna, this will reorder the elements in dna to put them in the same order as in expected_RefSeqAccn.

dna <- dna[oo]

Let's take a look at dna:

dna
# DNAStringSet object of length 4508:
#            width seq                                        names               
#    [1] 242100913 ATCAGGAGATCTAGATGCCT...AGCACCTTCATGTTCCCAA NC_018723.3 Felis...
#    [2] 171471747 GGTCTGAGCTGTCAAGCACA...GGTGAAGCCTAATTTAGCA NC_018724.3 Felis...
#    [3] 143202405 GAGGCAGCGCCGACTCTGAG...TAGGGTTAGGGCTTAGGGT NC_018725.3 Felis...
#    [4] 208212889 TGATAACTGGTGTCCTATAG...GTGAGTCCATGAAGAATAG NC_018726.3 Felis...
#    [5] 155302638 TACCCATGGGGAACAAATAC...TTTTTTTTTTTTTTTTTTT NC_018727.3 Felis...
#    ...       ... ...
# [4504]     12747 TTGTGCTCCAGCCTTTCAAG...GGGGTTTACTTCTGTGCAA NW_019368042.1 Fe...
# [4505]      9876 CTGAATTACGCCACACATTA...TAATGGCAACAAAATCTAT NW_019367791.1 Fe...
# [4506]     22013 GTTTAACCTTGGCACGGTTG...TTTTGAATCACTATAGTAA NW_019368556.1 Fe...
# [4507]      8987 GCAGCGATTCTGGAGAGTGC...CTGAATGAATGGTGTCTGT NW_019368494.1 Fe...
# [4508]    419391 CCCTGCCAAGCAGAACTCTG...TTCGGGGCTTCATCTCTGC NW_019365585.1 Fe...

For what we can see, and based on their names, the sequences in dna now seem to be in the same order as in expected_RefSeqAccn:

head(expected_RefSeqAccn)
# [1] "NC_018723.3" "NC_018724.3" "NC_018725.3" "NC_018726.3" "NC_018727.3"
# [6] "NC_018728.3"

tail(expected_RefSeqAccn)
# [1] "NW_019368113.1" "NW_019368042.1" "NW_019367791.1" "NW_019368556.1"
# [5] "NW_019368494.1" "NW_019365585.1"

Now that the sequences in dna are ordered as in the chrominfo data frame, we can extract the official NCBI sequence names from the latter and put them on the former:

names(dna) <- chrominfo[ , "SequenceName"]

dna
# DNAStringSet object of length 4508:
#            width seq                                        names               
#    [1] 242100913 ATCAGGAGATCTAGATGCCT...AGCACCTTCATGTTCCCAA chrA1
#    [2] 171471747 GGTCTGAGCTGTCAAGCACA...GGTGAAGCCTAATTTAGCA chrA2
#    [3] 143202405 GAGGCAGCGCCGACTCTGAG...TAGGGTTAGGGCTTAGGGT chrA3
#    [4] 208212889 TGATAACTGGTGTCCTATAG...GTGAGTCCATGAAGAATAG chrB1
#    [5] 155302638 TACCCATGGGGAACAAATAC...TTTTTTTTTTTTTTTTTTT chrB2
#    ...       ... ...
# [4504]     12747 TTGTGCTCCAGCCTTTCAAG...GGGGTTTACTTCTGTGCAA chrUn_ctg1642
# [4505]      9876 CTGAATTACGCCACACATTA...TAATGGCAACAAAATCTAT chrUn_ctg3044
# [4506]     22013 GTTTAACCTTGGCACGGTTG...TTTTGAATCACTATAGTAA chrUn_ctg753
# [4507]      8987 GCAGCGATTCTGGAGAGTGC...CTGAATGAATGGTGTCTGT chrUn_ctg2757
# [4508]    419391 CCCTGCCAAGCAGAACTCTG...TTCGGGGCTTCATCTCTGC chrUn_Scaffold_147

Hope this helps.

Let me know if you have further questions.

hpages commented 1 year ago

How are things going @kakopo? I've not heard from you for a few days so I'm just checking. Any blocker or question? Don't hesitate. Thanks!

kakopo commented 1 year ago

Hi @hpages. Things are going okay. I've just fallen a bit ill so moving in and out of the hospital has made me unable to put in my 100% in the last few days! Still, I'm having a slow but optimistic recovery so its back to business! Earlier, I was focused on the task in issue #72 but I've been making slow progress. I shall direct my queries regarding that task there. Actually, as a result of my tinkering I think I altered my GenomeInfoDb package and was unable to access some functions, such as seqinfo(), till I redownloaded it from the Bioconductor repository. I was able to carry out the tests but ran into a new issue. Earlier, running

library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("felCat9", tablename="ncbiRefSeq")
txdb

was unable to work (actually, it returned this

Download the ncbiRefSeq table ... Error:  [0]
OK

and running txdb afterwards returned the same results as using txdb <- makeTxDbFromUCSC("felCat9", tablename="ncbiRefSeqCurated") so I am curious as to whether ncbiRefSeqCurated is an improved version of ncbiRefSeq and what exactly it is)

This time, I ran

library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("felCat9", tablename="ncbiRefSeqCurated")
txdb

which worked, but brought this warning;

Download the ncbiRefSeqCurated table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_cds_locs_from_UCSC_txtable(ucsc_txtable) :
  UCSC data anomaly in 16 transcript(s): the cds cumulative length is not a multiple of 3 for
  transcripts ‘NM_001309037.1’ ‘NM_001165900.1’ ‘NM_001243085.1’ ‘NM_001137656.1’
  ‘NM_001291089.1’ ‘NM_001009876.1’ ‘NM_001009367.1’ ‘NM_001009267.1’ ‘NM_001009334.1’
  ‘NM_001009371.2’ ‘NM_001079828.1’ ‘NP_008251.1’ ‘NP_008252.1’ ‘NP_008257.1’ ‘NP_008258.1’
  ‘NP_008260.1’

and consequently, running > identical(seqinfo(txdb), getChromInfoFromUCSC("felCat9", as.Seqinfo=TRUE)) returned [1] FALSE - which was expected, from the warning given earlier, but I am unsure why the warning appears.

> seqinfo(Fcatus)
Seqinfo object with 4508 sequences (1 circular) from Felis_catus_9.0 genome:
  seqnames           seqlengths isCircular          genome
  chrA1               242100913      FALSE Felis_catus_9.0
  chrA2               171471747      FALSE Felis_catus_9.0
  chrA3               143202405      FALSE Felis_catus_9.0
  chrB1               208212889      FALSE Felis_catus_9.0
  chrB2               155302638      FALSE Felis_catus_9.0
  ...                       ...        ...             ...
  chrUn_ctg1642           12747      FALSE Felis_catus_9.0
  chrUn_ctg3044            9876      FALSE Felis_catus_9.0
  chrUn_ctg753            22013      FALSE Felis_catus_9.0
  chrUn_ctg2757            8987      FALSE Felis_catus_9.0
  chrUn_Scaffold_147     419391      FALSE Felis_catus_9.0

I don't know if this is much of an issue, but I expected running seqinfo(txdb) to be in a similar format- to start with the alphabetical order of the chromosomes. Instead, the result seems somewhat changed, which is unexpected.

> seqinfo(txdb)
Seqinfo object with 4508 sequences (1 circular) from Felis_catus_9.0 genome:
  seqnames            seqlengths isCircular          genome
  chrX                 130557009      FALSE Felis_catus_9.0
  MT                       17009       TRUE Felis_catus_9.0
  chrX_random_ctg395       53285      FALSE Felis_catus_9.0
  chrX_random_ctg158      128502      FALSE Felis_catus_9.0
  chrX_random_ctg365       47922      FALSE Felis_catus_9.0
  ...                        ...        ...             ...
  chrF2_random_ctg405      44543      FALSE Felis_catus_9.0
  chrF2_random_ctg483      46796      FALSE Felis_catus_9.0
  chrF2_random_ctg296      55998      FALSE Felis_catus_9.0
  chrF2_random_ctg681      25257      FALSE Felis_catus_9.0
  chrF2_random_ctg552      32481      FALSE Felis_catus_9.0

I ran the rest of the tests and this happened.

> getSeq(Fcatus, transcript_ranges)
Error in mergeNamedAtomicVectors(genome(x), genome(y), what = c("sequence",  : 
  sequences chrA1, chrA2, chrA3, chrB1, chrB2, chrB3, chrB4, chrC1, chrC2, chrD1, chrD2, chrD3, chrD4, chrE1, chrE2, chrE3, chrF1, chrF2, chrX have incompatible genomes:
  - in 'x': felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9
  - in 'y': Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0

I wonder whether these issues may have come about because of my edits to the GenomeInfoDb package, or from something else? These new tests are helpful, and much more clear still. I've been able to catch previously unseen discrepancies. Thank you! I also really appreciated the explanation for how the fasta_to_sorted_2bit_for_Felis_catus_9.0.R script worked, especially;

Here's an important property of the oo vector: if you use it to subset match's second argument, it will return a vector identical to match's first argument. Let's check that:

identical(current_RefSeqAccn[oo], expected_RefSeqAccn) [1] TRUE

This means that we can also use oo to reorder dna (because dna and current_RefSeqAccn are parallel). More precisely, if we use oo to subset dna, this will reorder the elements in dna to put them in the same order as in expected_RefSeqAccn.

That's an amazing property! Its really cool that masterfully putting together different methods enhances their functionality- I had gathered that match() returned integer vectors showing where the elements "matched" so as to say, but did not see what exactly rearranged them. Thank you for the masterclass! It was very helpful and enlightening.

hpages commented 1 year ago

Hi @kakopo ,

I'm so sorry to hear that you got ill. I hope it's nothing too serious and that you'll recover quickly and fully.

Earlier, running

library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("felCat9", tablename="ncbiRefSeq")
txdb

was unable to work (actually, it returned this

Download the ncbiRefSeq table ... Error:  [0]
OK

Not totally unexpected. makeTxDbFromUCSC() retrieves the data it needs to create the TxDb object by querying the UCSC's MySQL server (it uses the RMariaDB package for that). Sometimes this retrieval fails because of connectivity problems that are outside our control. When this happens we usually see a more informative error message than the one you got here though. This Error: [0] you got is strange, never seen it before.

Also what doesn't help is that the function prints an OK. This is actually quite misleading, because makeTxDbFromUCSC() was not able to produce the TxDb object, so obviously things are not OK. I modified the makeTxDbFromUCSC() function in the GenomicFeatures package this morning to address this. See this commit if you are curious. The modified function is in GenomicFeatures 1.50.0, which will become available in the next 24h or so to BioC 3.16 users via BiocManager::install().

Anyways, when something like the above happens, I usually try to call the function again and it will eventually work.

and running txdb afterwards returned the same results as using txdb <- makeTxDbFromUCSC("felCat9", tablename="ncbiRefSeqCurated") so I am curious as to whether ncbiRefSeqCurated is an improved version of ncbiRefSeq and what exactly it is)

If you called makeTxDbFromUCSC() earlier, and if the call was successful and returned a TxDb object, then the txdb variable got created and its value set to the TxDb object returned by the function call. So when later on you call makeTxDbFromUCSC() again and it fails, then the txdb variable doesn't get modified, that is, it will still contain the original TxDb object. Here is very simple code that illustrates this:

foo <- function() {stop("something went wrong"); 55}

a <- 88

a <- foo()
# Error in foo() : something went wrong

## The call to foo() failed (i.e. raised an error) so didn't get a chance to terminate
## normally and to return the value 55. So 'a' didn't get modified.

a
# [1] 88

Note that this behavior is not specific to R. AFAIK we see the same thing in other programming languages like Python.

ncbiRefSeq and ncbiRefSeqCurated are the names of SQL tables on UCSC's MySQL server. The latter table contains a curated version of the former. The difference between the 2 tables is not really important with regards to what we're discussing here. To keep this short I'll just say that the latter is a subset of the data in the former, that consists of the data that is "trusted" only. From a practical point of view, this means that ncbiRefSeqCurated is much smaller (i.e. has less rows) than ncbiRefSeq, so is a little bit faster to download.

which worked, but brought this warning;

Download the ncbiRefSeqCurated table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_cds_locs_from_UCSC_txtable(ucsc_txtable) :
  UCSC data anomaly in 16 transcript(s): the cds cumulative length is not a multiple of 3 for
  transcripts ‘NM_001309037.1’ ‘NM_001165900.1’ ‘NM_001243085.1’ ‘NM_001137656.1’
  ‘NM_001291089.1’ ‘NM_001009876.1’ ‘NM_001009367.1’ ‘NM_001009267.1’ ‘NM_001009334.1’
  ‘NM_001009371.2’ ‘NM_001079828.1’ ‘NP_008251.1’ ‘NP_008252.1’ ‘NP_008257.1’ ‘NP_008258.1’
  ‘NP_008260.1’

and consequently, running > identical(seqinfo(txdb), getChromInfoFromUCSC("felCat9", as.Seqinfo=TRUE)) returned [1] FALSE - which was expected, from the warning given earlier, but I am unsure why the warning appears.

The warning is expected (I get it too) and should not affect what seqinfo(txdb) does on the returned object.

But you're right, identical(seqinfo(txdb), getChromInfoFromUCSC("felCat9", as.Seqinfo=TRUE)) returns FALSE (I just tried it and that's what I got too). Now that's not what I expected so I checked makeTxDbFromUCSC()'s implementation again and it looks like it's doing something wrong. So congratulations! You just found a bug in makeTxDbFromUCSC(). But also I'm sorry for that and I apologize for the confusion (I'll fix the bug right after I finish to type this long comment).

I expected running seqinfo(txdb) to be in a similar format- to start with the alphabetical order of the chromosomes. Instead, the result seems somewhat changed, which is unexpected.

You're right, this is unexpected. This is because of this bug in makeTxDbFromUCSC() that I mention above and that I'm going to fix ASAP.

I ran the rest of the tests and this happened.

getSeq(Fcatus, transcript_ranges)
Error in mergeNamedAtomicVectors(genome(x), genome(y), what = c("sequence",  : 
  sequences chrA1, chrA2, chrA3, chrB1, chrB2, chrB3, chrB4, chrC1, chrC2, chrD1, chrD2, chrD3, chrD4, chrE1, chrE2, chrE3, chrF1, chrF2, chrX have incompatible genomes:
  - in 'x': felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9, felCat9
  - in 'y': Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0, Felis_catus_9.0

This error is expected. Note that in this comment that you were following here I also wrote: "Does this work for you? What happens if we switch the sequence names in transcript_ranges back to the UCSC names (with seqlevelsStyle(transcript_ranges) <- "UCSC"). Does getSeq(Xtropicalis, transcript_ranges) still work? What error do you get?"

The purpose of this last exercise was to touch the topic of sequence names harmonization. The sequence names in Fcatus use a different convention than the sequence names in transcript_ranges. The former uses the NCBI names and the latter the UCSC names. Note that in this particular case the sequence names are actually the same for the chromosomes, but not for the scaffolds. In the general case though (i.e. for other assemblies), the NCBI and UCSC names are quite different. This is a frequent situation in biocinformatics and a constant source of frustration. The analyst often needs to harmonize the sequence names used in various datasets as a preliminary step to their analysis.

This harmonization of the sequence names is very easy to do in Bioconductor. We do it by calling the seqlevelsStyle() setter on one of the two objects. So in the above example, we could do this harmonization with:

seqlevelsStyle(Fcatus) <- seqlevelsStyle(transcript_ranges)

or with:

seqlevelsStyle(transcript_ranges) <- seqlevelsStyle(Fcatus)

It works both ways.

So try this again:

library(BSgenome.Fcatus.NCBI.9.0)
library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("felCat9", tablename="ncbiRefSeqCurated")
transcript_ranges <- transcripts(txdb)
getSeq(Fcatus, transcript_ranges)  # error!

and solve the problem by harmonizing the sequence names in Fcatus and transcript_ranges.

I'm going to fix that bug in makeTxDbFromUCSC() now...

kakopo commented 1 year ago

Awesome! I tried this out and everything worked out. I was a bit floored since the error message was different from what I'd expected, but harmonizing the sequence names did the trick. I can imagine how frustrating this might be for a bioinformatics analyst. I also ran txdb <- makeTxDbFromUCSC("felCat9", tablename="ncbiRefSeq") again and it returned output akin to txdb <- makeTxDbFromUCSC("felCat9", tablename="ncbiRefSeqCurated") so it must have been a connectivity issue, in spite of the error message (or lack of). I now see that the warning regarding the UCSC data anomaly in the transcripts is normal. Thank you for the explanation always!

hpages commented 1 year ago

Great! I'm closing this.

Don't forget to record the completion of this task on Outreachy.

Now moving on to https://github.com/Bioconductor/GenomeInfoDb/issues/72 where I see that you have some questions for me. I'll answer there shortly.