Closed hpages closed 1 year ago
Please assign me to this task. Thank you
Done.
Don't forget to record #40 on Outreachy.
Done.
Don't forget to record #40 on Outreachy.
I will do that sir. Thank you
Sir, I have been able to create the UCB_Xtro_10.0.sorted.2bit
file and forged the package, but when I run R CMD build BSgenome.Xtropicalis.NCBI.UCB_Xtro_10.0
, I get the error below.
* preparing ‘BSgenome.Xtropicalis.NCBI.UCB_Xtro_10.0’:
* checking DESCRIPTION meta-information ... ERROR
Malformed package name```
Below is the content of my seed file:
Package: BSgenome.Xtropicalis.NCBI.UCB_Xtro_10.0
Title: Full genome sequences for Xenopos tropicalis (NCBI version UCB_Xtro_10.0)
Description: Full genome sequences for Xenopus tropicalis as provided by NCBI (UCB_Xtro_10.0, 2019/11/14) and stored in Biostrings objects.
Version: 0.1.0
organism: Xenopus tropicalis
common_name: Tropical clawed frog
genome: UCB_Xtro_10.0
provider: NCBI
release_date: 2019/11/14
source_url: https://www.ncbi.nlm.nih.gov/assembly/GCF_000004195.4/
organism_biocview: Xenopus_tropicalis
BSgenomeObjname: Xtropicalis
SrcDataFiles: GCF_000004195.4_UCB_Xtro_10.0_genomic.fna.gz from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/004/195/GCF_000004195.4_UCB_Xtro_10.0/
PkgExamples: genome$chr1 # same as genome[["chr1"]]
seqs_srcdir: /home/emmanuel/seqdatafile
seqfile_name: UCB_Xtro_10.0.sorted.2bit```.
I tried changing the name of the seed file to BSgenome.Xtropicalis.NCBI.10.0-seed
and changing the package name to BSgenome.Xtropicalis.NCBI.10.0
in the seed but I dont know if the name follows the NCBI standard. Nonetheless, it was able to build successfully but when i ran R CMD check
on the tarball produced from building, i got the error below:
checking examples ... ERROR
Running examples in ‘BSgenome.Xtropicalis.NCBI.10.0-Ex.R’ failed
The error most likely occurred in:
> ### Name: BSgenome.Xtropicalis.NCBI.10.0
> ### Title: Full genome sequences for Xenopos tropicalis (NCBI version
> ### UCB_Xtro_10.0)
> ### Aliases: BSgenome.Xtropicalis.NCBI.10.0-package
> ### BSgenome.Xtropicalis.NCBI.10.0 Xtropicalis
> ### Keywords: package data
>
> ### ** Examples
>
> BSgenome.Xtropicalis.NCBI.10.0
Tropical clawed frog genome:
# organism: Xenopus tropicalis (Tropical clawed frog)
# genome: UCB_Xtro_10.0
# provider: NCBI
# release date: 2019/11/14
# 167 sequences:
# Chr1 Chr2 Chr3 Chr4 Chr5 Chr6 Chr7 Chr8 Chr9 Chr10 MT
# Sca1 Sca2 Sca3 Sca4 Sca5 Sca6 Sca7 Sca8 Sca9 Sca10 Sca11
# Sca12 Sca13 Sca14 Sca15 Sca16 Sca17 Sca18 Sca19 Sca20 Sca21 Sca22
# Sca23 Sca24 Sca25 Sca26 Sca27 Sca28 Sca29 Sca30 Sca31 Sca32 Sca33
# Sca34 Sca35 Sca36 Sca37 Sca38 Sca39 Sca40 Sca41 Sca42 Sca43 Sca44
# ... ... ... ... ... ... ... ... ... ... ...
# Sca111 Sca112 Sca113 Sca114 Sca115 Sca116 Sca117 Sca118 Sca119 Sca120 Sca121
# Sca122 Sca123 Sca124 Sca125 Sca126 Sca127 Sca128 Sca129 Sca130 Sca131 Sca132
# Sca133 Sca134 Sca135 Sca136 Sca137 Sca138 Sca139 Sca140 Sca141 Sca142 Sca143
# Sca144 Sca145 Sca146 Sca147 Sca148 Sca149 Sca150 Sca151 Sca152 Sca153 Sca154
# Sca155 Sca156
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> genome <- BSgenome.Xtropicalis.NCBI.10.0
> head(seqlengths(genome))
Chr1 Chr2 Chr3 Chr4 Chr5 Chr6
217471166 181034961 153873357 153961319 164033575 154486312
> genome[["chr1"]]
Error in genome[["chr1"]] : no such sequence
Calls: [[ -> [[
Execution halted
Hi @Simplecodez
The error you got earlier (Malformed package name
) was because you had underscores (_
) in the name of the package: BSgenome.Xtropicalis.NCBI.UCB_Xtro_10.0
. R does not allow that. In this case, you can just remove the underscores from the package name. This will lead to the following name: BSgenome.Xtropicalis.NCBI.UCBXtro10.0
.
Note that we try to have the last part in the package name to be the same as the assembly name. However, as you can see, it's not always possible. In the case of BSgenome.Xtropicalis.NCBI.UCBXtro10.0
, the last part is close but not exactly equal to the assembly name. It's ok to do that. The package name is just a name and not using the exact assembly name for its last part doesn't have negative consequences in practice.
The new error you get (Error in genome[["chr1"]] : no such sequence
) should be self-explaining. It tells you that the genome
object does not contain the chr1
sequence. You need to use a valid sequence name in your example, that is, a sequence name that is in the UCB_Xtro_10.0
assembly.
Hope this helps
Hi @Simplecodez
The error you got earlier (
Malformed package name
) was because you had underscores (_
) in the name of the package:BSgenome.Xtropicalis.NCBI.UCB_Xtro_10.0
. R does not allow that. In this case, you can just remove the underscores from the package name. This will lead to the following name:BSgenome.Xtropicalis.NCBI.UCBXtro10.0
.Note that we try to have the last part in the package name to be the same as the assembly name. However, as you can see, it's not always possible. In the case of
BSgenome.Xtropicalis.NCBI.UCBXtro10.0
, the last part is close but not exactly equal to the assembly name. It's ok to do that. The package name is just a name and not using the exact assembly name for its last part doesn't have negative consequences in practice.The new error you get (
Error in genome[["chr1"]] : no such sequence
) should be self-explaining. It tells you that thegenome
object does not contain thechr1
sequence. You need to use a valid sequence name in your example, that is, a sequence name that is in theUCB_Xtro_10.0
assembly.Hope this helps
Yes, it did. I renamed it to
BSgenome.Xtropicalis.NCBI.UCBXtro10.0
and changed "c" to "C" in genome[["chr1"]]
Thank you
Now everything works fine. I have also created a PR. Thank you
Great!
Here are a few things that I'd like you to try.
After loading the BSgenome.Xtropicalis.NCBI.UCBXtro10.0 package, do seqinfo(Xtropicalis)
. This should return exactly the same thing as getChromInfoFromNCBI("UCB_Xtro_10.0", as.Seqinfo=TRUE)
. You can check this with:
identical(seqinfo(Xtropicalis), getChromInfoFromNCBI("UCB_Xtro_10.0", as.Seqinfo=TRUE))
Let's take a closer look at the circularity flags of the sequences in the Xtropicalis
object. You can extract those flags with isCircular(Xtropicalis)
(this is a shortcut for isCircular(seqinfo(Xtropicalis))
). Do they look as expected? Note that the vector returned by isCircular(Xtropicalis)
is a named logical vector i.e. a logical vector with names on it. You can call which()
on that vector to extract the positions of the TRUE
values. See ?which
in the base package for more information.
Right now the sequences in the Xtropicalis
object (a BSgenome object) are named according to the NCBI naming convention. We can see this with:
seqlevelsStyle(Xtropicalis) # NCBI
However, the sequences can easily be renamed to use the UCSC names. Try:
seqlevelsStyle(Xtropicalis) <- "UCSC"
seqinfo(Xtropicalis)
This works because of all the work you've done with the registration of xenTro10
and UCB_Xtro_10.0
, and because you linked them together.
You should be able to switch back to the NCBI names with:
seqlevelsStyle(Xtropicalis) # UCSC
seqlevelsStyle(Xtropicalis) <- "NCBI"
seqinfo(Xtropicalis)
The ability to switch easily from one naming convention to the other is a very popular feature in Bioconductor. There has always been a strong demand for it. getChromInfoFromUCSC(.., map.NCBI=TRUE)
is the workhorse behind it.
Note that our ability to rename sequences with the seqlevelsStyle()
setter is not restricted to BSgenome objects. It also works with other objects in Bioconductor, like TxDb objects:
library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("xenTro10", tablename="ncbiRefSeqCurated")
txdb
# TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: xenTro10
# Organism: Xenopus tropicalis
# Taxonomy ID: 8364
# UCSC Table: ncbiRefSeqCurated
# UCSC Track: NCBI RefSeq
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: no gene ids
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 8787
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-10-31 09:56:53 -0700 (Mon, 31 Oct 2022)
# GenomicFeatures version at creation time: 1.49.8
# RSQLite version at creation time: 2.2.18
# DBSCHEMAVERSION: 1.2
seqinfo(txdb)
# Seqinfo object with 167 sequences (1 circular) from xenTro10 genome:
# seqnames seqlengths isCircular genome
# chr1 217471166 FALSE xenTro10
# chr2 181034961 FALSE xenTro10
# chr3 153873357 FALSE xenTro10
# chr4 153961319 FALSE xenTro10
# chr5 164033575 FALSE xenTro10
# ... ... ... ...
# chrUn_NW_022279498v1 786 FALSE xenTro10
# chrUn_NW_022279499v1 755 FALSE xenTro10
# chrUn_NW_022279500v1 748 FALSE xenTro10
# chrUn_NW_022279501v1 593 FALSE xenTro10
# chrUn_NW_022279502v1 582 FALSE xenTro10
identical(seqinfo(txdb), getChromInfoFromUCSC("xenTro10", as.Seqinfo=TRUE))
# [1] TRUE
seqlevelsStyle(txdb)
# [1] "UCSC"
seqlevelsStyle(txdb) <- "NCBI"
seqinfo(txdb)
# Seqinfo object with 167 sequences (1 circular) from UCB_Xtro_10.0 genome:
# seqnames seqlengths isCircular genome
# Chr1 217471166 FALSE UCB_Xtro_10.0
# Chr2 181034961 FALSE UCB_Xtro_10.0
# Chr3 153873357 FALSE UCB_Xtro_10.0
# Chr4 153961319 FALSE UCB_Xtro_10.0
# Chr5 164033575 FALSE UCB_Xtro_10.0
# ... ... ... ...
# Sca152 786 FALSE UCB_Xtro_10.0
# Sca153 755 FALSE UCB_Xtro_10.0
# Sca154 748 FALSE UCB_Xtro_10.0
# Sca155 593 FALSE UCB_Xtro_10.0
# Sca156 582 FALSE UCB_Xtro_10.0
TxDb objects are objects that contain the positions of various genomic features of interest (e.g. genes, transcripts, exons) with respect to a given genome assembly. For example, we can extract the transcripts positions like this:
transcript_ranges <- transcripts(txdb)
transcript_ranges
# GRanges object with 8787 ranges and 2 metadata columns:
# seqnames ranges strand | tx_id tx_name
# <Rle> <IRanges> <Rle> | <integer> <character>
# [1] Chr1 362967-405411 + | 1 NM_001079275.1
# [2] Chr1 726791-735801 + | 2 NM_001197203.1
# [3] Chr1 6704711-6719765 + | 3 NM_001015807.1
# [4] Chr1 9775906-9796087 + | 4 NM_001030418.1
# [5] Chr1 12319747-12391429 + | 5 NM_001127950.1
# ... ... ... ... . ... ...
# [8783] Sca124 250-1299 - | 8783 NM_001142028.1
# [8784] Sca126 357-1342 - | 8784 NM_001016747.2
# [8785] Sca148 141-838 + | 8785 NM_001142219.1
# [8786] Sca151 187-803 + | 8786 NM_001113880.1
# [8787] Sca156 5-512 - | 8787 NM_001011360.1
# -------
# seqinfo: 167 sequences (1 circular) from UCB_Xtro_10.0 genome
Now let's say that a bioinformatician wants to obtain the DNA sequences for these transcripts. The standard way to do this in Bioconductor is to load the BSgenome data package that contains the DNA sequences of this assembly, and to use getSeq()
to extract the transcript sequences:
library(BSgenome.Xtropicalis.NCBI.UCBXtro10.0)
getSeq(Xtropicalis, transcript_ranges)
Does this work for you? What happens if we switch the sequence names in transcript_ranges
back to the UCSC names (with seqlevelsStyle(transcript_ranges) <- "UCSC"
). Does getSeq(Xtropicalis, transcript_ranges)
still work? What error do you get?
Thanks
Alright sir. Let me try them out now
I made a small correction above. I replaced getChromInfoFromUCSC("UCB_Xtro_10.0", as.Seqinfo=TRUE)
with getChromInfoFromNCBI("UCB_Xtro_10.0", as.Seqinfo=TRUE)
. UCB_Xtro_10.0
is the name of an NCBI assembly, not of a UCSC genome, so we need to use getChromInfoFromNCBI()
, not getChromInfoFromUCSC()
.
I made a small correction above. I replaced
getChromInfoFromUCSC("UCB_Xtro_10.0", as.Seqinfo=TRUE)
withgetChromInfoFromNCBI("UCB_Xtro_10.0", as.Seqinfo=TRUE)
.UCB_Xtro_10.0
is the name of an NCBI assembly, not of a UCSC genome, so we need to usegetChromInfoFromNCBI()
, notgetChromInfoFromUCSC()
.
Noted sir
Good day sir, when i ran the codes in 1-3 they worked well. when I ran
library(BSgenome.Xtropicalis.NCBI.UCBXtro10.0)
getSeq(Xtropicalis, transcript_ranges)
I got this:
getSeq(Xtropicalis, transcript_ranges)
DNAStringSet object of length 8787:
width seq
[1] 42445 AGGGTTTTCCGGGTGTAGCAAGATGGCCGCC...AAATGTACCCAATAAACACTAAGGATGCCAA
[2] 9011 CTCGCGTATAATTTCTCGCGCTGTGATTGGC...TTTGGGCCAATTAAAGTTTGTATATAATGTA
[3] 15055 GCCGGTGGTGCGGCCCTGGGAGAGGAGCGGC...GTGTCACACCAAATAAATGCCATCCAAGGTA
[4] 20182 AGGACCCTCCCACCTCTCCAACCTTCCAGCC...AATAAAGACTTCCATTTATTTTGGATCCTAA
[5] 71683 CTCAGGGGCCACACACCGCCGGGTCCCTCCT...ATCATAGAATTTTTGTATGAAAAAAAAAAAA
... ... ...
[8783] 1050 GGTTGGCCGCCCAAGATGCCGCTCCCATTGG...AAGTGGCCCCGTACCCCAAATCCAAAAAAGG
[8784] 986 TGCAGACGTACGCTAAGTTCTCCGGAGCCCC...GGGAGGAGGATGAGGATACGGAGAGGCAGGT
[8785] 698 GACAGACTGGTATCGGGGCGCCCAATCATCG...TCCAAAGGGGACAGCAGATATGTCTCGTGAG
[8786] 617 GGTGGGCGCCGTTGCCATGGAGAACTACAAG...ACTCGGGATTGTGGTACCAGAACTTGGGGGG
[8787] 508 CCTGCGGCGGGACGGCTACACGGTGCAGGTG...TGTTTGTGTGCTGTGCCAGCAGTAAGTGCCC
but when I ran
seqlevelsStyle(transcript_ranges) <- "UCSC"
getSeq(Xtropicalis, transcript_ranges)
I got this error:
Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width, :
sequence chr1 not found
In addition: Warning message:
In .merge_two_Seqinfo_objects(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)
This is the whole result from running 1-4
seqinfo(Xtropicalis)
Seqinfo object with 167 sequences (1 circular) from UCB_Xtro_10.0 genome:
seqnames seqlengths isCircular genome
Chr1 217471166 FALSE UCB_Xtro_10.0
Chr2 181034961 FALSE UCB_Xtro_10.0
Chr3 153873357 FALSE UCB_Xtro_10.0
Chr4 153961319 FALSE UCB_Xtro_10.0
Chr5 164033575 FALSE UCB_Xtro_10.0
... ... ... ...
Sca152 786 FALSE UCB_Xtro_10.0
Sca153 755 FALSE UCB_Xtro_10.0
Sca154 748 FALSE UCB_Xtro_10.0
Sca155 593 FALSE UCB_Xtro_10.0
Sca156 582 FALSE UCB_Xtro_10.0
> getChromInfoFromNCBI("UCB_Xtro_10.0", as.Seqinfo=TRUE)
Seqinfo object with 167 sequences (1 circular) from UCB_Xtro_10.0 genome:
seqnames seqlengths isCircular genome
Chr1 217471166 FALSE UCB_Xtro_10.0
Chr2 181034961 FALSE UCB_Xtro_10.0
Chr3 153873357 FALSE UCB_Xtro_10.0
Chr4 153961319 FALSE UCB_Xtro_10.0
Chr5 164033575 FALSE UCB_Xtro_10.0
... ... ... ...
Sca152 786 FALSE UCB_Xtro_10.0
Sca153 755 FALSE UCB_Xtro_10.0
Sca154 748 FALSE UCB_Xtro_10.0
Sca155 593 FALSE UCB_Xtro_10.0
Sca156 582 FALSE UCB_Xtro_10.0
> identical(seqinfo(Xtropicalis), getChromInfoFromNCBI("UCB_Xtro_10.0", as.Seqinfo=TRUE))
[1] TRUE
> isCircular(Xtropicalis)
Chr1 Chr2 Chr3 Chr4 Chr5 Chr6 Chr7 Chr8 Chr9 Chr10 MT
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Sca1 Sca2 Sca3 Sca4 Sca5 Sca6 Sca7 Sca8 Sca9 Sca10 Sca11
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca12 Sca13 Sca14 Sca15 Sca16 Sca17 Sca18 Sca19 Sca20 Sca21 Sca22
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca23 Sca24 Sca25 Sca26 Sca27 Sca28 Sca29 Sca30 Sca31 Sca32 Sca33
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca34 Sca35 Sca36 Sca37 Sca38 Sca39 Sca40 Sca41 Sca42 Sca43 Sca44
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca45 Sca46 Sca47 Sca48 Sca49 Sca50 Sca51 Sca52 Sca53 Sca54 Sca55
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca56 Sca57 Sca58 Sca59 Sca60 Sca61 Sca62 Sca63 Sca64 Sca65 Sca66
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca67 Sca68 Sca69 Sca70 Sca71 Sca72 Sca73 Sca74 Sca75 Sca76 Sca77
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca78 Sca79 Sca80 Sca81 Sca82 Sca83 Sca84 Sca85 Sca86 Sca87 Sca88
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca89 Sca90 Sca91 Sca92 Sca93 Sca94 Sca95 Sca96 Sca97 Sca98 Sca99
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca100 Sca101 Sca102 Sca103 Sca104 Sca105 Sca106 Sca107 Sca108 Sca109 Sca110
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca111 Sca112 Sca113 Sca114 Sca115 Sca116 Sca117 Sca118 Sca119 Sca120 Sca121
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca122 Sca123 Sca124 Sca125 Sca126 Sca127 Sca128 Sca129 Sca130 Sca131 Sca132
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca133 Sca134 Sca135 Sca136 Sca137 Sca138 Sca139 Sca140 Sca141 Sca142 Sca143
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca144 Sca145 Sca146 Sca147 Sca148 Sca149 Sca150 Sca151 Sca152 Sca153 Sca154
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Sca155 Sca156
FALSE FALSE
> seqlevelsStyle(Xtropicalis) # NCBI
[1] "NCBI"
> seqlevelsStyle(Xtropicalis) <- "UCSC"
> seqinfo(Xtropicalis)
Seqinfo object with 167 sequences (1 circular) from xenTro10 genome:
seqnames seqlengths isCircular genome
chr1 217471166 FALSE xenTro10
chr2 181034961 FALSE xenTro10
chr3 153873357 FALSE xenTro10
chr4 153961319 FALSE xenTro10
chr5 164033575 FALSE xenTro10
... ... ... ...
chrUn_NW_022279498v1 786 FALSE xenTro10
chrUn_NW_022279499v1 755 FALSE xenTro10
chrUn_NW_022279500v1 748 FALSE xenTro10
chrUn_NW_022279501v1 593 FALSE xenTro10
chrUn_NW_022279502v1 582 FALSE xenTro10
> seqlevelsStyle(Xtropicalis) # UCSC
[1] "UCSC"
> seqlevelsStyle(Xtropicalis) <- "NCBI"
> seqinfo(Xtropicalis)
Seqinfo object with 167 sequences (1 circular) from UCB_Xtro_10.0 genome:
seqnames seqlengths isCircular genome
Chr1 217471166 FALSE UCB_Xtro_10.0
Chr2 181034961 FALSE UCB_Xtro_10.0
Chr3 153873357 FALSE UCB_Xtro_10.0
Chr4 153961319 FALSE UCB_Xtro_10.0
Chr5 164033575 FALSE UCB_Xtro_10.0
... ... ... ...
Sca152 786 FALSE UCB_Xtro_10.0
Sca153 755 FALSE UCB_Xtro_10.0
Sca154 748 FALSE UCB_Xtro_10.0
Sca155 593 FALSE UCB_Xtro_10.0
Sca156 582 FALSE UCB_Xtro_10.0
> library(GenomicFeatures)
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
> txdb <- makeTxDbFromUCSC("xenTro10", tablename="ncbiRefSeqCurated")
Download the ncbiRefSeqCurated table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_cds_locs_from_UCSC_txtable(ucsc_txtable) :
UCSC data anomaly in 61 transcript(s): the cds cumulative length is not
a multiple of 3 for transcripts ‘NM_203735.1’ ‘NM_001141992.1’
‘NM_001045606.1’ ‘NM_203834.1’ ‘NM_001097185.1’ ‘NM_001126615.2’
‘NM_001142155.1’ ‘NM_001113115.1’ ‘NM_001252199.1’ ‘NM_001045688.1’
‘NM_001127118.1’ ‘NM_001128050.1’ ‘NM_001011298.1’ ‘NM_001112929.1’
‘NM_001114246.1’ ‘NM_001032305.1’ ‘NM_001017252.2’ ‘NM_001016566.2’
‘NM_001004844.1’ ‘NM_001079461.1’ ‘NM_001127053.1’ ‘NM_203535.1’
‘NM_001114037.1’ ‘NM_001172058.1’ ‘NM_001001231.1’ ‘NM_001079340.1’
‘NM_001126649.1’ ‘NM_001126752.1’ ‘NM_001015787.1’ ‘NM_001126604.1’
‘NM_001126982.1’ ‘NM_001114203.1’ ‘NM_001078998.1’ ‘NM_001001238.1’
‘NM_203727.1’ ‘NM_001102766.1’ ‘NM_001016417.2’ ‘NM_001172020.1’
‘NM_001097239.1’ ‘NM_001008172.1’ ‘NM_001016776.2’ ‘YP_203370.1’
‘YP [... truncated]
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: xenTro10
# Organism: Xenopus tropicalis
# Taxonomy ID: 8364
# UCSC Table: ncbiRefSeqCurated
# UCSC Track: NCBI RefSeq
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: no gene ids
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 8787
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-11-01 09:46:00 +0100 (Tue, 01 Nov 2022)
# GenomicFeatures version at creation time: 1.49.7
# RSQLite version at creation time: 2.2.18
# DBSCHEMAVERSION: 1.2
> seqinfo(txdb)
Seqinfo object with 167 sequences (1 circular) from xenTro10 genome:
seqnames seqlengths isCircular genome
chr1 217471166 FALSE xenTro10
chr2 181034961 FALSE xenTro10
chr3 153873357 FALSE xenTro10
chr4 153961319 FALSE xenTro10
chr5 164033575 FALSE xenTro10
... ... ... ...
chrUn_NW_022279498v1 786 FALSE xenTro10
chrUn_NW_022279499v1 755 FALSE xenTro10
chrUn_NW_022279500v1 748 FALSE xenTro10
chrUn_NW_022279501v1 593 FALSE xenTro10
chrUn_NW_022279502v1 582 FALSE xenTro10
> identical(seqinfo(txdb), getChromInfoFromUCSC("xenTro10", as.Seqinfo=TRUE))
[1] TRUE
> seqlevelsStyle(txdb)
[1] "UCSC"
> seqlevelsStyle(txdb) <- "NCBI"
> seqinfo(txdb)
Seqinfo object with 167 sequences (1 circular) from UCB_Xtro_10.0 genome:
seqnames seqlengths isCircular genome
Chr1 217471166 FALSE UCB_Xtro_10.0
Chr2 181034961 FALSE UCB_Xtro_10.0
Chr3 153873357 FALSE UCB_Xtro_10.0
Chr4 153961319 FALSE UCB_Xtro_10.0
Chr5 164033575 FALSE UCB_Xtro_10.0
... ... ... ...
Sca152 786 FALSE UCB_Xtro_10.0
Sca153 755 FALSE UCB_Xtro_10.0
Sca154 748 FALSE UCB_Xtro_10.0
Sca155 593 FALSE UCB_Xtro_10.0
Sca156 582 FALSE UCB_Xtro_10.0
> transcript_ranges <- transcripts(txdb)
> transcript_ranges
GRanges object with 8787 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] Chr1 362967-405411 + | 1 NM_001079275.1
[2] Chr1 726791-735801 + | 2 NM_001197203.1
[3] Chr1 6704711-6719765 + | 3 NM_001015807.1
[4] Chr1 9775906-9796087 + | 4 NM_001030418.1
[5] Chr1 12319747-12391429 + | 5 NM_001127950.1
... ... ... ... . ... ...
[8783] Sca124 250-1299 - | 8783 NM_001142028.1
[8784] Sca126 357-1342 - | 8784 NM_001016747.2
[8785] Sca148 141-838 + | 8785 NM_001142219.1
[8786] Sca151 187-803 + | 8786 NM_001113880.1
[8787] Sca156 5-512 - | 8787 NM_001011360.1
-------
seqinfo: 167 sequences (1 circular) from UCB_Xtro_10.0 genome
> getSeq(Xtropicalis, transcript_ranges)
DNAStringSet object of length 8787:
width seq
[1] 42445 AGGGTTTTCCGGGTGTAGCAAGATGGCCGCC...AAATGTACCCAATAAACACTAAGGATGCCAA
[2] 9011 CTCGCGTATAATTTCTCGCGCTGTGATTGGC...TTTGGGCCAATTAAAGTTTGTATATAATGTA
[3] 15055 GCCGGTGGTGCGGCCCTGGGAGAGGAGCGGC...GTGTCACACCAAATAAATGCCATCCAAGGTA
[4] 20182 AGGACCCTCCCACCTCTCCAACCTTCCAGCC...AATAAAGACTTCCATTTATTTTGGATCCTAA
[5] 71683 CTCAGGGGCCACACACCGCCGGGTCCCTCCT...ATCATAGAATTTTTGTATGAAAAAAAAAAAA
... ... ...
[8783] 1050 GGTTGGCCGCCCAAGATGCCGCTCCCATTGG...AAGTGGCCCCGTACCCCAAATCCAAAAAAGG
[8784] 986 TGCAGACGTACGCTAAGTTCTCCGGAGCCCC...GGGAGGAGGATGAGGATACGGAGAGGCAGGT
[8785] 698 GACAGACTGGTATCGGGGCGCCCAATCATCG...TCCAAAGGGGACAGCAGATATGTCTCGTGAG
[8786] 617 GGTGGGCGCCGTTGCCATGGAGAACTACAAG...ACTCGGGATTGTGGTACCAGAACTTGGGGGG
[8787] 508 CCTGCGGCGGGACGGCTACACGGTGCAGGTG...TGTTTGTGTGCTGTGCCAGCAGTAAGTGCCC
> seqlevelsStyle(transcript_ranges) <- "UCSC"
> getSeq(Xtropicalis, transcript_ranges)
Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width, :
sequence chr1 not found
In addition: Warning message:
In .merge_two_Seqinfo_objects(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)
Perfect. Everything looks as expected. Even the error you got with your last call to getSeq(Xtropicalis, transcript_ranges)
.
It's important to understand that you got this error because the sequence names in Xtropicalis
use a different convention than the sequence names in transcript_ranges
. The former uses the NCBI names and the latter the UCSC names. This is a frequent situation in biocinformatics and a constant source of frustration. The analyst often needs to harmonize the sequence names used in various datasets as a preliminary step to their analysis.
This harmonization of the sequence names is very easy to do in Bioconductor. We do it by calling the seqlevelsStyle()
setter on one of the two objects. So in the above example, we could do this harmonization with:
seqlevelsStyle(Xtropicalis) <- seqlevelsStyle(transcript_ranges)
or with:
seqlevelsStyle(transcript_ranges) <- seqlevelsStyle(Xtropicalis)
It works both ways.
So try this again:
library(BSgenome.Xtropicalis.NCBI.UCBXtro10.0)
library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("xenTro10", tablename="ncbiRefSeqCurated")
transcript_ranges <- transcripts(txdb)
getSeq(Xtropicalis, transcript_ranges) # error!
and solve the problem by harmonizing the sequence names in Xtropicalis
and transcript_ranges
.
Also, on a slightly different topic, please take a look at my long comment here where I try to explain how the code in the fasta_to_sorted_2bit_for_Felis_catus_9.0.R
script works. The code in your fasta_to_sorted_2bit_for_UCB_Xtro_10.0.R
script works the same way so my explanations apply to your script too.
Be sure to let me know if you have questions.
Perfect. Everything looks as expected. Even the error you got with your last call to
getSeq(Xtropicalis, transcript_ranges)
.It's important to understand that you got this error because the sequence names in
Xtropicalis
use a different convention than the sequence names intranscript_ranges
. The former uses the NCBI names and the latter the UCSC names. This is a frequent situation in biocinformatics and a constant source of frustration. The analyst often needs to harmonize the sequence names used in various datasets as a preliminary step to their analysis.This harmonization of the sequence names is very easy to do in Bioconductor. We do it by calling the
seqlevelsStyle()
setter on one of the two objects. So in the above example, we could do this harmonization with:seqlevelsStyle(Xtropicalis) <- seqlevelsStyle(transcript_ranges)
or with:
seqlevelsStyle(transcript_ranges) <- seqlevelsStyle(Xtropicalis)
It works both ways.
So try this again:
library(BSgenome.Xtropicalis.NCBI.UCBXtro10.0) library(GenomicFeatures) txdb <- makeTxDbFromUCSC("xenTro10", tablename="ncbiRefSeqCurated") transcript_ranges <- transcripts(txdb) getSeq(Xtropicalis, transcript_ranges) # error!
Yes sir, it works now. I just tried it. Below is the result I got:
> txdb <- makeTxDbFromUCSC("xenTro10", tablename="ncbiRefSeqCurated")
Download the ncbiRefSeqCurated table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_cds_locs_from_UCSC_txtable(ucsc_txtable) :
UCSC data anomaly in 61 transcript(s): the cds cumulative length is not
a multiple of 3 for transcripts ‘NM_203735.1’ ‘NM_001141992.1’
‘NM_001045606.1’ ‘NM_203834.1’ ‘NM_001097185.1’ ‘NM_001126615.2’
‘NM_001142155.1’ ‘NM_001113115.1’ ‘NM_001252199.1’ ‘NM_001045688.1’
‘NM_001127118.1’ ‘NM_001128050.1’ ‘NM_001011298.1’ ‘NM_001112929.1’
‘NM_001114246.1’ ‘NM_001032305.1’ ‘NM_001017252.2’ ‘NM_001016566.2’
‘NM_001004844.1’ ‘NM_001079461.1’ ‘NM_001127053.1’ ‘NM_203535.1’
‘NM_001114037.1’ ‘NM_001172058.1’ ‘NM_001001231.1’ ‘NM_001079340.1’
‘NM_001126649.1’ ‘NM_001126752.1’ ‘NM_001015787.1’ ‘NM_001126604.1’
‘NM_001126982.1’ ‘NM_001114203.1’ ‘NM_001078998.1’ ‘NM_001001238.1’
‘NM_203727.1’ ‘NM_001102766.1’ ‘NM_001016417.2’ ‘NM_001172020.1’
‘NM_001097239.1’ ‘NM_001008172.1’ ‘NM_001016776.2’ ‘YP_203370.1’
‘YP [... truncated]
> transcript_ranges <- transcripts(txdb)
> seqlevelsStyle(Xtropicalis) <- seqlevelsStyle(transcript_ranges)
> getSeq(Xtropicalis, transcript_ranges)
DNAStringSet object of length 8787:
width seq
[1] 42445 AGGGTTTTCCGGGTGTAGCAAGATGGCCGCC...AAATGTACCCAATAAACACTAAGGATGCCAA
[2] 9011 CTCGCGTATAATTTCTCGCGCTGTGATTGGC...TTTGGGCCAATTAAAGTTTGTATATAATGTA
[3] 15055 GCCGGTGGTGCGGCCCTGGGAGAGGAGCGGC...GTGTCACACCAAATAAATGCCATCCAAGGTA
[4] 20182 AGGACCCTCCCACCTCTCCAACCTTCCAGCC...AATAAAGACTTCCATTTATTTTGGATCCTAA
[5] 71683 CTCAGGGGCCACACACCGCCGGGTCCCTCCT...ATCATAGAATTTTTGTATGAAAAAAAAAAAA
... ... ...
[8783] 1050 GGTTGGCCGCCCAAGATGCCGCTCCCATTGG...AAGTGGCCCCGTACCCCAAATCCAAAAAAGG
[8784] 986 TGCAGACGTACGCTAAGTTCTCCGGAGCCCC...GGGAGGAGGATGAGGATACGGAGAGGCAGGT
[8785] 698 GACAGACTGGTATCGGGGCGCCCAATCATCG...TCCAAAGGGGACAGCAGATATGTCTCGTGAG
[8786] 617 GGTGGGCGCCGTTGCCATGGAGAACTACAAG...ACTCGGGATTGTGGTACCAGAACTTGGGGGG
[8787] 508 CCTGCGGCGGGACGGCTACACGGTGCAGGTG...TGTTTGTGTGCTGTGCCAGCAGTAAGTGCCC
>
I think everything is working okay now. Thank you
and solve the problem by harmonizing the sequence names in
Xtropicalis
andtranscript_ranges
.Also, on a slightly different topic, please take a look at my long comment here where I try to explain how the code in the
fasta_to_sorted_2bit_for_Felis_catus_9.0.R
script works. The code in yourfasta_to_sorted_2bit_for_UCB_Xtro_10.0.R
script works the same way so my explanations apply to your script too.Be sure to let me know if you have questions.
Alright sir, I will. Thank you.
Great. I'm closing this.
Please remember to record your contribution on Outreachy.
We're approaching the end of the application period so there's not much time left. If you want to work on a last task, please choose one from the "Additional tasks" section at the bottom of this page. Note that the first two are already assigned. Thanks!
Alright, sir. Thank you. I am preparing my final application current.
This task depends on this issue and issue #40 being completed first (i.e. PRs accepted and merged, and issues closed). Although it's not a requirement that the 3 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.
BSgenome data packages are one of the many types of annotation packages available in Bioconductor. They contain the genomic sequences, which comprise chromosome sequences and other DNA sequences, of a particular genome assembly for a given organism. For example BSgenome.Amellifera.NCBI.AmelHAv3.1 is a BSgenome data package that contains the genomic sequences of NCBI assembly
Amel_HAv3.1
(RefSeq assembly accession:GCF_003254395.2
). Users can easily and efficiently access the sequences, or portions of the sequences, stored in these packages, via a common API implemented in the BSgenome software package.This task's goal is to make a new BSgenome data package for NCBI assembly
UCB_Xtro_10.0
. The process of making such package is partially documented in the "How to forge a BSgenome data package" vignette from the BSgenome software package. The landing page for the BSgenome package contains a link to this vignette.Here's some additional information
For an NCBI assembly, the sequence data can be found in the form of compressed FASTA files located in the FTP directory for RefSeq assembly or FTP directory for GenBank assembly associated with the assembly. You'll find the links to those FTP directories on the right side of the NCBI page of the
UCB_Xtro_10.0
assembly, below the link to the Full sequence report.Note that NCBI uses the
.fna
suffix for uncompressed FASTA files (UCSC uses the.fa
suffix). For compressed FASTA files, NCBI and UCSC both add the.gz
suffix to the.fna
or.fa
suffix. As you will see, NCBI can provide several.fna.gz
files in the FTP directory for RefSeq assembly or FTP directory for GenBank assembly.The file that contains all the DNA sequences for the assembly is expected to have a name that looks like this:
<RefSeq assembly accession>_<assembly name>_genomic.fna.gz
in the FTP directory for RefSeq assembly, and<GenBank assembly accession>_<assembly name>_genomic.fna.gz
in the FTP directory for GenBank assembly.How you will proceed
The first thing you should do is download the appropriate FASTA file. Choose the RefSeq one if
UCB_Xtro_10.0
was registered with its RefSeq assembly accession, or the GenBank one if it was registered with its GenBank assembly accession.Then load it into R with the
readDNAStringSet()
function from the Biostrings package. The object returned byreadDNAStringSet()
is a DNAStringSet object. Take a look at the names on this object (callnames()
on the object to extract them, or callhead(names())
to look at the 6 first names only). As you can see, these names are long and a little bit cryptic. They don't exactly look like the sequence names returned bygetChromInfoFromNCBI("UCB_Xtro_10.0")
:cry: We'll take care of this below.The
fasta_to_sorted_2bit.R
scriptAs we know, the forging process only supports the two following options for the input sequence data: a single 2bit file, or a collection of FASTA files with one file per sequence. This means that we won't be able to use the
<assembly accession>_<assembly name>_genomic.fna.gz
file as-is. We will first need to preprocess the file as follow: (1) either convert the file to the 2bit format, or (2) split the file into a collection of smaller FASTA files where each file contains only one sequence. It is recommended to do (1).Here are some examples of R scripts that have been used to perform this conversion for other NCBI assemblies:
Amel_HAv3.1
:inst/extdata/GentlemanLab/BSgenome.Amellifera.NCBI.AmelHAv3.1-tools/fasta_to_sorted_2bit.R
Macaca_fascicularis_6.0
:inst/extdata/GentlemanLab/BSgenome.Mfascicularis.NCBI.6.0-tools/fasta_to_sorted_2bit.R
Note that these scripts also take care of two important things:
readDNAStringSet()
with the sequence names returned bygetChromInfoFromNCBI()
.getChromInfoFromNCBI()
.Choose one script, download it or copy it to your computer, rename it
fasta_to_sorted_2bit_for_UCB_Xtro_10.0.R
, and adapt it to make it work for your<assembly accession>_<assembly name>_genomic.fna.gz
file. When you work on this, it will help you if you try to execute the script line by line, and take the time to study what each line is doing. Don't hesitate to inspect each intermediate result. Name the output file produced by your scriptUCB_Xtro_10.0.sorted.2bit
.Once the script is ready, please add it to your fork of the BSgenome package, in the
inst/extdata/Outreachy/
folder.The seed file
The seed file that you're going to prepare for the forging process should point to the 2bit file that you produced with your
fasta_to_sorted_2bit.R
script above.Prepare the file and use it to forge the BSgenome data package.
Test the new package like you did with BSgenome.Xtropicalis.UCSC.xenTro10. That is: install it, start R, load the package, and play around with it. Proceed with some "ad hoc manual testing". Once everything behaves and looks as expected, run
R CMD build
andR CMD check
on the package. Note thatR CMD check
should always be run on the source tarball produced byR CMD build
.Once you are confident that your new BSgenome data package works as expected, please add your seed file to your fork of the BSgenome package, in the
inst/extdata/Outreachy/
folder.Finally commit, push, and submit a PR.