Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome
7 stars 9 forks source link

Proposed task for Outreachy applicants: Forge BSgenome data package for NCBI assembly UCB_Xtro_10.0 #41

Closed hpages closed 1 year ago

hpages commented 1 year ago

This task depends on this issue and issue #40 being completed first (i.e. PRs accepted and merged, and issues closed). Although it's not a requirement that the 3 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.

BSgenome data packages are one of the many types of annotation packages available in Bioconductor. They contain the genomic sequences, which comprise chromosome sequences and other DNA sequences, of a particular genome assembly for a given organism. For example BSgenome.Amellifera.NCBI.AmelHAv3.1 is a BSgenome data package that contains the genomic sequences of NCBI assembly Amel_HAv3.1 (RefSeq assembly accession: GCF_003254395.2). Users can easily and efficiently access the sequences, or portions of the sequences, stored in these packages, via a common API implemented in the BSgenome software package.

This task's goal is to make a new BSgenome data package for NCBI assembly UCB_Xtro_10.0. The process of making such package is partially documented in the "How to forge a BSgenome data package" vignette from the BSgenome software package. The landing page for the BSgenome package contains a link to this vignette.

Here's some additional information

For an NCBI assembly, the sequence data can be found in the form of compressed FASTA files located in the FTP directory for RefSeq assembly or FTP directory for GenBank assembly associated with the assembly. You'll find the links to those FTP directories on the right side of the NCBI page of the UCB_Xtro_10.0 assembly, below the link to the Full sequence report.

Note that NCBI uses the .fna suffix for uncompressed FASTA files (UCSC uses the .fa suffix). For compressed FASTA files, NCBI and UCSC both add the .gz suffix to the .fna or .fa suffix. As you will see, NCBI can provide several .fna.gz files in the FTP directory for RefSeq assembly or FTP directory for GenBank assembly.

The file that contains all the DNA sequences for the assembly is expected to have a name that looks like this: <RefSeq assembly accession>_<assembly name>_genomic.fna.gz in the FTP directory for RefSeq assembly, and <GenBank assembly accession>_<assembly name>_genomic.fna.gz in the FTP directory for GenBank assembly.

How you will proceed

The first thing you should do is download the appropriate FASTA file. Choose the RefSeq one if UCB_Xtro_10.0 was registered with its RefSeq assembly accession, or the GenBank one if it was registered with its GenBank assembly accession.

Then load it into R with the readDNAStringSet() function from the Biostrings package. The object returned by readDNAStringSet() is a DNAStringSet object. Take a look at the names on this object (call names() on the object to extract them, or call head(names()) to look at the 6 first names only). As you can see, these names are long and a little bit cryptic. They don't exactly look like the sequence names returned by getChromInfoFromNCBI("UCB_Xtro_10.0") :cry: We'll take care of this below.

The fasta_to_sorted_2bit.R script

As we know, the forging process only supports the two following options for the input sequence data: a single 2bit file, or a collection of FASTA files with one file per sequence. This means that we won't be able to use the <assembly accession>_<assembly name>_genomic.fna.gz file as-is. We will first need to preprocess the file as follow: (1) either convert the file to the 2bit format, or (2) split the file into a collection of smaller FASTA files where each file contains only one sequence. It is recommended to do (1).

Here are some examples of R scripts that have been used to perform this conversion for other NCBI assemblies:

Note that these scripts also take care of two important things:

  1. They rename the sequences, that is, they replace the long and cryptic names on the DNAStringSet object returned by readDNAStringSet() with the sequence names returned by getChromInfoFromNCBI().
  2. They put the sequences in the same order as in the result returned by getChromInfoFromNCBI().

Choose one script, download it or copy it to your computer, rename it fasta_to_sorted_2bit_for_UCB_Xtro_10.0.R, and adapt it to make it work for your <assembly accession>_<assembly name>_genomic.fna.gz file. When you work on this, it will help you if you try to execute the script line by line, and take the time to study what each line is doing. Don't hesitate to inspect each intermediate result. Name the output file produced by your script UCB_Xtro_10.0.sorted.2bit.

Once the script is ready, please add it to your fork of the BSgenome package, in the inst/extdata/Outreachy/ folder.

The seed file

The seed file that you're going to prepare for the forging process should point to the 2bit file that you produced with your fasta_to_sorted_2bit.R script above.

Prepare the file and use it to forge the BSgenome data package.

Test the new package like you did with BSgenome.Xtropicalis.UCSC.xenTro10. That is: install it, start R, load the package, and play around with it. Proceed with some "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.

Once you are confident that your new BSgenome data package works as expected, please add your seed file to your fork of the BSgenome package, in the inst/extdata/Outreachy/ folder.

Finally commit, push, and submit a PR.

Simplecodez commented 1 year ago

Please assign me to this task. Thank you

hpages commented 1 year ago

Done.

Don't forget to record #40 on Outreachy.

Simplecodez commented 1 year ago

Done.

Don't forget to record #40 on Outreachy.

I will do that sir. Thank you

Simplecodez commented 1 year ago

Sir, I have been able to create the UCB_Xtro_10.0.sorted.2bit file and forged the package, but when I run R CMD build BSgenome.Xtropicalis.NCBI.UCB_Xtro_10.0, I get the error below.


* preparing ‘BSgenome.Xtropicalis.NCBI.UCB_Xtro_10.0’:
* checking DESCRIPTION meta-information ... ERROR
Malformed package name```
Simplecodez commented 1 year ago

Below is the content of my seed file:


Package: BSgenome.Xtropicalis.NCBI.UCB_Xtro_10.0
Title: Full genome sequences for Xenopos tropicalis (NCBI version UCB_Xtro_10.0)
Description: Full genome sequences for Xenopus tropicalis as provided by NCBI (UCB_Xtro_10.0, 2019/11/14) and stored in Biostrings objects.
Version: 0.1.0
organism: Xenopus tropicalis
common_name: Tropical clawed frog
genome: UCB_Xtro_10.0
provider: NCBI
release_date: 2019/11/14
source_url: https://www.ncbi.nlm.nih.gov/assembly/GCF_000004195.4/
organism_biocview: Xenopus_tropicalis
BSgenomeObjname: Xtropicalis
SrcDataFiles: GCF_000004195.4_UCB_Xtro_10.0_genomic.fna.gz from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/004/195/GCF_000004195.4_UCB_Xtro_10.0/
PkgExamples: genome$chr1 # same as genome[["chr1"]]
seqs_srcdir: /home/emmanuel/seqdatafile
seqfile_name: UCB_Xtro_10.0.sorted.2bit```.
Simplecodez commented 1 year ago

I tried changing the name of the seed file to BSgenome.Xtropicalis.NCBI.10.0-seed and changing the package name to BSgenome.Xtropicalis.NCBI.10.0 in the seed but I dont know if the name follows the NCBI standard. Nonetheless, it was able to build successfully but when i ran R CMD check on the tarball produced from building, i got the error below:

checking examples ... ERROR
Running examples in ‘BSgenome.Xtropicalis.NCBI.10.0-Ex.R’ failed
The error most likely occurred in:

> ### Name: BSgenome.Xtropicalis.NCBI.10.0
> ### Title: Full genome sequences for Xenopos tropicalis (NCBI version
> ###   UCB_Xtro_10.0)
> ### Aliases: BSgenome.Xtropicalis.NCBI.10.0-package
> ###   BSgenome.Xtropicalis.NCBI.10.0 Xtropicalis
> ### Keywords: package data
> 
> ### ** Examples
> 
> BSgenome.Xtropicalis.NCBI.10.0
Tropical clawed frog genome:
# organism: Xenopus tropicalis (Tropical clawed frog)
# genome: UCB_Xtro_10.0
# provider: NCBI
# release date: 2019/11/14
# 167 sequences:
#   Chr1   Chr2   Chr3   Chr4   Chr5   Chr6   Chr7   Chr8   Chr9   Chr10  MT    
#   Sca1   Sca2   Sca3   Sca4   Sca5   Sca6   Sca7   Sca8   Sca9   Sca10  Sca11 
#   Sca12  Sca13  Sca14  Sca15  Sca16  Sca17  Sca18  Sca19  Sca20  Sca21  Sca22 
#   Sca23  Sca24  Sca25  Sca26  Sca27  Sca28  Sca29  Sca30  Sca31  Sca32  Sca33 
#   Sca34  Sca35  Sca36  Sca37  Sca38  Sca39  Sca40  Sca41  Sca42  Sca43  Sca44 
#   ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
#   Sca111 Sca112 Sca113 Sca114 Sca115 Sca116 Sca117 Sca118 Sca119 Sca120 Sca121
#   Sca122 Sca123 Sca124 Sca125 Sca126 Sca127 Sca128 Sca129 Sca130 Sca131 Sca132
#   Sca133 Sca134 Sca135 Sca136 Sca137 Sca138 Sca139 Sca140 Sca141 Sca142 Sca143
#   Sca144 Sca145 Sca146 Sca147 Sca148 Sca149 Sca150 Sca151 Sca152 Sca153 Sca154
#   Sca155 Sca156                                                               
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> genome <- BSgenome.Xtropicalis.NCBI.10.0
> head(seqlengths(genome))
     Chr1      Chr2      Chr3      Chr4      Chr5      Chr6 
217471166 181034961 153873357 153961319 164033575 154486312 
> genome[["chr1"]]
Error in genome[["chr1"]] : no such sequence
Calls: [[ -> [[
Execution halted
hpages commented 1 year ago

Hi @Simplecodez

The error you got earlier (Malformed package name) was because you had underscores (_) in the name of the package: BSgenome.Xtropicalis.NCBI.UCB_Xtro_10.0. R does not allow that. In this case, you can just remove the underscores from the package name. This will lead to the following name: BSgenome.Xtropicalis.NCBI.UCBXtro10.0.

Note that we try to have the last part in the package name to be the same as the assembly name. However, as you can see, it's not always possible. In the case of BSgenome.Xtropicalis.NCBI.UCBXtro10.0, the last part is close but not exactly equal to the assembly name. It's ok to do that. The package name is just a name and not using the exact assembly name for its last part doesn't have negative consequences in practice.

The new error you get (Error in genome[["chr1"]] : no such sequence) should be self-explaining. It tells you that the genome object does not contain the chr1 sequence. You need to use a valid sequence name in your example, that is, a sequence name that is in the UCB_Xtro_10.0 assembly.

Hope this helps

Simplecodez commented 1 year ago

Hi @Simplecodez

The error you got earlier (Malformed package name) was because you had underscores (_) in the name of the package: BSgenome.Xtropicalis.NCBI.UCB_Xtro_10.0. R does not allow that. In this case, you can just remove the underscores from the package name. This will lead to the following name: BSgenome.Xtropicalis.NCBI.UCBXtro10.0.

Note that we try to have the last part in the package name to be the same as the assembly name. However, as you can see, it's not always possible. In the case of BSgenome.Xtropicalis.NCBI.UCBXtro10.0, the last part is close but not exactly equal to the assembly name. It's ok to do that. The package name is just a name and not using the exact assembly name for its last part doesn't have negative consequences in practice.

The new error you get (Error in genome[["chr1"]] : no such sequence) should be self-explaining. It tells you that the genome object does not contain the chr1 sequence. You need to use a valid sequence name in your example, that is, a sequence name that is in the UCB_Xtro_10.0 assembly.

Hope this helps

Yes, it did. I renamed it to BSgenome.Xtropicalis.NCBI.UCBXtro10.0 and changed "c" to "C" in genome[["chr1"]] Thank you

Simplecodez commented 1 year ago

Now everything works fine. I have also created a PR. Thank you

hpages commented 1 year ago

Great!

Here are a few things that I'd like you to try.

  1. After loading the BSgenome.Xtropicalis.NCBI.UCBXtro10.0 package, do seqinfo(Xtropicalis). This should return exactly the same thing as getChromInfoFromNCBI("UCB_Xtro_10.0", as.Seqinfo=TRUE). You can check this with:

    identical(seqinfo(Xtropicalis), getChromInfoFromNCBI("UCB_Xtro_10.0", as.Seqinfo=TRUE))
  2. Let's take a closer look at the circularity flags of the sequences in the Xtropicalis object. You can extract those flags with isCircular(Xtropicalis) (this is a shortcut for isCircular(seqinfo(Xtropicalis))). Do they look as expected? Note that the vector returned by isCircular(Xtropicalis) is a named logical vector i.e. a logical vector with names on it. You can call which() on that vector to extract the positions of the TRUE values. See ?which in the base package for more information.

  3. Right now the sequences in the Xtropicalis object (a BSgenome object) are named according to the NCBI naming convention. We can see this with:

    seqlevelsStyle(Xtropicalis)  # NCBI

    However, the sequences can easily be renamed to use the UCSC names. Try:

    seqlevelsStyle(Xtropicalis) <- "UCSC"
    seqinfo(Xtropicalis)

    This works because of all the work you've done with the registration of xenTro10 and UCB_Xtro_10.0, and because you linked them together.

    You should be able to switch back to the NCBI names with:

    seqlevelsStyle(Xtropicalis)  # UCSC
    seqlevelsStyle(Xtropicalis) <- "NCBI"
    seqinfo(Xtropicalis)

    The ability to switch easily from one naming convention to the other is a very popular feature in Bioconductor. There has always been a strong demand for it. getChromInfoFromUCSC(.., map.NCBI=TRUE) is the workhorse behind it.

    Note that our ability to rename sequences with the seqlevelsStyle() setter is not restricted to BSgenome objects. It also works with other objects in Bioconductor, like TxDb objects:

    library(GenomicFeatures)
    txdb <- makeTxDbFromUCSC("xenTro10", tablename="ncbiRefSeqCurated")
    txdb
    # TxDb object:
    # Db type: TxDb
    # Supporting package: GenomicFeatures
    # Data source: UCSC
    # Genome: xenTro10
    # Organism: Xenopus tropicalis
    # Taxonomy ID: 8364
    # UCSC Table: ncbiRefSeqCurated
    # UCSC Track: NCBI RefSeq
    # Resource URL: http://genome.ucsc.edu/
    # Type of Gene ID: no gene ids
    # Full dataset: yes
    # miRBase build ID: NA
    # Nb of transcripts: 8787
    # Db created by: GenomicFeatures package from Bioconductor
    # Creation time: 2022-10-31 09:56:53 -0700 (Mon, 31 Oct 2022)
    # GenomicFeatures version at creation time: 1.49.8
    # RSQLite version at creation time: 2.2.18
    # DBSCHEMAVERSION: 1.2
    
    seqinfo(txdb)
    # Seqinfo object with 167 sequences (1 circular) from xenTro10 genome:
    #   seqnames             seqlengths isCircular   genome
    #   chr1                  217471166      FALSE xenTro10
    #   chr2                  181034961      FALSE xenTro10
    #   chr3                  153873357      FALSE xenTro10
    #   chr4                  153961319      FALSE xenTro10
    #   chr5                  164033575      FALSE xenTro10
    #   ...                         ...        ...      ...
    #   chrUn_NW_022279498v1        786      FALSE xenTro10
    #   chrUn_NW_022279499v1        755      FALSE xenTro10
    #   chrUn_NW_022279500v1        748      FALSE xenTro10
    #   chrUn_NW_022279501v1        593      FALSE xenTro10
    #   chrUn_NW_022279502v1        582      FALSE xenTro10
    
    identical(seqinfo(txdb), getChromInfoFromUCSC("xenTro10", as.Seqinfo=TRUE))
    # [1] TRUE
    
    seqlevelsStyle(txdb)
    # [1] "UCSC"
    
    seqlevelsStyle(txdb) <- "NCBI"
    
    seqinfo(txdb)
    # Seqinfo object with 167 sequences (1 circular) from UCB_Xtro_10.0 genome:
    #   seqnames seqlengths isCircular        genome
    #   Chr1      217471166      FALSE UCB_Xtro_10.0
    #   Chr2      181034961      FALSE UCB_Xtro_10.0
    #   Chr3      153873357      FALSE UCB_Xtro_10.0
    #   Chr4      153961319      FALSE UCB_Xtro_10.0
    #   Chr5      164033575      FALSE UCB_Xtro_10.0
    #   ...             ...        ...           ...
    #   Sca152          786      FALSE UCB_Xtro_10.0
    #   Sca153          755      FALSE UCB_Xtro_10.0
    #   Sca154          748      FALSE UCB_Xtro_10.0
    #   Sca155          593      FALSE UCB_Xtro_10.0
    #   Sca156          582      FALSE UCB_Xtro_10.0
  4. TxDb objects are objects that contain the positions of various genomic features of interest (e.g. genes, transcripts, exons) with respect to a given genome assembly. For example, we can extract the transcripts positions like this:

    transcript_ranges <- transcripts(txdb)
    transcript_ranges
    # GRanges object with 8787 ranges and 2 metadata columns:
    #          seqnames            ranges strand |     tx_id        tx_name
    #             <Rle>         <IRanges>  <Rle> | <integer>    <character>
    #      [1]     Chr1     362967-405411      + |         1 NM_001079275.1
    #      [2]     Chr1     726791-735801      + |         2 NM_001197203.1
    #      [3]     Chr1   6704711-6719765      + |         3 NM_001015807.1
    #      [4]     Chr1   9775906-9796087      + |         4 NM_001030418.1
    #      [5]     Chr1 12319747-12391429      + |         5 NM_001127950.1
    #      ...      ...               ...    ... .       ...            ...
    #   [8783]   Sca124          250-1299      - |      8783 NM_001142028.1
    #   [8784]   Sca126          357-1342      - |      8784 NM_001016747.2
    #   [8785]   Sca148           141-838      + |      8785 NM_001142219.1
    #   [8786]   Sca151           187-803      + |      8786 NM_001113880.1
    #   [8787]   Sca156             5-512      - |      8787 NM_001011360.1
    #   -------
    #   seqinfo: 167 sequences (1 circular) from UCB_Xtro_10.0 genome

    Now let's say that a bioinformatician wants to obtain the DNA sequences for these transcripts. The standard way to do this in Bioconductor is to load the BSgenome data package that contains the DNA sequences of this assembly, and to use getSeq() to extract the transcript sequences:

    library(BSgenome.Xtropicalis.NCBI.UCBXtro10.0)
    getSeq(Xtropicalis, transcript_ranges)

    Does this work for you? What happens if we switch the sequence names in transcript_ranges back to the UCSC names (with seqlevelsStyle(transcript_ranges) <- "UCSC"). Does getSeq(Xtropicalis, transcript_ranges) still work? What error do you get?

Thanks

Simplecodez commented 1 year ago

Alright sir. Let me try them out now

hpages commented 1 year ago

I made a small correction above. I replaced getChromInfoFromUCSC("UCB_Xtro_10.0", as.Seqinfo=TRUE) with getChromInfoFromNCBI("UCB_Xtro_10.0", as.Seqinfo=TRUE). UCB_Xtro_10.0 is the name of an NCBI assembly, not of a UCSC genome, so we need to use getChromInfoFromNCBI(), not getChromInfoFromUCSC().

Simplecodez commented 1 year ago

I made a small correction above. I replaced getChromInfoFromUCSC("UCB_Xtro_10.0", as.Seqinfo=TRUE) with getChromInfoFromNCBI("UCB_Xtro_10.0", as.Seqinfo=TRUE). UCB_Xtro_10.0 is the name of an NCBI assembly, not of a UCSC genome, so we need to use getChromInfoFromNCBI(), not getChromInfoFromUCSC().

Noted sir

Simplecodez commented 1 year ago

Good day sir, when i ran the codes in 1-3 they worked well. when I ran

library(BSgenome.Xtropicalis.NCBI.UCBXtro10.0)
getSeq(Xtropicalis, transcript_ranges)

I got this:

getSeq(Xtropicalis, transcript_ranges)
DNAStringSet object of length 8787:
         width seq
   [1]   42445 AGGGTTTTCCGGGTGTAGCAAGATGGCCGCC...AAATGTACCCAATAAACACTAAGGATGCCAA
   [2]    9011 CTCGCGTATAATTTCTCGCGCTGTGATTGGC...TTTGGGCCAATTAAAGTTTGTATATAATGTA
   [3]   15055 GCCGGTGGTGCGGCCCTGGGAGAGGAGCGGC...GTGTCACACCAAATAAATGCCATCCAAGGTA
   [4]   20182 AGGACCCTCCCACCTCTCCAACCTTCCAGCC...AATAAAGACTTCCATTTATTTTGGATCCTAA
   [5]   71683 CTCAGGGGCCACACACCGCCGGGTCCCTCCT...ATCATAGAATTTTTGTATGAAAAAAAAAAAA
   ...     ... ...
[8783]    1050 GGTTGGCCGCCCAAGATGCCGCTCCCATTGG...AAGTGGCCCCGTACCCCAAATCCAAAAAAGG
[8784]     986 TGCAGACGTACGCTAAGTTCTCCGGAGCCCC...GGGAGGAGGATGAGGATACGGAGAGGCAGGT
[8785]     698 GACAGACTGGTATCGGGGCGCCCAATCATCG...TCCAAAGGGGACAGCAGATATGTCTCGTGAG
[8786]     617 GGTGGGCGCCGTTGCCATGGAGAACTACAAG...ACTCGGGATTGTGGTACCAGAACTTGGGGGG
[8787]     508 CCTGCGGCGGGACGGCTACACGGTGCAGGTG...TGTTTGTGTGCTGTGCCAGCAGTAAGTGCCC

but when I ran

seqlevelsStyle(transcript_ranges) <- "UCSC"
getSeq(Xtropicalis, transcript_ranges)

I got this error:

Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width,  : 
  sequence chr1 not found
In addition: Warning message:
In .merge_two_Seqinfo_objects(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)
Simplecodez commented 1 year ago

This is the whole result from running 1-4

seqinfo(Xtropicalis)
Seqinfo object with 167 sequences (1 circular) from UCB_Xtro_10.0 genome:
  seqnames seqlengths isCircular        genome
  Chr1      217471166      FALSE UCB_Xtro_10.0
  Chr2      181034961      FALSE UCB_Xtro_10.0
  Chr3      153873357      FALSE UCB_Xtro_10.0
  Chr4      153961319      FALSE UCB_Xtro_10.0
  Chr5      164033575      FALSE UCB_Xtro_10.0
  ...             ...        ...           ...
  Sca152          786      FALSE UCB_Xtro_10.0
  Sca153          755      FALSE UCB_Xtro_10.0
  Sca154          748      FALSE UCB_Xtro_10.0
  Sca155          593      FALSE UCB_Xtro_10.0
  Sca156          582      FALSE UCB_Xtro_10.0
> getChromInfoFromNCBI("UCB_Xtro_10.0", as.Seqinfo=TRUE)
Seqinfo object with 167 sequences (1 circular) from UCB_Xtro_10.0 genome:
  seqnames seqlengths isCircular        genome
  Chr1      217471166      FALSE UCB_Xtro_10.0
  Chr2      181034961      FALSE UCB_Xtro_10.0
  Chr3      153873357      FALSE UCB_Xtro_10.0
  Chr4      153961319      FALSE UCB_Xtro_10.0
  Chr5      164033575      FALSE UCB_Xtro_10.0
  ...             ...        ...           ...
  Sca152          786      FALSE UCB_Xtro_10.0
  Sca153          755      FALSE UCB_Xtro_10.0
  Sca154          748      FALSE UCB_Xtro_10.0
  Sca155          593      FALSE UCB_Xtro_10.0
  Sca156          582      FALSE UCB_Xtro_10.0
> identical(seqinfo(Xtropicalis), getChromInfoFromNCBI("UCB_Xtro_10.0", as.Seqinfo=TRUE))
[1] TRUE
> isCircular(Xtropicalis)
  Chr1   Chr2   Chr3   Chr4   Chr5   Chr6   Chr7   Chr8   Chr9  Chr10     MT 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE 
  Sca1   Sca2   Sca3   Sca4   Sca5   Sca6   Sca7   Sca8   Sca9  Sca10  Sca11 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
 Sca12  Sca13  Sca14  Sca15  Sca16  Sca17  Sca18  Sca19  Sca20  Sca21  Sca22 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
 Sca23  Sca24  Sca25  Sca26  Sca27  Sca28  Sca29  Sca30  Sca31  Sca32  Sca33 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
 Sca34  Sca35  Sca36  Sca37  Sca38  Sca39  Sca40  Sca41  Sca42  Sca43  Sca44 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
 Sca45  Sca46  Sca47  Sca48  Sca49  Sca50  Sca51  Sca52  Sca53  Sca54  Sca55 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
 Sca56  Sca57  Sca58  Sca59  Sca60  Sca61  Sca62  Sca63  Sca64  Sca65  Sca66 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
 Sca67  Sca68  Sca69  Sca70  Sca71  Sca72  Sca73  Sca74  Sca75  Sca76  Sca77 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
 Sca78  Sca79  Sca80  Sca81  Sca82  Sca83  Sca84  Sca85  Sca86  Sca87  Sca88 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
 Sca89  Sca90  Sca91  Sca92  Sca93  Sca94  Sca95  Sca96  Sca97  Sca98  Sca99 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
Sca100 Sca101 Sca102 Sca103 Sca104 Sca105 Sca106 Sca107 Sca108 Sca109 Sca110 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
Sca111 Sca112 Sca113 Sca114 Sca115 Sca116 Sca117 Sca118 Sca119 Sca120 Sca121 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
Sca122 Sca123 Sca124 Sca125 Sca126 Sca127 Sca128 Sca129 Sca130 Sca131 Sca132 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
Sca133 Sca134 Sca135 Sca136 Sca137 Sca138 Sca139 Sca140 Sca141 Sca142 Sca143 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
Sca144 Sca145 Sca146 Sca147 Sca148 Sca149 Sca150 Sca151 Sca152 Sca153 Sca154 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE 
Sca155 Sca156 
 FALSE  FALSE 
> seqlevelsStyle(Xtropicalis)  # NCBI
[1] "NCBI"
> seqlevelsStyle(Xtropicalis) <- "UCSC"
> seqinfo(Xtropicalis)
Seqinfo object with 167 sequences (1 circular) from xenTro10 genome:
  seqnames             seqlengths isCircular   genome
  chr1                  217471166      FALSE xenTro10
  chr2                  181034961      FALSE xenTro10
  chr3                  153873357      FALSE xenTro10
  chr4                  153961319      FALSE xenTro10
  chr5                  164033575      FALSE xenTro10
  ...                         ...        ...      ...
  chrUn_NW_022279498v1        786      FALSE xenTro10
  chrUn_NW_022279499v1        755      FALSE xenTro10
  chrUn_NW_022279500v1        748      FALSE xenTro10
  chrUn_NW_022279501v1        593      FALSE xenTro10
  chrUn_NW_022279502v1        582      FALSE xenTro10
> seqlevelsStyle(Xtropicalis)  # UCSC
[1] "UCSC"
> seqlevelsStyle(Xtropicalis) <- "NCBI"
> seqinfo(Xtropicalis)
Seqinfo object with 167 sequences (1 circular) from UCB_Xtro_10.0 genome:
  seqnames seqlengths isCircular        genome
  Chr1      217471166      FALSE UCB_Xtro_10.0
  Chr2      181034961      FALSE UCB_Xtro_10.0
  Chr3      153873357      FALSE UCB_Xtro_10.0
  Chr4      153961319      FALSE UCB_Xtro_10.0
  Chr5      164033575      FALSE UCB_Xtro_10.0
  ...             ...        ...           ...
  Sca152          786      FALSE UCB_Xtro_10.0
  Sca153          755      FALSE UCB_Xtro_10.0
  Sca154          748      FALSE UCB_Xtro_10.0
  Sca155          593      FALSE UCB_Xtro_10.0
  Sca156          582      FALSE UCB_Xtro_10.0
> library(GenomicFeatures)
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

> txdb <- makeTxDbFromUCSC("xenTro10", tablename="ncbiRefSeqCurated")
Download the ncbiRefSeqCurated table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_cds_locs_from_UCSC_txtable(ucsc_txtable) :
  UCSC data anomaly in 61 transcript(s): the cds cumulative length is not
  a multiple of 3 for transcripts ‘NM_203735.1’ ‘NM_001141992.1’
  ‘NM_001045606.1’ ‘NM_203834.1’ ‘NM_001097185.1’ ‘NM_001126615.2’
  ‘NM_001142155.1’ ‘NM_001113115.1’ ‘NM_001252199.1’ ‘NM_001045688.1’
  ‘NM_001127118.1’ ‘NM_001128050.1’ ‘NM_001011298.1’ ‘NM_001112929.1’
  ‘NM_001114246.1’ ‘NM_001032305.1’ ‘NM_001017252.2’ ‘NM_001016566.2’
  ‘NM_001004844.1’ ‘NM_001079461.1’ ‘NM_001127053.1’ ‘NM_203535.1’
  ‘NM_001114037.1’ ‘NM_001172058.1’ ‘NM_001001231.1’ ‘NM_001079340.1’
  ‘NM_001126649.1’ ‘NM_001126752.1’ ‘NM_001015787.1’ ‘NM_001126604.1’
  ‘NM_001126982.1’ ‘NM_001114203.1’ ‘NM_001078998.1’ ‘NM_001001238.1’
  ‘NM_203727.1’ ‘NM_001102766.1’ ‘NM_001016417.2’ ‘NM_001172020.1’
  ‘NM_001097239.1’ ‘NM_001008172.1’ ‘NM_001016776.2’ ‘YP_203370.1’
  ‘YP [... truncated]
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: xenTro10
# Organism: Xenopus tropicalis
# Taxonomy ID: 8364
# UCSC Table: ncbiRefSeqCurated
# UCSC Track: NCBI RefSeq
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: no gene ids
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 8787
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-11-01 09:46:00 +0100 (Tue, 01 Nov 2022)
# GenomicFeatures version at creation time: 1.49.7
# RSQLite version at creation time: 2.2.18
# DBSCHEMAVERSION: 1.2
> seqinfo(txdb)
Seqinfo object with 167 sequences (1 circular) from xenTro10 genome:
  seqnames             seqlengths isCircular   genome
  chr1                  217471166      FALSE xenTro10
  chr2                  181034961      FALSE xenTro10
  chr3                  153873357      FALSE xenTro10
  chr4                  153961319      FALSE xenTro10
  chr5                  164033575      FALSE xenTro10
  ...                         ...        ...      ...
  chrUn_NW_022279498v1        786      FALSE xenTro10
  chrUn_NW_022279499v1        755      FALSE xenTro10
  chrUn_NW_022279500v1        748      FALSE xenTro10
  chrUn_NW_022279501v1        593      FALSE xenTro10
  chrUn_NW_022279502v1        582      FALSE xenTro10
> identical(seqinfo(txdb), getChromInfoFromUCSC("xenTro10", as.Seqinfo=TRUE))
[1] TRUE
> seqlevelsStyle(txdb)
[1] "UCSC"
> seqlevelsStyle(txdb) <- "NCBI"
> seqinfo(txdb)
Seqinfo object with 167 sequences (1 circular) from UCB_Xtro_10.0 genome:
  seqnames seqlengths isCircular        genome
  Chr1      217471166      FALSE UCB_Xtro_10.0
  Chr2      181034961      FALSE UCB_Xtro_10.0
  Chr3      153873357      FALSE UCB_Xtro_10.0
  Chr4      153961319      FALSE UCB_Xtro_10.0
  Chr5      164033575      FALSE UCB_Xtro_10.0
  ...             ...        ...           ...
  Sca152          786      FALSE UCB_Xtro_10.0
  Sca153          755      FALSE UCB_Xtro_10.0
  Sca154          748      FALSE UCB_Xtro_10.0
  Sca155          593      FALSE UCB_Xtro_10.0
  Sca156          582      FALSE UCB_Xtro_10.0
> transcript_ranges <- transcripts(txdb)
> transcript_ranges
GRanges object with 8787 ranges and 2 metadata columns:
         seqnames            ranges strand |     tx_id        tx_name
            <Rle>         <IRanges>  <Rle> | <integer>    <character>
     [1]     Chr1     362967-405411      + |         1 NM_001079275.1
     [2]     Chr1     726791-735801      + |         2 NM_001197203.1
     [3]     Chr1   6704711-6719765      + |         3 NM_001015807.1
     [4]     Chr1   9775906-9796087      + |         4 NM_001030418.1
     [5]     Chr1 12319747-12391429      + |         5 NM_001127950.1
     ...      ...               ...    ... .       ...            ...
  [8783]   Sca124          250-1299      - |      8783 NM_001142028.1
  [8784]   Sca126          357-1342      - |      8784 NM_001016747.2
  [8785]   Sca148           141-838      + |      8785 NM_001142219.1
  [8786]   Sca151           187-803      + |      8786 NM_001113880.1
  [8787]   Sca156             5-512      - |      8787 NM_001011360.1
  -------
  seqinfo: 167 sequences (1 circular) from UCB_Xtro_10.0 genome
> getSeq(Xtropicalis, transcript_ranges)
DNAStringSet object of length 8787:
         width seq
   [1]   42445 AGGGTTTTCCGGGTGTAGCAAGATGGCCGCC...AAATGTACCCAATAAACACTAAGGATGCCAA
   [2]    9011 CTCGCGTATAATTTCTCGCGCTGTGATTGGC...TTTGGGCCAATTAAAGTTTGTATATAATGTA
   [3]   15055 GCCGGTGGTGCGGCCCTGGGAGAGGAGCGGC...GTGTCACACCAAATAAATGCCATCCAAGGTA
   [4]   20182 AGGACCCTCCCACCTCTCCAACCTTCCAGCC...AATAAAGACTTCCATTTATTTTGGATCCTAA
   [5]   71683 CTCAGGGGCCACACACCGCCGGGTCCCTCCT...ATCATAGAATTTTTGTATGAAAAAAAAAAAA
   ...     ... ...
[8783]    1050 GGTTGGCCGCCCAAGATGCCGCTCCCATTGG...AAGTGGCCCCGTACCCCAAATCCAAAAAAGG
[8784]     986 TGCAGACGTACGCTAAGTTCTCCGGAGCCCC...GGGAGGAGGATGAGGATACGGAGAGGCAGGT
[8785]     698 GACAGACTGGTATCGGGGCGCCCAATCATCG...TCCAAAGGGGACAGCAGATATGTCTCGTGAG
[8786]     617 GGTGGGCGCCGTTGCCATGGAGAACTACAAG...ACTCGGGATTGTGGTACCAGAACTTGGGGGG
[8787]     508 CCTGCGGCGGGACGGCTACACGGTGCAGGTG...TGTTTGTGTGCTGTGCCAGCAGTAAGTGCCC
> seqlevelsStyle(transcript_ranges) <- "UCSC"
> getSeq(Xtropicalis, transcript_ranges)
Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width,  : 
  sequence chr1 not found
In addition: Warning message:
In .merge_two_Seqinfo_objects(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)
hpages commented 1 year ago

Perfect. Everything looks as expected. Even the error you got with your last call to getSeq(Xtropicalis, transcript_ranges).

It's important to understand that you got this error because the sequence names in Xtropicalis use a different convention than the sequence names in transcript_ranges. The former uses the NCBI names and the latter the UCSC names. This is a frequent situation in biocinformatics and a constant source of frustration. The analyst often needs to harmonize the sequence names used in various datasets as a preliminary step to their analysis.

This harmonization of the sequence names is very easy to do in Bioconductor. We do it by calling the seqlevelsStyle() setter on one of the two objects. So in the above example, we could do this harmonization with:

seqlevelsStyle(Xtropicalis) <- seqlevelsStyle(transcript_ranges)

or with:

seqlevelsStyle(transcript_ranges) <- seqlevelsStyle(Xtropicalis)

It works both ways.

So try this again:

library(BSgenome.Xtropicalis.NCBI.UCBXtro10.0)
library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("xenTro10", tablename="ncbiRefSeqCurated")
transcript_ranges <- transcripts(txdb)
getSeq(Xtropicalis, transcript_ranges)  # error!

and solve the problem by harmonizing the sequence names in Xtropicalis and transcript_ranges.

Also, on a slightly different topic, please take a look at my long comment here where I try to explain how the code in the fasta_to_sorted_2bit_for_Felis_catus_9.0.R script works. The code in your fasta_to_sorted_2bit_for_UCB_Xtro_10.0.R script works the same way so my explanations apply to your script too.

Be sure to let me know if you have questions.

Simplecodez commented 1 year ago

Perfect. Everything looks as expected. Even the error you got with your last call to getSeq(Xtropicalis, transcript_ranges).

It's important to understand that you got this error because the sequence names in Xtropicalis use a different convention than the sequence names in transcript_ranges. The former uses the NCBI names and the latter the UCSC names. This is a frequent situation in biocinformatics and a constant source of frustration. The analyst often needs to harmonize the sequence names used in various datasets as a preliminary step to their analysis.

This harmonization of the sequence names is very easy to do in Bioconductor. We do it by calling the seqlevelsStyle() setter on one of the two objects. So in the above example, we could do this harmonization with:

seqlevelsStyle(Xtropicalis) <- seqlevelsStyle(transcript_ranges)

or with:

seqlevelsStyle(transcript_ranges) <- seqlevelsStyle(Xtropicalis)

It works both ways.

So try this again:

library(BSgenome.Xtropicalis.NCBI.UCBXtro10.0)
library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("xenTro10", tablename="ncbiRefSeqCurated")
transcript_ranges <- transcripts(txdb)
getSeq(Xtropicalis, transcript_ranges)  # error!

Yes sir, it works now. I just tried it. Below is the result I got:

> txdb <- makeTxDbFromUCSC("xenTro10", tablename="ncbiRefSeqCurated")
Download the ncbiRefSeqCurated table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extract_cds_locs_from_UCSC_txtable(ucsc_txtable) :
  UCSC data anomaly in 61 transcript(s): the cds cumulative length is not
  a multiple of 3 for transcripts ‘NM_203735.1’ ‘NM_001141992.1’
  ‘NM_001045606.1’ ‘NM_203834.1’ ‘NM_001097185.1’ ‘NM_001126615.2’
  ‘NM_001142155.1’ ‘NM_001113115.1’ ‘NM_001252199.1’ ‘NM_001045688.1’
  ‘NM_001127118.1’ ‘NM_001128050.1’ ‘NM_001011298.1’ ‘NM_001112929.1’
  ‘NM_001114246.1’ ‘NM_001032305.1’ ‘NM_001017252.2’ ‘NM_001016566.2’
  ‘NM_001004844.1’ ‘NM_001079461.1’ ‘NM_001127053.1’ ‘NM_203535.1’
  ‘NM_001114037.1’ ‘NM_001172058.1’ ‘NM_001001231.1’ ‘NM_001079340.1’
  ‘NM_001126649.1’ ‘NM_001126752.1’ ‘NM_001015787.1’ ‘NM_001126604.1’
  ‘NM_001126982.1’ ‘NM_001114203.1’ ‘NM_001078998.1’ ‘NM_001001238.1’
  ‘NM_203727.1’ ‘NM_001102766.1’ ‘NM_001016417.2’ ‘NM_001172020.1’
  ‘NM_001097239.1’ ‘NM_001008172.1’ ‘NM_001016776.2’ ‘YP_203370.1’
  ‘YP [... truncated]
> transcript_ranges <- transcripts(txdb)
> seqlevelsStyle(Xtropicalis) <- seqlevelsStyle(transcript_ranges)
> getSeq(Xtropicalis, transcript_ranges)
DNAStringSet object of length 8787:
         width seq
   [1]   42445 AGGGTTTTCCGGGTGTAGCAAGATGGCCGCC...AAATGTACCCAATAAACACTAAGGATGCCAA
   [2]    9011 CTCGCGTATAATTTCTCGCGCTGTGATTGGC...TTTGGGCCAATTAAAGTTTGTATATAATGTA
   [3]   15055 GCCGGTGGTGCGGCCCTGGGAGAGGAGCGGC...GTGTCACACCAAATAAATGCCATCCAAGGTA
   [4]   20182 AGGACCCTCCCACCTCTCCAACCTTCCAGCC...AATAAAGACTTCCATTTATTTTGGATCCTAA
   [5]   71683 CTCAGGGGCCACACACCGCCGGGTCCCTCCT...ATCATAGAATTTTTGTATGAAAAAAAAAAAA
   ...     ... ...
[8783]    1050 GGTTGGCCGCCCAAGATGCCGCTCCCATTGG...AAGTGGCCCCGTACCCCAAATCCAAAAAAGG
[8784]     986 TGCAGACGTACGCTAAGTTCTCCGGAGCCCC...GGGAGGAGGATGAGGATACGGAGAGGCAGGT
[8785]     698 GACAGACTGGTATCGGGGCGCCCAATCATCG...TCCAAAGGGGACAGCAGATATGTCTCGTGAG
[8786]     617 GGTGGGCGCCGTTGCCATGGAGAACTACAAG...ACTCGGGATTGTGGTACCAGAACTTGGGGGG
[8787]     508 CCTGCGGCGGGACGGCTACACGGTGCAGGTG...TGTTTGTGTGCTGTGCCAGCAGTAAGTGCCC
>

I think everything is working okay now. Thank you

and solve the problem by harmonizing the sequence names in Xtropicalis and transcript_ranges.

Also, on a slightly different topic, please take a look at my long comment here where I try to explain how the code in the fasta_to_sorted_2bit_for_Felis_catus_9.0.R script works. The code in your fasta_to_sorted_2bit_for_UCB_Xtro_10.0.R script works the same way so my explanations apply to your script too.

Be sure to let me know if you have questions.

Alright sir, I will. Thank you.

hpages commented 1 year ago

Great. I'm closing this.

Please remember to record your contribution on Outreachy.

We're approaching the end of the application period so there's not much time left. If you want to work on a last task, please choose one from the "Additional tasks" section at the bottom of this page. Note that the first two are already assigned. Thanks!

Simplecodez commented 1 year ago

Alright, sir. Thank you. I am preparing my final application current.