Bioconductor / txdbmaker

A set of tools for making TxDb objects from genomic annotations from various sources (e.g. UCSC, Ensembl, and GFF files)
2 stars 0 forks source link

How to make a TxDb object for the T2T-CHM13v2.0 genome (telomere to telomere Human genome) #1

Open hpages opened 5 months ago

hpages commented 5 months ago

[Moved from https://github.com/Bioconductor/GenomicFeatures/issues/65 on March 22, 2024]

Question: How to make a TxDb object for the T2T-CHM13v2.0 genome (telomere to telomere Human genome), a.k.a. the hs1 genome at UCSC.

Answer: Unfortunately, makeTxDbFromUCSC() doesn't support hs1 at the moment, so we're going to use the GFF file provided by NCBI.

  1. Download GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/

  2. Import the GFF file as a GRanges object:

    library(rtracklayer)
    
    ## Takes < 1 min, consumes about 7Gb of RAM
    gff <- import("GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz")
  3. Note that the sequence names in the GRanges object are RefSeq accessions:

    seqlevels(gff)
    #  [1] "NC_060925.1" "NC_060926.1" "NC_060927.1" "NC_060928.1" "NC_060929.1"
    #  [6] "NC_060930.1" "NC_060931.1" "NC_060932.1" "NC_060933.1" "NC_060934.1"
    # [11] "NC_060935.1" "NC_060936.1" "NC_060937.1" "NC_060938.1" "NC_060939.1"
    # [16] "NC_060940.1" "NC_060941.1" "NC_060942.1" "NC_060943.1" "NC_060944.1"
    # [21] "NC_060945.1" "NC_060946.1" "NC_060947.1" "NC_060948.1"

    Let's change them to the official chromosome names:

    library(GenomeInfoDb)
    chrominfo <- getChromInfoFromNCBI("T2T-CHM13v2.0")
    seqlevels(gff) <- setNames(chrominfo$SequenceName, chrominfo$RefSeqAccn)
    seqlevels(gff)
    #  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
    # [16] "16" "17" "18" "19" "20" "21" "22" "X"  "Y"  "MT"
  4. Add the complete sequence info to the GRanges object:

    seqinfo(gff) <- Seqinfo(genome="T2T-CHM13v2.0")
    seqinfo(gff)
    # Seqinfo object with 25 sequences (1 circular) from T2T-CHM13v2.0 genome:
    #   seqnames seqlengths isCircular        genome
    #   1         248387328      FALSE T2T-CHM13v2.0
    #   2         242696752      FALSE T2T-CHM13v2.0
    #   3         201105948      FALSE T2T-CHM13v2.0
    #   4         193574945      FALSE T2T-CHM13v2.0
    #   5         182045439      FALSE T2T-CHM13v2.0
    #   ...             ...        ...           ...
    #   21         45090682      FALSE T2T-CHM13v2.0
    #   22         51324926      FALSE T2T-CHM13v2.0
    #   X         154259566      FALSE T2T-CHM13v2.0
    #   Y          62460029      FALSE T2T-CHM13v2.0
    #   MT            16569       TRUE T2T-CHM13v2.0
  5. Use makeTxDbFromGRanges() to make a TxDb object from the GRanges object:

    library(txdbmaker)
    
    ## This will emit 3 warnings that can be ignored.
    txdb <- makeTxDbFromGRanges(gff, taxonomyId=9606)
    
    txdb
    # TxDb object:
    ## Db type: TxDb
    ## Supporting package: GenomicFeatures
    ## Genome: T2T-CHM13v2.0
    ## Organism: Homo sapiens
    ## Taxonomy ID: 9606
    ## Nb of transcripts: 188205
    ## Db created by: txdbmaker package from Bioconductor
    ## Creation time: 2024-03-22 16:56:52 -0700 (Fri, 22 Mar 2024)
    ## txdbmaker version at creation time: 0.99.7
    ## RSQLite version at creation time: 2.3.5
    ## DBSCHEMAVERSION: 1.2

Note that if you need the UCSC chromosome names instead of the NCBI ones, you can switch them with seqlevelsStyle():

seqlevelsStyle(txdb)
# [1] "NCBI"

seqlevelsStyle(txdb) <- "UCSC"

seqlevelsStyle(txdb)
# [1] "UCSC"

seqlevels(txdb)
#  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
# [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
# [19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM" 

H.

Claudiolete commented 2 weeks ago

Hello, I am having problems with the getChromInfoFromNCBI function. When I try chrominfo <- getChromInfoFromNCBI("T2T-CHM13v2.0"), shows this: Error in download.file(url, destfile, method, quiet = TRUE): it was not possible to open the following URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/' Warning: In download.file(url, destfile, method, quiet = TRUE) : URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/': Timeout of 60 seconds was reached

Is this some trouble with my connection or with something else?

Thanks in advance

hpages commented 1 week ago

What's your sessionInfo()?

Is the following code working for you?

library(GenomeInfoDb)
list_ftp_dir("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/")

Please show the output you get.

If it doesn't work then yes it's some trouble with your connection. If you are behind an HTTP proxy (a common set up at many institutions), then it could also be a problem with the configuration of the proxy, in which case you would need to talk with your institution IT.

FWIW I just tried this again with BioC 3.19 (the latest Bioconductor release, requires R 4.4), and it works fine for me:

library(GenomeInfoDb)
list_ftp_dir("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/")
# [1] "GCA_009914755.1_T2T-CHM13v0.7" "GCA_009914755.2_T2T-CHM13v1.0"
# [3] "GCA_009914755.3_T2T-CHM13v1.1" "GCA_009914755.4_T2T-CHM13v2.0"

My sessionInfo():

R version 4.4.0 alpha (2024-04-03 r86327)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 23.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.4.r86327/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] GenomeInfoDb_1.40.1 IRanges_2.38.1      S4Vectors_0.42.1   
[4] BiocGenerics_0.50.0

loaded via a namespace (and not attached):
[1] httr_1.4.7              compiler_4.4.0          R6_2.5.1               
[4] tools_4.4.0             GenomeInfoDbData_1.2.12 UCSC.utils_1.0.0       
[7] jsonlite_1.8.8