Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome
7 stars 9 forks source link

BSgenome.Rnorvegicus.NCBI.rn6 #36

Closed Dalhte closed 1 year ago

Dalhte commented 1 year ago

Hi there. I'm tryng to analyse snATAC-seq data on Rat tissues using Signac. My snATAC data have this kind of "NCBI" format : 1-172749-173853 so I think I need a BSgenome package for the Rat genome version RN6 but in NCBI format, so I tried to forge one. I downloaded the BSgenome package and the fasta file from Ensembl (named 1.fa and so on..). Then created the seed as follow :

Package: BSgenome.Rnorvegicus.NCBI.rn6 Title: Full genome sequences for rattus norvegicus (NCBI version rn6) Description: Full genome sequences for rattus norvegicus as provided by NCBI Version: 1.0.0 organism: rattus norvegicus common_name: Rat provider: NCBI genome: Rnor_6.0 release_date: 2014 source_url: http://ftp.ensembl.org/pub/release-104/fasta/rattus_norvegicus/dna/ organism_biocview: rattus norvegicus BSgenomeObjname: Rnorvegicus.NCBI.rn6 seqnames: paste("", c(1:20, "X", "Y", "M", "_random"), sep="") circ_seqs: "M" seqs_srcdir: /media/david/F/Rattus genome/fasta

I'm getting :

forgeBSgenomeDataPkg("/media/david/F/Rattus genome/BSgenome.Rnorvegicus.NCBI.rn6-seed") Creating package in ./BSgenome.Rnorvegicus.NCBI.rn6 Loading '1' sequence from FASTA file '/media/david/F/Rattus genome/fasta/1.fa' ... DONE Loading... ..... Loading '_random' sequence from FASTA file '/media/david/F/Rattus genome/fasta/_random.fa' ... DONE Writing all sequences to './BSgenome.Rnorvegicus.NCBI.rn6/inst/extdata/single_sequences.2bit' ... DONE Message d'avis : Dans .forgeTwobitFileFromFastaFiles(seqnames, prefix, suffix, seqs_srcdir, : file contains 932 sequences, using the first sequence only

so working just fine I beleive. But when I check, it's still using the UCSC like "chr1" format instead of an NCBI like "1" format for chromosomes. How can I get around this ? I'm new to these analyses so it's highly possible I'm just doing something stupid here.

hpages commented 1 year ago

Hi,

What do you mean exactly by "in NCBI format"? AFAIK rn6 (UCSC) is the same as Rnor_6.0 (NCBI) except for the chromosome/scaffold names. Those can be easily switched back and forth with seqlevelsStyle():

> library(BSgenome.Rnorvegicus.UCSC.rn6)
> genome <- BSgenome.Rnorvegicus.UCSC.rn6

> genome
Rat genome:
# organism: Rattus norvegicus (Rat)
# genome: rn6
# provider: UCSC
# release date: Jul. 2014
# 953 sequences:
#   chr1                        chr2                       
#   chr3                        chr4                       
#   chr5                        chr6                       
#   chr7                        chr8                       
#   chr9                        chr10                      
#   ...                         ...                        
#   chrUn_KL568510v1            chrUn_KL568511v1           
#   chrUn_KL568512v1            chrUn_KL568513v1           
#   chrUn_KL568514v1            chrUn_KL568515v1           
#   chrUn_KL568516v1            chrUn_KL568517v1           
#   chrUn_KL568518v1                                       
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)

> seqlevelsStyle(genome)
[1] "UCSC"

> seqlevelsStyle(genome) <- "NCBI"
> genome
Rat genome:
# organism: Rattus norvegicus (Rat)
# genome: rn6
# provider: UCSC
# release date: Jul. 2014
# 953 sequences:
#   chr1            chr2            chr3            chr4           
#   chr5            chr6            chr7            chr8           
#   chr9            chr10           chr11           chr12          
#   chr13           chr14           chr15           chr16          
#   chr17           chr18           chr19           chr20          
#   ...             ...             ...             ...            
#   chrUn.8         chrUn.80        chrUn.81        chrUn.82       
#   chrUn.84        chrUn.86        chrUn.88        chrUn.89       
#   chrUn.9         chrUn.90        chrUn.92        chrUn.93       
#   chrUn.94        chrUn.96        chrUn.97        chrUn.98       
#   chrUn.99                                                       
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)

> seqlevelsStyle(genome)
[1] "NCBI"

Cheers, H.

Dalhte commented 1 year ago

Hi ! Thanks for the fast answer. I mean by Ncbi format the way sequences are called Ncbi is something like 1:100000-100100 for the 100 bp of the first chromosome at the 100000 bp position but ucsc would use something like chr1:100,000-100,100 i think My peaks are in the ncbi way but my bsgenome is in ucsc way. Does your solution will transform the BSgenome.Rnorvegicus.UCSC.rn6 BSgenome.Rnorvegicus.NCBI.rn6 with the "right" way to index the sequences ? Best D.

hpages commented 1 year ago

Hmm.. Not sure how what NCBI or UCSC do to "call sequences" (whatever than means) has anything to do with what we put in a BSgenome data package.

A BSgenome data package contains the chromosome/scaffold sequences, that's all. In addition the BSgenome software package provides an API to access those sequences, mostly genome$chr1, genome[["chr1"]], and getSeq(). This API is the same whatever the data is coming from.

If you've tried to run some code using BSgenome.Rnorvegicus.NCBI.rn6 and that code didn't work as you'd have expected, maybe you can share the details so we understand what the problem is? Please show code that we can run to reproduce the problem. Thanks

hpages commented 1 year ago

Ncbi is something like 1:100000-100100 for the 100 bp of the first chromosome at the 100000 bp position but ucsc would use something like chr1:100,000-100,100 i think

Now that I read this again: 1:100000-100100 won't work for the Rnor_6.0 (NCBI) assembly because this assembly has no chromosme named 1. See my second last post above for the chromosome/scaffold names used in Rnor_6.0.

You can actually see those names here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/895/GCF_000001895.5_Rnor_6.0/GCF_000001895.5_Rnor_6.0_assembly_report.txt (leftmost column, Sequence-Name).

Using chr1, chr2, etc... is very unusual for an NCBI assembly but that's what the people who submitted Rnor_6.0 (the Rat Genome Sequencing Consortium people) have done. Wrapping this in a BSgenome data package won't change that.

Dalhte commented 1 year ago

Hi The script I'm runing using BSgenome is

RegionStats(PG24, genome = BSgenome.Rnorvegicus.UCSC.rn6) Where PG24 is my signac object I get : Erreur dans .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width, : sequence 1 not found De plus : Message d'avis : Dans .Seqinfo.mergexy(x, y) : The 2 combined objects have no sequence levels in common. (Use suppressWarnings() to suppress this warning.)

same thing with my custom BSgenome.Rnorvegicus.NCBI.rn6 or using the >seqlevelsStyle(genome) <- "NCBI"

don't know if it can help but here is random things :

PG24[['ATAC']] ChromatinAssay data with 134655 features for 10149 cells Variable features: 0 Genome: Annotation present: TRUE Motifs present: FALSE Fragment files: 1

and

str(PG24[['ATAC']]) Formal class 'ChromatinAssay' [package "Signac"] with 16 slots ..@ ranges :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots .. .. ..@ seqnames :Formal class 'Rle' [package "S4Vectors"] with 4 slots .. .. .. .. ..@ values : Factor w/ 42 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ... .. .. .. .. ..@ lengths : int [1:42] 14066 10731 9445 8970 8783 7755 7853 7561 6195 7599 ... .. .. .. .. ..@ elementMetadata: NULL .. .. .. .. ..@ metadata : list() ..... ..@ counts :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots .. .. ..@ i : int [1:30392387] 32 62 283 354 383 535 543 562 572 645 ... .. .. ..@ p : int [1:10150] 0 3508 5602 9380 12679 15819 22498 25845 27977 28986 ... .. .. ..@ Dim : int [1:2] 134655 10149 .. .. ..@ Dimnames:List of 2 .. .. .. ..$ : chr [1:134655] "1-173268-173874" "1-174544-175414" "1-669086-670122" "1-1116585-1117435" ... .. .. .. ..$ : chr [1:10149] "AAACAGCCAATTAGCT-1" "AAACAGCCAGATTCAT-1" "AAACAGCCAGTTTGTG-1" "AAACATGCAAATTCGT-1" ... .. .. ..@ x : num [1:30392387] 2 2 2 2 2 2 2 2 2 2 ... .. .. ..@ factors : list() you can see there that fragments have names in "NCBI formats" ( "1-173268-173874"). That why I though it was the probleme. it would be similar to what I read there : https://support.bioconductor.org/p/101694/

I must be possible to construct a BSgenome file with the appropriate format as for example :

BSgenome.Hsapiens.NCBI.GRCh38 Human genome:

organism: Homo sapiens (Human)

genome: GRCh38

provider: NCBI

release date: 2013-12-17

455 sequences:

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

17 18 19 20

... ... ... ...

HSCHR19KIR_LUCE_A_HAP_CTG3_1 HSCHR19KIR_LUCE_BDEL_HAP_CTG3_1 HSCHR19KIR_RSH_A_HAP_CTG3_1 HSCHR19KIR_RSH_BA2_HAP_CTG3_1

HSCHR19KIR_T7526_A_HAP_CTG3_1 HSCHR19KIR_T7526_BDEL_HAP_CTG3_1 HSCHR19KIR_ABC08_A1_HAP_CTG3_1 HSCHR19KIR_ABC08_AB_HAP_C_P_CTG3_1

HSCHR19KIR_ABC08_AB_HAP_T_P_CTG3_1 HSCHR19KIR_FH05_A_HAP_CTG3_1 HSCHR19KIR_FH05_B_HAP_CTG3_1 HSCHR19KIR_FH06_A_HAP_CTG3_1

HSCHR19KIR_FH06_BA1_HAP_CTG3_1 HSCHR19KIR_FH08_A_HAP_CTG3_1 HSCHR19KIR_FH08_BAX_HAP_CTG3_1 HSCHR19KIR_FH13_A_HAP_CTG3_1

HSCHR19KIR_FH13_BA2_HAP_CTG3_1 HSCHR19KIR_FH15_A_HAP_CTG3_1 HSCHR19KIR_RP5_B_HAP_CTG3_1

I just don't know how to do that for RN6 :) Best David

vjcitn commented 1 year ago

Can't you make the dimnames in the ChromatinAssay instance have the chr prefix?

.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : chr [1:134655] "1-173268-173874" "1-174544-175414" "1-669086-670122" "1-1116585-1117435" ...

Unfortunately I can't get a ChromatinAssay instance to investigate:

> example(ChromatinAssay)
Warning message:
In example(ChromatinAssay) :
  'ChromatinAssay' has a help file but no examples
Dalhte commented 1 year ago

Hi Vince. Thanks for the advice. If I have no other choice I will but I would rather have a "suited" BSgenome package if possible :) Best D.

Dalhte commented 1 year ago

I tried again forging the BSgenome.NCBI format and it worked. I redownloaded the RN6 from NCBI this time (and not from ensembl, but not sure it's changing something), I changed the name of the fasta fie to remode the Chr prefix, and I think most importantly I renamed the final package (somehow I wonder if Rstudio was not using the same "in cache" files and not the one I was creating) Now I have :

BSgenome.Rnorvegicus.NCBI.rn6.4 Rat genome:

organism: rattus norvegicus (Rat)

genome: Rnor_6.0

provider: NCBI

release date: 2014

46 sequences:

1 2 3 4 5 6 7

8 9 10 11 12 13 14

15 16 17 18 19 20 X

Y MT 1.unlocalized.scaf 2.unlocalized.scaf 3.unlocalized.scaf 4.unlocalized.scaf 5.unlocalized.scaf

6.unlocalized.scaf 7.unlocalized.scaf 8.unlocalized.scaf 9.unlocalized.scaf 10.unlocalized.scaf 11.unlocalized.scaf 12.unlocalized.scaf

13.unlocalized.scaf 14.unlocalized.scaf 15.unlocalized.scaf 16.unlocalized.scaf 17.unlocalized.scaf 18.unlocalized.scaf 19.unlocalized.scaf

20.unlocalized.scaf X.unlocalized.scaf Y.unlocalized.scaf unplaced.scaf

(use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator to access a given sequence)

And RegionStats(PG24, genome = BSgenome.Rnorvegicus.NCBI.rn6.4) is working just fine :) Thanks a lot for your help, it helped me to understand what was not working. Best David

vjcitn commented 1 year ago

Maybe file an issue at Signac to promote a more flexible approach to working with diverse seqnames formats?

hpages commented 1 year ago

Or, if you can't add the chr prefix to the sequence names in your PG24 object, simply remove it from the sequence names in the BSgenome.Rnorvegicus.UCSC.rn6 object:

> seqnames(BSgenome.Rnorvegicus.UCSC.rn6) <- gsub("^chr", "", seqnames(BSgenome.Rnorvegicus.UCSC.rn6))
> BSgenome.Rnorvegicus.UCSC.rn6
Rat genome:
# organism: Rattus norvegicus (Rat)
# genome: rn6
# provider: UCSC
# release date: Jul. 2014
# 953 sequences:
#   1                        2                        3                       
#   4                        5                        6                       
#   7                        8                        9                       
#   10                       11                       12                      
#   13                       14                       15                      
#   ...                      ...                      ...                     
#   Un_KL568505v1            Un_KL568506v1            Un_KL568507v1           
#   Un_KL568508v1            Un_KL568509v1            Un_KL568510v1           
#   Un_KL568511v1            Un_KL568512v1            Un_KL568513v1           
#   Un_KL568514v1            Un_KL568515v1            Un_KL568516v1           
#   Un_KL568517v1            Un_KL568518v1                                    
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)

Sequence name incompatibilities are easy to work around and never require forging a new BSgenome package from scratch.

This was really a question about basic usage. These questions are always better asked on the Bioconductor support site where it would have been quickly solved, had you shown people the problem that you are running into (i.e. the code that's producing the error).

Dalhte commented 1 year ago

Hello I just tried seqnames(BSgenome.Rnorvegicus.UCSC.rn6) <- gsub("^chr", "", seqnames(BSgenome.Rnorvegicus.UCSC.rn6)) It's kind off working but not fully as there is other differences between NCBI and USCS (such as the way mitochondrial DNA or un alignated sequences are called). Forging an ad hoc bsgenome seems more simple than correcting each format differences and seems to work just fine :) Best David

hpages commented 1 year ago

The differences between NCBI and UCSC can be eliminated with seqlevelsStyle(genome) <- "NCBI", as I already explained above. So you need to switch the names to NCBI before removing the chr prefix:

library(BSgenome.Rnorvegicus.UCSC.rn6)
genome <- BSgenome.Rnorvegicus.UCSC.rn6
seqlevelsStyle(genome) <- "NCBI"
seqnames(genome) <- gsub("^chr", "", seqnames(genome))

Then:

> genome
Rat genome:
# organism: Rattus norvegicus (Rat)
# genome: rn6
# provider: UCSC
# release date: Jul. 2014
# 953 sequences:
#   1            2            3            4            5           
#   6            7            8            9            10          
#   11           12           13           14           15          
#   16           17           18           19           20          
#   X            Y            MT           1_random.1   1_random.3  
#   ...          ...          ...          ...          ...         
#   Un.71        Un.72        Un.73        Un.76        Un.77       
#   Un.78        Un.8         Un.80        Un.81        Un.82       
#   Un.84        Un.86        Un.88        Un.89        Un.9        
#   Un.90        Un.92        Un.93        Un.94        Un.96       
#   Un.97        Un.98        Un.99                                 
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)

This is way simpler than spending hours trying to forge your own BSgenome package from scratch :)

Cheers, H.