sequence levels in 'query' not found in alignment files

damisiak commented 1 year ago

Dear all,

I generated txdb from Ensemble 110 and aligned my fastq files with human_Ensembl_110_hg38 like here:

txdb <-makeTxDbFromEnsembl(organism="Homo sapiens", release=110, circ_seqs=NULL, server="ensembldb.ensembl.org", username="anonymous", password=NULL, port=0L, tx_attrib=NULL)

Aligned files with:

proj <- qAlign(sampleFile = sampleFile, paired = "fr", genome = genomeFile, aligner = "Rhisat2", splicedAlignment = TRUE)

which worked fine.

Trying to count data at exonic regions gives the following:

cntEx <- qCount(proj, regS$exons, orientation = "any") Error in qCount(proj, regS$exons, orientation = "any") : sequence levels in 'query' not found in alignment files: GL000009.2, GL000194.1,....some more...

I understand, that in the .fa file are more/other chromosomes listed as in the txdb file, but how can this be at the same versions and how can I force qCount to make only a warning and not stop with an error? Is there a way to choose only common seqlevels before?

Thanks in advance.

sessionInfo() R version 4.3.1 (2023-06-16) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.3 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Berlin tzcode source: system (glibc)

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] QuasR_1.40.1 Rbowtie_1.40.0 TxDb.Hsapiens.UCSC.hg38.refGene_3.15.0 TxDb.Hsapiens.UCSC.hg38.knownGene_3.17.0 GenomicFeatures_1.52.2 AnnotationDbi_1.62.2
[7] Biobase_2.60.0 GenomicRanges_1.52.1 GenomeInfoDb_1.36.4 IRanges_2.34.1 S4Vectors_0.38.2 BiocGenerics_0.46.0
[13] eisaR_1.12.0

loaded via a namespace (and not attached): [1] tidyselect_1.2.0 dplyr_1.1.3 blob_1.2.4 filelock_1.0.2 Biostrings_2.68.1 bitops_1.0-7 fastmap_1.1.1 RCurl_1.98-1.12 BiocFileCache_2.8.0
[10] VariantAnnotation_1.46.0 GenomicAlignments_1.36.0 XML_3.99-0.14 digest_0.6.33 timechange_0.2.0 lifecycle_1.0.3 KEGGREST_1.40.1 RSQLite_2.3.1 Rhisat2_1.16.0
[19] magrittr_2.0.3 compiler_4.3.1 rlang_1.1.1 progress_1.2.2 tools_4.3.1 igraph_1.5.1 utf8_1.2.3 yaml_2.3.7 rtracklayer_1.60.1
[28] prettyunits_1.2.0 S4Arrays_1.0.6 interp_1.1-4 bit_4.0.5 curl_5.1.0 DelayedArray_0.26.7 RColorBrewer_1.1-3 xml2_1.3.5 abind_1.4-5
[37] ShortRead_1.58.0 BiocParallel_1.34.2 hwriter_1.3.2.1 grid_4.3.1 latticeExtra_0.6-30 fansi_1.0.5 edgeR_3.42.4 biomaRt_2.56.1 SummarizedExperiment_1.30.2 [46] cli_3.6.1 crayon_1.5.2 generics_0.1.3 rstudioapi_0.15.0 httr_1.4.7 rjson_0.2.21 RUnit_0.4.32 DBI_1.1.3 cachem_1.0.8
[55] stringr_1.5.0 zlibbioc_1.46.0 XVector_0.40.0 restfulr_0.0.15 matrixStats_1.0.0 vctrs_0.6.4 Matrix_1.6-1.1 hms_1.1.3 bit64_4.0.5
[64] jpeg_0.1-10 GenomicFiles_1.36.0 locfit_1.5-9.8 SGSeq_1.34.0 limma_3.56.2 glue_1.6.2 codetools_0.2-19 lubridate_1.9.3 stringi_1.7.12
[73] deldir_1.0-9 BiocIO_1.10.0 tibble_3.2.1 pillar_1.9.0 rappdirs_0.3.3 GenomeInfoDbData_1.2.10 BSgenome_1.68.0 R6_2.5.1 dbplyr_2.3.4
[82] lattice_0.21-9 RMariaDB_1.3.0 png_0.1-8 Rsamtools_2.16.0 memoise_2.0.1 Rcpp_1.0.11 MatrixGenerics_1.12.3 pkgconfig_2.0.3

mbstadler commented 1 year ago

Dear @damisiak

Unfortunately, chromosome names are not fully standardised even for a given genome assembly (like hg38) between different annotation providers, especially for the "non-standard" chromosomes like unplaced sequences.

In your case, gene annotation comes from Ensembl, but I assume that genomeFile is not from that same source. My recommendation would be to get both the genome sequence and annotation from the same source, for example for human or mouse from GENCODE (https://www.gencodegenes.org/human/). Using the GENCODE primary assembly genome fasta file will have the additional advantage that you have a non-redundant reference genome sequence (for example without duplicated regions to represent different haplotypes) and you will not lose any reads in the alignment that are wrongly classified as multi-hitters.

While using such consistent annotation is the preferred and recommended approach, you could also avoid the qCount error by first subsetting your query (regS$exons) to only include sequence levels (chromosomes) that are also present in your reference genome.

I hope that answers your question.

Regards, Michael

damisiak commented 1 year ago

Dear Michael,

thanks for the advise. So i will re-do the analysis with the new genome and annotation. Was a little bit surprised, that also the UCSC genome and annotation is not fitting.

Best,

Danny

fmicompbio / QuasR

sequence levels in 'query' not found in alignment files #43