KarchinLab / open-cravat

A modular annotation tool for genomic variants
MIT License
113 stars 27 forks source link

how to handle scaffold regions? #58

Closed loipf closed 3 years ago

loipf commented 3 years ago

Hi all,

I would like to also include scaffold regions into my analysis with opencravat. is this possible? Most of them are unplaced pieces but some also on chromosomes. I know the added value is probably not much but I would still like to include the information of them.

my .vcf file has the following CHROM ids: [1-22, X, Y, MT, GL000220.1, GL000008.2, KI270733.1, KI270386.1, ...] The reference genome used is from ensemble [ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz].

the error .log looks like this:

#
SOURCE:error.mapper         
LINE:8609
INPUT:1785858   chrGL000008.2   131 C   T
ERROR:near ".2": syntax error
#
SOURCE:error.mapper         
LINE:8610
INPUT:1785859   chrGL000008.2   142 G   A
ERROR:near ".2": syntax error
#

note how the mapper couldn't interpret the original id and changed the name by adding a chr in front of it.

should the scaffold names be changed or the version number be removed or is the hg38 annotation which opencravat is based on just not able to interpret these regions? If it is a misunderstanding on my side and these scaffold regions don't add any value at all and that's why you left them out, I could live with that too.

thank you

rkimoakbioinformatics commented 3 years ago

Hi loipf,

There are a few sources of the issue:

Your question raises an interesting question about the data source for hg38. We have been considering using Gencode directly in future hg38 versions, which will eliminate most of the sources of the issue you observed. Gencode and Ensembl gene annotations are supposed to be mostly the same.

In the meantime, I have just published a workaround version of hg38 v1.10.0. Please do

oc module install hg38

and try with your input. What it does is that it will try to find a matching scaffold name. For example, if GL000009.2 is given, then it will find a chromosome/scaffold name that contains GL000009 and use it, which is chr14_GL000009v2_random in the current hg38. Let me know how it works with your input.

loipf commented 3 years ago

wow thanks for the fast response and explanation, that helped to understand the problem. the fast fix works and searches for the fitting equivalent and there are no errors for these anymore. for a few scaffolds ids (probably not in the current hg38 and due to the newer version) it still throws the syntax error, but this is totally fine and I am happy with the fix.

can be closed