how to handle scaffold regions?

loipf commented 3 years ago

Hi all,

I would like to also include scaffold regions into my analysis with opencravat. is this possible? Most of them are unplaced pieces but some also on chromosomes. I know the added value is probably not much but I would still like to include the information of them.

my .vcf file has the following CHROM ids: [1-22, X, Y, MT, GL000220.1, GL000008.2, KI270733.1, KI270386.1, ...] The reference genome used is from ensemble [ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz].

the error .log looks like this:

#
SOURCE:error.mapper         
LINE:8609
INPUT:1785858   chrGL000008.2   131 C   T
ERROR:near ".2": syntax error
#
SOURCE:error.mapper         
LINE:8610
INPUT:1785859   chrGL000008.2   142 G   A
ERROR:near ".2": syntax error
#

note how the mapper couldn't interpret the original id and changed the name by adding a chr in front of it.

should the scaffold names be changed or the version number be removed or is the hg38 annotation which opencravat is based on just not able to interpret these regions? If it is a misunderstanding on my side and these scaffold regions don't add any value at all and that's why you left them out, I could live with that too.

thank you

rkimoakbioinformatics commented 3 years ago

Hi loipf,

There are a few sources of the issue:

hg38's source data is the USCS-curated version of Gencode V33, and in this UCSC-version scaffold names are different from the original source. Specifically, the names follow the formula of "chr" + reference chromosome number (1-22, X, Y, M) [+ scaffold name + ("_alt", "_random", or "_fix")]. For example,
```
chr1
chr1_GL383518v1_alt
chr1_KI270711v1_random
chr1_KN196472v1_fix
```
There are differences in chromosome and scaffold names between Ensembl and Gencode V33 since Gencode uses the Genome Reference Consortium accessions.
hg38 adds chr to the names if they do not start with chr for convenience since all chromosome and scaffold names of the UCSC version of Gencode start with chr, so that input files with just chromosome numbers can be used without modification.
Another possible source of the issue may be the time of the release of Gencode V33 and Ensembl 101.

Your question raises an interesting question about the data source for hg38. We have been considering using Gencode directly in future hg38 versions, which will eliminate most of the sources of the issue you observed. Gencode and Ensembl gene annotations are supposed to be mostly the same.

In the meantime, I have just published a workaround version of hg38 v1.10.0. Please do

oc module install hg38

and try with your input. What it does is that it will try to find a matching scaffold name. For example, if GL000009.2 is given, then it will find a chromosome/scaffold name that contains GL000009 and use it, which is chr14_GL000009v2_random in the current hg38. Let me know how it works with your input.

loipf commented 3 years ago

wow thanks for the fast response and explanation, that helped to understand the problem. the fast fix works and searches for the fitting equivalent and there are no errors for these anymore. for a few scaffolds ids (probably not in the current hg38 and due to the newer version) it still throws the syntax error, but this is totally fine and I am happy with the fix.

can be closed

KarchinLab / open-cravat

how to handle scaffold regions? #58