Closed loipf closed 3 years ago
Hi loipf,
There are a few sources of the issue:
hg38's source data is the UCSC-curated version of Gencode V33, and in this UCSC version the scaffold names differ from those in the original source. Specifically, the names follow the formula "chr" + reference chromosome number (1-22, X, Y, M) [+ scaffold name + ("_alt", "_random", or "_fix")]. For example,
chr1
chr1_GL383518v1_alt
chr1_KI270711v1_random
chr1_KN196472v1_fix
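The naming formula above can be illustrated with a small parser. This is only a sketch under the assumptions stated in the comments: the regular expression and the function name parse_ucsc_name are hypothetical, derived from the formula and the four example names, and are not taken from hg38's code. Names that fall outside the described formula simply return None.

```python
import re

# Assumed pattern, derived from the formula described above:
# "chr" + chromosome (1-22, X, Y, M), optionally followed by
# "_" + accession + "v" + version + "_" + (alt | random | fix).
UCSC_NAME = re.compile(
    r"^chr(?P<chrom>[1-9]|1[0-9]|2[0-2]|X|Y|M)"
    r"(?:_(?P<accession>[A-Z]{2}\d+)v(?P<version>\d+)_(?P<kind>alt|random|fix))?$"
)

def parse_ucsc_name(name):
    """Split a UCSC-style name into its parts, or return None if it does not fit."""
    match = UCSC_NAME.match(name)
    if match is None:
        return None
    return match.groupdict()

for example in ["chr1", "chr1_GL383518v1_alt", "chr1_KI270711v1_random", "GL000009.2"]:
    print(example, parse_ucsc_name(example))
```

An Ensembl-style name such as GL000009.2 does not fit the pattern, which is the mismatch discussed in this thread.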
There are differences in chromosome and scaffold names between Ensembl and Gencode V33 since Gencode uses the Genome Reference Consortium accessions.
hg38 adds chr to names that do not start with chr. This is for convenience: all chromosome and scaffold names in the UCSC version of Gencode start with chr, so input files with just chromosome numbers can be used without modification.
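The prefixing behavior described here amounts to something like the following. It is a minimal sketch, not hg38's actual code; the function name add_chr_prefix is hypothetical.

```python
def add_chr_prefix(name):
    """Prepend "chr" unless the name already starts with it."""
    return name if name.startswith("chr") else "chr" + name

# Plain chromosome numbers become valid UCSC-style names.
print(add_chr_prefix("1"))
# An Ensembl-style scaffold id also gets the prefix, which yields a name
# that does not exist among the UCSC-style names.
print(add_chr_prefix("GL000220.1"))
```

This is why a scaffold id from an Ensembl-based VCF shows up in the log with chr in front of it.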
Another possible source of the issue may be the difference in release timing between Gencode V33 and Ensembl 101.
Your report raises an interesting question about the data source for hg38. We have been considering using Gencode directly in future hg38 versions, which would eliminate most of the sources of the issue you observed. Gencode and Ensembl gene annotations are supposed to be mostly the same.
In the meantime, I have just published a workaround version of hg38, v1.10.0. Please run
oc module install hg38
and try it with your input. The workaround tries to find a matching scaffold name. For example, if GL000009.2 is given, it finds a chromosome/scaffold name that contains GL000009 and uses it, which is chr14_GL000009v2_random in the current hg38. Let me know how it works with your input.
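The matching idea can be sketched roughly as follows. This is an illustration of the description above, not the code shipped in hg38 v1.10.0: the function name find_matching_scaffold and the known_names list are hypothetical, and the sketch assumes the version is whatever follows the first "." in the input name.

```python
def find_matching_scaffold(input_name, known_names):
    """Return a known name for input_name, or None if nothing matches."""
    # An exact hit needs no fallback.
    if input_name in known_names:
        return input_name
    # Assumption: drop the version suffix, e.g. "GL000009.2" -> "GL000009".
    accession = input_name.split(".", 1)[0]
    # Fall back to the first known name that contains the accession.
    for known in known_names:
        if accession in known:
            return known
    return None

# Hypothetical subset of the names available in hg38.
known_names = ["chr1", "chr14", "chr14_GL000009v2_random", "chr1_KI270711v1_random"]

print(find_matching_scaffold("GL000009.2", known_names))
print(find_matching_scaffold("KI999999.1", known_names))
```

A plain substring check like this only makes sense for scaffold accessions; for a short name such as 1 it would be ambiguous, so a real implementation would need to restrict when the fallback applies.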
Wow, thanks for the fast response and explanation; that helped me understand the problem. The quick fix works: it searches for the fitting equivalent, and there are no more errors for these. For a few scaffold IDs (probably not in the current hg38, due to the newer version) it still throws the syntax error, but this is totally fine and I am happy with the fix.
can be closed
Hi all,
I would also like to include scaffold regions in my analysis with opencravat. Is this possible? Most of them are unplaced pieces, but some are also on chromosomes. I know the added value is probably not much, but I would still like to include their information.
My .vcf file has the following CHROM ids: [1-22, X, Y, MT, GL000220.1, GL000008.2, KI270733.1, KI270386.1, ...]. The reference genome used is from Ensembl [ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz]. The error .log looks like this:
Note how the mapper couldn't interpret the original id and changed the name by adding a chr in front of it. Should the scaffold names be changed or the version number be removed, or is the hg38 annotation that opencravat is based on just not able to interpret these regions? If it is a misunderstanding on my side and these scaffold regions don't add any value at all and that's why you left them out, I could live with that too.
thank you