Closed jpdna closed 5 years ago
I cannot replicate your error. Your chain file must be wrong. Try deleting any existing fasta index file (fai format), and run 'vcf2chain' again.
Same here.
I created two chain files using a VCF file with INDELs from a joint call of variants in two inbred lines, created using GATK. The reference is from Ensembl.
The chain files were created without errors or warnings.
I then used the chain files to liftover the corresponding gtf from Ensembl and I got the following message.
Traceback (most recent call last):
File "/software/bl/el7.2/anaconda2/envs/g2gtools/bin/g2gtools", line 4, in
The Ensembl reference .fasta file includes many small scaffolds. I see that in another thread this has been reported to be a problem.
Has this issue been solved or do you think a large number of scaffolds could be the problem?
Thanks,
Mirko
You are creating two separate annotation files, correct? This error usually happens when you have a stale fasta index file. I don't think scaffolds in the reference would ever matter, but I cannot recall whether I tested it. Could you please try it without scaffolds? Thanks!
Dear @jpdna,
Sorry, were you able to get what you wanted?
Thanks for the prompt reply. Yes, I'm creating two separate annotations in two separate runs, from two chain files created from the same VCF, which includes indels for both lines. I pasted only the error from one run; both runs gave me the same error. What do you mean by a stale fasta index file?
Anyway I'll try without scaffolds and let you know.
Thanks,
Mirko
I meant you may want to remove the .fai file if it exists.
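For anyone following along, the suggestion above can be sketched in a few lines of Python. This assumes the index sits next to the reference as `<fasta>.fai` (the usual samtools/pysam naming); the helper name `remove_stale_fai` is made up for illustration and is not part of g2gtools.

```python
from pathlib import Path


def remove_stale_fai(fasta_path):
    """Delete '<fasta_path>.fai' if it exists.

    Returns True if an index file was removed, False if there was none.
    Assumption: the index uses the conventional '<fasta>.fai' name.
    """
    fai = Path(str(fasta_path) + ".fai")
    if fai.is_file():
        fai.unlink()
        return True
    return False
```

After calling this on the reference, the index should be rebuilt the next time a tool needs it, so 'vcf2chain' can simply be run again.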
I tried removing the fasta index and it didn't work.
Again, the chain files were created without errors or warnings, but then I get errors when lifting over the .gtf.
Something I didn't notice before (but might have happened anyway) in the stats output during the creation of the chain files: I get a high number of conflicting VCF entries, and the statistics are very similar for all the chromosomes and scaffolds (see below).
Chromosome: 2L
STATISTICS
1,311 HETEROZYGOUS
724 NOT RELEVANT
35,888 ACCEPTED
106,416 CONFLICTING VCF ENTRIES
81,498 SAME AS REF

Chromosome: 2R
STATISTICS
1,366 HETEROZYGOUS
642 NOT RELEVANT
35,888 ACCEPTED
106,416 CONFLICTING VCF ENTRIES
81,498 SAME AS REF

Chromosome: 3L
STATISTICS
1,572 HETEROZYGOUS
806 NOT RELEVANT
35,888 ACCEPTED
106,416 CONFLICTING VCF ENTRIES
81,498 SAME AS REF

Chromosome: 3R
STATISTICS
1,715 HETEROZYGOUS
654 NOT RELEVANT
35,888 ACCEPTED
106,416 CONFLICTING VCF ENTRIES
81,498 SAME AS REF
Once I removed the scaffolds everything seemed fine: I created two chain files and used them to lift over the .gtf files, this time without errors.
I'm still in the process of checking that everything is fine, but this is for sure an improvement and suggests that something is definitely wrong when there are many scaffolds in the genome.
M.
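For reference, one way to drop the scaffolds from a reference before building the chain files is sketched below. This is only an illustration of the workaround described above, not something g2gtools provides: `filter_fasta` is a hypothetical helper, and the set of names to keep (here the four chromosome arms shown in the stats output) is an assumption that would need extending to every sequence you want to retain.

```python
def filter_fasta(lines, keep):
    """Yield only the FASTA lines that belong to records named in `keep`.

    The record name is taken as the first whitespace-delimited word
    after '>' on the header line.
    """
    writing = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            writing = name in keep
        if writing:
            yield line


# Example usage (file names are placeholders):
# keep = {"2L", "2R", "3L", "3R"}
# with open("reference.fa") as src, open("reference.noscaffolds.fa", "w") as dst:
#     dst.writelines(filter_fasta(src, keep))
```

Any existing .fai index belongs to the unfiltered file, so the filtered reference needs a fresh index of its own.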
I'm trying to follow the instructions at: https://g2gtools.readthedocs.io/en/latest/usage.html
to apply g2gtools to the strain: FVB_NJ
for which I downloaded data from here: ftp://ftp-mouse.sanger.ac.uk/current_snps/strain_specific_vcfs/
I am currently getting the final error, despite the variants in the input VCFs appearing to be well formed:
I note that in also trying this with the older seqnature, I found that the offending variants seem to occur in cases where there are consecutive SNPs, but not in all such cases, though still in many thousands across the genome.
Any suggestions about how to solve this would be much appreciated, as g2gtools is exactly what we need here to be able to more accurately map some RNA-seq data for this strain.
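To narrow down which records are involved, the consecutive-SNP observation above could be checked with a small script along these lines. It is a rough sketch, not part of g2gtools or seqnature: `adjacent_snps` is a hypothetical helper, and it assumes an uncompressed, position-sorted VCF with the standard tab-separated CHROM, POS, ID, REF, ALT columns.

```python
BASES = set("ACGTN")


def adjacent_snps(vcf_lines):
    """Return (chrom, pos1, pos2) for SNP records at consecutive positions.

    Header lines are skipped. A record counts as a SNP when REF and every
    ALT allele are single bases. Non-SNP records are ignored.
    """
    hits = []
    prev = None  # (chrom, pos) of the most recent SNP seen
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        is_snp = ref.upper() in BASES and all(
            a.upper() in BASES for a in alt.split(",")
        )
        if not is_snp:
            continue
        if prev is not None and prev[0] == chrom and pos - prev[1] == 1:
            hits.append((chrom, prev[1], pos))
        prev = (chrom, pos)
    return hits
```

Comparing the positions it reports against the records named in the error might show whether adjacency alone explains the failures, or whether something else about those records is different.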