churchill-lab / g2gtools

Personal diploid genome creation and coordinate conversion
http://churchill-lab.github.io/g2gtools
21 stars 9 forks source link

AssertionError: start must be less than end #4

Closed jpdna closed 5 years ago

jpdna commented 7 years ago

I'm trying to follow the instructions at: https://g2gtools.readthedocs.io/en/latest/usage.html

to apply g2gtools to the strain: FVB_NJ

for which I downloaded data here. ftp://ftp-mouse.sanger.ac.uk/current_snps/strain_specific_vcfs/

I have currently getting the final error, despite the variants in the input VCFs appearing to be well formed:

g2gtools convert -c FVB_NJ/REF-to-FVB_NJ.chain -i ~/Downloads/Mus_musculus.GRCm38.87.gtf -f gtf -o FVB_NJ/FVB_NJ.gtf
CHAIN FILE: /Users/paschalj/install_anaconda/run2/FVB_NJ/REF-to-FVB_NJ.chain
INPUT FILE: /Users/paschalj/Downloads/Mus_musculus.GRCm38.87.gtf
OUTPUT FILE: /Users/paschalj/install_anaconda/run2/FVB_NJ/FVB_NJ.gtf
UNMAPPED FILE: /Users/paschalj/install_anaconda/run2/FVB_NJ/FVB_NJ.gtf.unmapped
Parsing chain file...
Traceback (most recent call last):
  File "/Users/paschalj/anaconda2/envs/g2gtools/bin/g2gtools", line 4, in <module>
    __import__('pkg_resources').run_script('g2gtools==0.1.31', 'g2gtools')
  File "/Users/paschalj/anaconda2/envs/g2gtools/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 744, in run_script
  File "/Users/paschalj/anaconda2/envs/g2gtools/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 1499, in run_script
  File "/Users/paschalj/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.1.31-py2.7.egg-info/scripts/g2gtools", line 117, in <module>
    G2GToolsApp()
  File "/Users/paschalj/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.1.31-py2.7.egg-info/scripts/g2gtools", line 75, in __init__
    getattr(self, args.command)()
  File "/Users/paschalj/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.1.31-py2.7.egg-info/scripts/g2gtools", line 78, in convert
    g2gtools.g2g_commands.command_convert(sys.argv[2:], self.script_name + ' convert')
  File "/Users/paschalj/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools/g2g_commands.py", line 89, in command_convert
    file_convert(args.chain, args.input, args.output, args.format, args.reverse)
  File "/Users/paschalj/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools/g2g.py", line 62, in file_convert
    convert_gtf_file(chain_file=chain_file, input_file=input_file, output_file=output_file, reverse=reverse)
  File "/Users/paschalj/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools/gtf.py", line 137, in convert_gtf_file
    chain_file = ChainFile(chain_file, reverse=reverse)
  File "/Users/paschalj/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools/chain.py", line 81, in __init__
    self.parse(self.is_reversed)
  File "/Users/paschalj/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools/chain.py", line 282, in parse
    mapping_tree[t_name].insert_interval(Interval(t_from, t_from+size, (q_name, q_from, q_from+size, q_strand, None, None, None, None)))
  File "intersection.pyx", line 291, in bx.intervals.intersection.Interval.__init__ (lib/bx/intervals/intersection.c:3789)
AssertionError: start must be less than end

I note that in also trying this with the older seqnature, I found the offending variants seems to occur in cases where there are consecutive SNPs, but not all such cases - though still many thousands across genome.

Any suggestions about how to solve this would be much appreciated a g2gtools is exactly what we need here to be able to more accurately map some RNA-seq data for this strain.

kbchoi-jax commented 7 years ago

I cannot replicate your error. Your chain file must be wrong. Try delete any existing fasta index file (fai format), and run 'vcf2chain' again.

mfrancesconi81 commented 6 years ago

Same here.

I created two chain files using a vcf file with INDELs from a joint call of variants in two inbred lines created using GATK. The reference is from Ensembl

the chain files have been created without errors or warnings.

I then used the chain files to liftover the corresponding gtf from Ensembl and I got the following message.

Traceback (most recent call last): File "/software/bl/el7.2/anaconda2/envs/g2gtools/bin/g2gtools", line 4, in import('pkg_resources').run_script('g2gtools==0.1.31', 'g2gtools') File "/software/bl/el7.2/anaconda2/envs/g2gtools/lib/python2.7/site-packages/pkg_resources/init.py", line 658, in run_script self.require(requires)[0].run_script(script_name, ns) File "/software/bl/el7.2/anaconda2/envs/g2gtools/lib/python2.7/site-packages/pkg_resources/init.py", line 1438, in run_script exec(code, namespace, namespace) File "/nfs/software/bl/el7.2/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.1.31-py2.7.egg-info/scripts/g2gtools", line 117, in G2GToolsApp() File "/nfs/software/bl/el7.2/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.1.31-py2.7.egg-info/scripts/g2gtools", line 75, in init getattr(self, args.command)() File "/nfs/software/bl/el7.2/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.1.31-py2.7.egg-info/scripts/g2gtools", line 78, in convert g2gtools.g2g_commands.command_convert(sys.argv[2:], self.script_name + ' convert') File "/software/bl/el7.2/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools/g2g_commands.py", line 89, in command_convert file_convert(args.chain, args.input, args.output, args.format, args.reverse) File "/software/bl/el7.2/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools/g2g.py", line 62, in file_convert convert_gtf_file(chain_file=chain_file, input_file=input_file, output_file=output_file, reverse=reverse) File "/software/bl/el7.2/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools/gtf.py", line 137, in convert_gtf_file chain_file = ChainFile(chain_file, reverse=reverse) File "/software/bl/el7.2/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools/chain.py", line 81, in init self.parse(self.is_reversed) File "/software/bl/el7.2/anaconda2/envs/g2gtools/lib/python2.7/site-packages/g2gtools/chain.py", line 282, in parse mapping_tree[t_name].insert_interval(Interval(t_from, t_from+size, (q_name, q_from, q_from+size, q_strand, None, None, None, None))) File "intersection.pyx", line 291, in bx.intervals.intersection.Interval.init (lib/bx/intervals/intersection.c:3789) AssertionError: start must be less than end

the Ensembl reference .fasta file include many small scaffolds. I see that in another thread this has been reported to be a problem.

Has this issue been solved or do you think a large number of scaffolds could be the problem?

Thanks,

Mirko

kbchoi-jax commented 6 years ago

You are creating two separate annotation files, correct? This error usually happens when you have a stale fasta index file. I don't think scaffolds in reference would ever mattered, but I cannot recall whether I tested it. Could you please try it without scaffolds? Thanks!

kbchoi-jax commented 6 years ago

Dear @jpdna,

Sorry, were you able to get what you wanted?

mfrancesconi81 commented 6 years ago

Thanks for the prompt reply, yes I'm creating two separate annotations in two separate runs from two chain files created from the same vcf that includes indels for both lines. I paste only the error from one run. Both runs gave me the same error. What do you mean by a stale fasta index file?

Anyway I'll try without scaffolds and let you know.

Thanks,

Mirko

kbchoi-jax commented 6 years ago

I meant you may want to remove .fai file if exists.

mfrancesconi81 commented 6 years ago

I tried removing the fasta index and it did't work.

again the chain files have been created without errors or warnings, but then I get errors when lifting over the .gtf

Something I didn't notice before (but might have happened anyway) in the stats output during the creation of chain files: I get high number of conflicting VCF entries and the statistics are veri similar for al the chromosomes and scaffolds (see below)

Chromosome: 2L STATISTICS 1,311 HETEROZYGOUS 724 NOT RELEVANT 35,888 ACCEPTED 106,416 CONFLICTING VCF ENTRIES 81,498 SAME AS REF Chromosome: 2R STATISTICS 1,366 HETEROZYGOUS 642 NOT RELEVANT 35,888 ACCEPTED 106,416 CONFLICTING VCF ENTRIES 81,498 SAME AS REF Chromosome: 3L STATISTICS 1,572 HETEROZYGOUS 806 NOT RELEVANT 35,888 ACCEPTED 106,416 CONFLICTING VCF ENTRIES 81,498 SAME AS REF Chromosome: 3R STATISTICS 1,715 HETEROZYGOUS 654 NOT RELEVANT 35,888 ACCEPTED 106,416 CONFLICTING VCF ENTRIES 81,498 SAME AS REF

Once I remove the scaffolds everything seems fine: I created two chian files and used them to lift over .gtf files this time without errors.

I'm still in the process to check that everything is fine but this is for sure an improvement and tells you that something is definitely wrong when there are many scaffolds in the genome.

M.