DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
470 stars 114 forks source link

Bugs in hisat2_extract_splice_sites.py #107

Open johnomics opened 7 years ago

johnomics commented 7 years ago

I am running HISAT 2.0.5 and working with the Arabidopsis thaliana annotation Araport11, specifically the GTF. This file is not processed by hisat2_extract_splice_sites.py. It produces no output:

$ hisat2_extract_splice_sites.py Araport11_GFF3_genes_transposons.201606.gtf
$

And running verbosely produces an error:

$ hisat2_extract_splice_sites.py -v Araport11_GFF3_genes_transposons.201606.gtf
genes: 0, genes with multiple isoforms: 0
Traceback (most recent call last):
  File "/biol/programs/bin/hisat2_extract_splice_sites.py", line 138, in <module>
    extract_splice_sites(args.gtf_file, args.verbose)
  File "/biol/programs/bin/hisat2_extract_splice_sites.py", line 107, in extract_splice_sites
    len(trans), sum(trans_lengths.elements())/len(trans)),
ZeroDivisionError: integer division or modulo by zero

The GTF records look like this:

$ head -n1 Araport11_GFF3_genes_transposons.201606.gtf
Chr1    Araport11   5UTR    3631    3759    .   +   .   transcript_id "AT1G01010.1"; gene_id "AT1G01010"

So hisat2_extract_splice_sites.py is failing because it ignores the last attribute (line 52):

        for attr in values.split(';')[:-1]:

Strictly speaking, the Araport11 GTF is malformed, as every attribute should end with a semicolon, even the last (GTF format definition), so technically the script is correct.

But as the script is only interested in the gene_id and transcript_id attributes, perhaps attr in the loop above could be checked for one of these values before it is added to values_dict, rather than after the complete values_dict has been created, and the [:-1] could be removed?

alexgilgal commented 7 years ago

Hey! I have just seen your post because I had the same problem. The key is that the script does not produces the file itself. You have to run it in this way:

python hisat2_extract_splice_sites.py path/annotation.gtf > path/splicingsites.txt

hsiaoyi0504 commented 6 years ago

@infphilo I think this can be closed.