LuyiTian / FLAMES

Full-length transcriptome splicing and mutation analysis
GNU General Public License v3.0
73 stars 11 forks source link

GTF format error? #9

Closed danledinh closed 4 years ago

danledinh commented 4 years ago

Do I need a specific GFF/GTF format?

I am getting this error:

Traceback (most recent call last):
  File "/FLAMES/python/bulk_long_pipeline.py", line 243, in <module>
    bulk_long_pipeline(args)
  File "/FLAMES/python/bulk_long_pipeline.py", line 171, in bulk_long_pipeline
    gff3_to_bed12(args.minimap2_dir, args.gff3, tmp_bed)
  File "/FLAMES/python/minimap2_align.py", line 17, in gff3_to_bed12
    print subprocess.check_output([cmd], shell=True, stderr=subprocess.STDOUT)
  File "/miniconda/lib/python2.7/subprocess.py", line 223, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['paftools.js gff2bed /gnet/is6/p04/data/dnaseq/analysis/led13/genomes/GCA_000001405.15_GRCh
38_full_analysis_set.refseq_annotation.gtf > /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6310_q10_l300_flames/tmp.splice_anno.
bed12']' returned non-zero exit status 1

here is the head of my GTF file:

#gtf-version 2.2
#!genome-build GRCh38
#!genome-build-accession NCBI_Assembly:GCA_000001405.15
#!annotation-date 01/25/2019
#!annotation-source NCBI Homo sapiens Updated Annotation Release 109.20190125
chr1    BestRefSeq      gene    11874   14409   .       +       .       gene_id "DDX11L1"; db_xref "GeneID:100287102"; db_xref "HGNC
:HGNC:37102"; description "DEAD/H-box helicase 11 like 1"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseu
do "true";
chr1    BestRefSeq      exon    11874   12227   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "Gen
eID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1"; exon_number "1";
chr1    BestRefSeq      exon    12613   12721   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "Gen
eID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1"; exon_number "2";
chr1    BestRefSeq      exon    13221   14409   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "Gen
eID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1"; exon_number "3";
chr1    BestRefSeq      gene    14362   29370   .       -       .       gene_id "WASH7P"; db_xref "GeneID:653635"; db_xref "HGNC:HGN
C:38034"; description "WAS protein family homolog 7, pseudogene"; gbkey "Gene"; gene "WASH7P"; gene_biotype "transcribed_pseudogene
LuyiTian commented 4 years ago

Yes I am aware of such problem. paftools.js is provided within minimap2 to process the gtf and generate exon boundarys, which helps the long read alignment. It is not very carefully written to provide full compatibility of different gtf and gff files. the gtf/gff format is very messy in its last column, which often contains essential informations such as gene id. The script works for Ensembl and Gencode gtf/gff format so you can try that.

BTW have you looked at this: https://github.com/lh3/minimap2/issues/422 they seems to use the same gtf file and it is said to be fixed.