genomeannotation / GAG

Generates an NCBI .tbl file of annotations on a genome.
MIT License

New features (refs #163) #182

Closed vivekkrish closed 7 years ago

vivekkrish commented 7 years ago

Per discussion in #163, we're issuing a pull request (to the dev branch) containing the following changes:

  1. Support for various XRNA types: 'mRNA', 'tRNA', 'rRNA', 'ncRNA', 'miRNA', 'miRNA_primary_transcript', 'snRNA', 'snoRNA', 'lnc_RNA', 'antisense_lncRNA', 'antisense_RNA', 'pseudogenic_transcript', 'pseudogenic_tRNA', 'transcript_region'

  2. Support for multiple Gene types: 'gene', 'pseudogene', 'transposable_element_gene'

  3. Support for non-Gene (orphan) feature types: 'iDNA' (Internal Eliminated Sequence), 'misc_feature' (Chromosome Breakage Sequence)

  4. Extend the annotations TSV file to support a 4-column format. The 4th column specifies the feature type to which the annotation should be attached. Example:

    AT1G02145   db_xref TAIR:AT1G02145  gene
    AT1G02145   gene    ALG12   gene
    AT1G02145   gene_syn    EBS4    gene
    AT1G02145   gene_syn    EMS-MUTAGENIZED BRI1(BRASSINOSTEROID INSENSITIVE 1) SUPPRESSOR 4    gene
    AT1G02145   gene_syn    homolog of asparagine-linked glycosylation 12   gene
    AT1G02145.1 db_xref TAIR:AT1G02145  CDS
    AT1G02145.1 db_xref TAIR:AT1G02145  mRNA
    AT1G02145.1 inference   Similar to RNA sequence, EST:INSD:EG518891.1,INSD:BP577734.1,INSD:BP581601.1, INSD:EG518892.1,INSD:BP575147.1,INSD:EL046732.1, INSD:EG492351.1,INSD:EL075755.1,INSD:EG501755.1, INSD:EG518889.1,INSD:AU226395.1,INSD:BP577546.1, INSD:EG518890.1    mRNA
    AT1G02145.1 inference   Similar to RNA sequence, EST:INSD:EG518891.1,INSD:EG464605.1,INSD:BP577734.1, INSD:EG492362.1,INSD:BP581601.1,INSD:EG492385.1, INSD:EG518892.1,INSD:EG445265.1,INSD:EG492396.1, INSD:BP575147.1,INSD:EL046732.1,INSD:EG464599.1, INSD:EL075755.1,INSD:EG492351.1,INSD:EG492329.1, INSD:EG518889.1,INSD:BP577546.1,INSD:AU226395.1, INSD:EG492307.1,INSD:EG518890.1  CDS
    AT1G02145.1 inference   similar to RNA sequence, mRNA:INSD:EF183364.1,INSD:DQ492199.1   CDS
    AT1G02145.1 note    homolog of asparagine-linked glycosylation 12 (ALG12); FUNCTIONS IN: alpha-1,6-mannosyltransferase activity; INVOLVED IN: ER-associated protein catabolic process, protein amino acid terminal N-glycosylation; LOCATED IN: endomembrane system, intrinsic to endoplasmic reticulum membrane; CONTAINS InterPro DOMAIN/s: Alg9-like mannosyltransferase (InterPro:IPR005599). CDS
    AT1G02145.1 product homolog of asparagine-linked glycosylation 12   CDS
    AT1G02145.1 product homolog of asparagine-linked glycosylation 12   mRNA
    AT1G02145.1 protein_id  AEE27389.1  CDS

  5. Other minor changes:
     i. Allow specifying a Genome Center (which is encoded into the {transcript,protein}_id)
     ii. Allow specifying a Whole Genome Sequence (WGS) accession prefix (used in place of a Genome Center Tag)
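
To make the 4-column annotations format from item 4 concrete, here is a minimal Python sketch of how such a file could be read. It is not the code in this pull request: the function name `parse_annotations` and the `default_feature` parameter are hypothetical, and it assumes that columns are tab-separated and that 3-column rows (no feature type) are still accepted.

```python
import io
from collections import defaultdict


def parse_annotations(handle, default_feature=None):
    """Group annotation rows by (feature ID, feature type).

    Hypothetical helper, not GAG's actual parser. Assumes tab-separated
    columns: ID, key, value and, optionally, feature type.
    """
    annotations = defaultdict(list)
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        columns = line.split("\t")
        if len(columns) == 3:
            # Assumed behaviour: a row without a 4th column is not tied
            # to any particular feature type.
            feature_id, key, value = columns
            feature_type = default_feature
        elif len(columns) == 4:
            feature_id, key, value, feature_type = columns
        else:
            raise ValueError(
                "line %d: expected 3 or 4 columns, got %d"
                % (line_number, len(columns))
            )
        annotations[(feature_id, feature_type)].append((key, value))
    return dict(annotations)


# A few rows taken from the example above, with tabs restored.
sample = (
    "AT1G02145\tdb_xref\tTAIR:AT1G02145\tgene\n"
    "AT1G02145\tgene\tALG12\tgene\n"
    "AT1G02145.1\tproduct\thomolog of asparagine-linked glycosylation 12\tCDS\n"
    "AT1G02145.1\tproduct\thomolog of asparagine-linked glycosylation 12\tmRNA\n"
    "AT1G02145.1\tprotein_id\tAEE27389.1\tCDS\n"
)
parsed = parse_annotations(io.StringIO(sample))
```

Keying on the (ID, feature type) pair lets the same transcript ID carry different annotations for its mRNA and CDS features, which is the situation shown for AT1G02145.1 in the example rows.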

Due to the introduction of change 4 (feature-type-specific annotations), the code now fails the following tests:

    $ ./test/gene_tests.py 2>&1 | grep '^FAIL'
    FAIL: test_to_tbl_positive (__main__.TestGene)
    FAIL: test_to_tbl_positive_nostart_nostop (__main__.TestGene)
    FAIL: test_to_tbl_positive_nostart_stop (__main__.TestGene)
    FAIL: test_to_tbl_positive_start_nostop (__main__.TestGene)
    FAIL: test_to_tbl_positive_with_name (__main__.TestGene)
    FAILED (failures=5)

    $ ./test/xrna_tests.py 2>&1 | grep '^FAIL'
    FAIL: test_to_tbl_replace_Dbxref_with_db_xref (__main__.TestXRNA)
    FAIL: test_to_tbl_with_annotations (__main__.TestXRNA)
    FAIL: test_to_tbl_with_product (__main__.TestXRNA)
    FAILED (failures=3)