TBroTeam / TBro

Visualization and management of denovo transcriptomes
https://tbroteam.github.io/TBro/

Problems with uploading mapman annotations #45

Closed nterhoeven closed 7 years ago

nterhoeven commented 7 years ago

I want to upload mapman annotations generated with mercator.

I used the Mercator web tool (http://www.plabipd.de/portal/mercator-sequence-annotation) to generate the files. When I try to upload them, I get the following error:

tbro-import annotation_mapman --skip-materialize-views --organism_id 20 --release 1.0.0 mapman.split-1-IPR.txt                

importing mapman.split-1-IPR.txt as annotation_mapman
[                                                 ]    200/167417(  0.12%), elapsed: 00:00.03 , remaining est.: 00:47.02
Error: SQLSTATE[23502]: Not null violation: 7 ERROR:  null value in column "cvterm_id" violates not-null constraint
DETAIL:  Failing row contains (5827, null, 42832, 'phytol'  'PS.lightreaction, chlorophyll', 0).
Type "/bin/tbro-import --help" to get help.
Type "/bin/tbro-import <command> --help" to get help on specific command.

Sample from the file I am trying to use:

BINCODE NAME    IDENTIFIER  DESCRIPTION TYPE
'0' 'control genes' ''  ''  
'1' 'PS'    'TRINITY_GG_47742_c1_g1_i1' '(at5g38660 : 212.0) mutant has Altered acclimation responses;; ACCLIMATION OF PHOTOSYNTHESIS TO  ENVIRONMENT (APE1); CONTAINS InterPro DOMAIN/s: Protein of unknown function DUF2854 (InterPro:IPR021275), Proteasome maturation factor UMP1 (InterPro:IPR008012); BEST Arabidopsis thaliana protein match is: Proteasome maturation factor UMP1 (TAIR:AT5G38650.1). & (loc_os02g55640.1 : 152.0) no description available & (gnl|cdd|38271 : 140.0) no description available & (gnl|cdd|68901 : 140.0) no description available & (ipr008012 : 103.146324) Proteasome maturation factor UMP1 & (chl4|516344 : 87.4) no description available & (reliability: 424.0) &  (original description: no original description)'   T
'1' 'PS'    'TRINITY_GG_80266_c2_g1_i1' '(at5g38660 : 217.0) mutant has Altered acclimation responses;; ACCLIMATION OF PHOTOSYNTHESIS TO  ENVIRONMENT (APE1); CONTAINS InterPro DOMAIN/s: Protein of unknown function DUF2854 (InterPro:IPR021275), Proteasome maturation factor UMP1 (InterPro:IPR008012); BEST Arabidopsis thaliana protein match is: Proteasome maturation factor UMP1 (TAIR:AT5G38650.1). & (loc_os02g55640.1 : 152.0) no description available & (gnl|cdd|38271 : 142.0) no description available & (gnl|cdd|68901 : 137.0) no description available & (ipr008012 : 97.912544) Proteasome maturation factor UMP1 & (chl4|516344 : 89.7) no description available & (reliability: 434.0) &  (original description: no original description)'    T
'1.1'   'PS.lightreaction'  ''  ''  
'1.1.1' 'PS.lightreaction.photosystem II'   ''  ''  
'1.1.1.1'   'PS.lightreaction.photosystem II.LHC-II'    'TRINITY_GG_44259_c0_g1_i1' '(at3g08940 : 449.0) Lhcb4.2 protein (Lhcb4.2, protein involved in the light harvesting complex of photosystem II; light harvesting complex photosystem II (LHCB4.2); FUNCTIONS IN: chlorophyll binding; INVOLVED IN: response to blue light, response to red light, response to far red light; LOCATED IN: thylakoid, chloroplast thylakoid membrane, chloroplast, membrane, chloroplast envelope; EXPRESSED IN: 27 plant structures; EXPRESSED DURING: 14 growth stages; CONTAINS InterPro DOMAIN/s: Chlorophyll A-B binding protein (InterPro:IPR001344); BEST Arabidopsis thaliana protein match is: light harvesting complex photosystem II (TAIR:AT5G01530.1); Has 2312 Blast hits to 2241 proteins in 218 species: Archae - 0; Bacteria - 0; Metazoa - 4; Fungi - 0; Plants - 1976; Viruses - 0; Other Eukaryotes - 332 (source: NCBI BLink). & (loc_os07g37240.1 : 441.0) no description available & (q93wd2|cb29_chlre : 242.0) Chlorophyll a-b binding protein CP29 - Chlamydomonas reinhardtii & (chl4|517514 : 242.0) no description available & (gnl|cdd|84819 : 167.0) no description available & (ipr022796 : 113.24219) Chlorophyll A-B binding protein & (reliability: 898.0) &  (original description: no original description)'  T
'1.1.1.1'   'PS.lightreaction.photosystem II.LHC-II'    'TRINITY_GG_48217_c0_g1_i1' '(at5g54270 : 463.0) Lhcb3 protein is a component of the main light harvesting chlorophyll a/b-protein complex of Photosystem II (LHC II).; light-harvesting chlorophyll B-binding protein 3 (LHCB3); FUNCTIONS IN: structural molecule activity; INVOLVED IN: photosynthesis; LOCATED IN: light-harvesting complex, thylakoid, chloroplast thylakoid membrane, chloroplast; EXPRESSED IN: 24 plant structures; EXPRESSED DURING: 14 growth stages; CONTAINS InterPro DOMAIN/s: Chlorophyll A-B binding protein (InterPro:IPR001344); BEST Arabidopsis thaliana protein match is: photosystem II light harvesting complex gene 2.1 (TAIR:AT2G05100.1); Has 1807 Blast hits to 1807 proteins in 277 species: Archae - 0; Bacteria - 0; Metazoa - 736; Fungi - 347; Plants - 385; Viruses - 0; Other Eukaryotes - 339 (source: NCBI BLink). & (loc_os07g37550.1 : 452.0) no description available & (p27523|cb23_horvu : 443.0) Chlorophyll a-b binding protein of LHCII type III, chloroplast precursor (CAB) - Hordeum vulgare (Barley) & (chl4|520113 : 324.0) no description available & (gnl|cdd|84819 : 180.0) no description available & (ipr022796 : 114.487404) Chlorophyll A-B binding protein & (reliability: 926.0) &  (original description: no original description)'   T
'1.1.1.1'   'PS.lightreaction.photosystem II.LHC-II'    'TRINITY_GG_49904_c0_g1_i1' '(p27495|cb24_tobac : 483.0) Chlorophyll a-b binding protein 40, chloroplast precursor (LHCII type I CAB-40) (LHCP) - Nicotiana tabacum (Common tobacco) & (loc_os01g41710.1 : 443.0) no description available & (at1g29930 : 431.0) Subunit of light-harvesting complex II (LHCII),which absorbs light and transfers energy to the photosynthetic reaction center.; chlorophyll A/B binding protein 1 (CAB1); FUNCTIONS IN: chlorophyll binding; INVOLVED IN: photosynthesis; LOCATED IN: light-harvesting complex, thylakoid, chloroplast thylakoid membrane, chloroplast; EXPRESSED IN: guard cell, juvenile leaf, cultured cell; CONTAINS InterPro DOMAIN/s: Chlorophyll A-B binding protein (InterPro:IPR001344); BEST Arabidopsis thaliana protein match is: chlorophyll A/B binding protein 3 (TAIR:AT1G29910.1); Has 2419 Blast hits to 2339 proteins in 222 species: Archae - 0; Bacteria - 0; Metazoa - 4; Fungi - 0; Plants - 2091; Viruses - 0; Other Eukaryotes - 324 (source: NCBI BLink). & (chl4|520113 : 360.0) no description available & (gnl|cdd|84819 : 181.0) no description available & (ipr022796 : 114.72379) Chlorophyll A-B binding protein & (reliability: 862.0) &  (original description: no original description)'   T
'1.1.1.1'   'PS.lightreaction.photosystem II.LHC-II'    'TRINITY_GG_49904_c0_g2_i1' '(p27495|cb24_tobac : 479.0) Chlorophyll a-b binding protein 40, chloroplast precursor (LHCII type I CAB-40) (LHCP) - Nicotiana tabacum (Common tobacco) & (loc_os01g41710.1 : 441.0) no description available & (at1g29930 : 427.0) Subunit of light-harvesting complex II (LHCII),which absorbs light and transfers energy to the photosynthetic reaction center.; chlorophyll A/B binding protein 1 (CAB1); FUNCTIONS IN: chlorophyll binding; INVOLVED IN: photosynthesis; LOCATED IN: light-harvesting complex, thylakoid, chloroplast thylakoid membrane, chloroplast; EXPRESSED IN: guard cell, juvenile leaf, cultured cell; CONTAINS InterPro DOMAIN/s: Chlorophyll A-B binding protein (InterPro:IPR001344); BEST Arabidopsis thaliana protein match is: chlorophyll A/B binding protein 3 (TAIR:AT1G29910.1); Has 2419 Blast hits to 2339 proteins in 222 species: Archae - 0; Bacteria - 0; Metazoa - 4; Fungi - 0; Plants - 2091; Viruses - 0; Other Eukaryotes - 324 (source: NCBI BLink). & (chl4|520113 : 360.0) no description available & (gnl|cdd|84819 : 181.0) no description available & (ipr022796 : 114.72379) Chlorophyll A-B binding protein & (reliability: 854.0) &  (original description: no original description)'   T

Is this a problem with the import script or the data?

iimog commented 7 years ago

Thanks for reporting this. The default output of MapMan/Mercator might have changed since this feature was implemented. According to the source code, the importer expects a file that is divided into three sections (header, mapping, footer):

//if..elseif..else: check which section we are in
// header, looks like <BINCODE>\t<H_DESC>
// ...
//mapping, looks like <BINCODE>, <H_DESC>, <srcfeature_name>, <feature_description>, "T"
// ...
//footer, looks like: <BINCODE>, <H_DESC>, <CHEM>, <C_DESC>, "M"

All bincodes have to be defined in the header, because that is where the dbxrefs and cvterms are created and stored for use when parsing the mapping section. If a bincode is missing from the header, its cvterm will be missing when the mapping section is parsed. The footer only enriches the bincodes with further properties, so it is not essential. Please check whether the bincode of the line that triggers the error is present in the header.
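One way to run that check without reading the file by hand is a small standalone script. The following is only a rough sketch and not part of tbro-import: the function names are made up for illustration, and it assumes the file is tab-separated with single-quoted fields, that rows whose fifth column is "T" or "M" are mapping or footer rows, and that every other row defines a bincode. It reports rows that use a bincode before any defining row for it has been seen.

```python
def split_fields(line):
    """Split one tab-separated row and strip whitespace and single quotes.

    Assumption: fields are wrapped in single quotes, as in the posted sample.
    """
    return [field.strip().strip("'") for field in line.rstrip("\n").split("\t")]


def find_undefined_bincodes(lines):
    """Return (line_number, bincode) pairs for mapping/footer rows whose
    bincode was not defined by an earlier header-style row."""
    defined = set()
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = split_fields(line)
        # Skip blank lines and the column-title line.
        if not fields[0] or fields[0] == "BINCODE":
            continue
        bincode = fields[0]
        row_type = fields[4] if len(fields) > 4 else ""
        if row_type in ("T", "M"):
            # Mapping ("T") or footer ("M") row: the bincode must already exist.
            if bincode not in defined:
                problems.append((number, bincode))
        else:
            # Header-style row: this is where the bincode gets defined.
            defined.add(bincode)
    return problems


# Shortened rows modelled on the sample above (the description is cut down).
sample = [
    "BINCODE\tNAME\tIDENTIFIER\tDESCRIPTION\tTYPE",
    "'0'\t'control genes'\t''\t''\t",
    "'1'\t'PS'\t'TRINITY_GG_47742_c1_g1_i1'\t'(at5g38660 : 212.0) shortened'\tT",
    "'1.1'\t'PS.lightreaction'\t''\t''\t",
]
for number, bincode in find_undefined_bincodes(sample):
    print(f"line {number}: bincode {bincode} is used before it is defined")
```

To run it on a real file, pass an open file handle instead of the list, for example `find_undefined_bincodes(open("mapman.split-1-IPR.txt"))`. An empty result would suggest the cause lies elsewhere. A non-empty result only shows that the file does not match the header-first layout described above, which may or may not be the reason for the failing row.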

PS: If the output format of MapMan did indeed change, would you be willing to help extend the importer to accept the new format as well?