MICommunity / psimi

Automatically exported from code.google.com/p/psimi
Creative Commons Attribution 4.0 International

Tab2Xml and/or Xml2Tab converters replace/delete MITAB columns #5

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Read in a mitab line via PsimiTabReader
2. Convert the Collection of BinaryInteraction to an EntrySet using the
Tab2Xml converter 
3. Convert the EntrySet back to a Collection of BinaryInteractions using
the Xml2Tab converter
4. Convert the BinaryInteractions to a MITAB string and compare to the
original.

The following code snippet reproduces the bug:

String mitab = "your mitab line here";
PsimiTabReader reader = new PsimiTabReader(false);
Collection<BinaryInteraction> binaryInteractions = reader.read(mitab);
// Round trip: TAB -> XML -> TAB
Tab2Xml tab2xml = new Tab2Xml();
EntrySet entrySet = tab2xml.convert(binaryInteractions);
Xml2Tab xml2tab = new Xml2Tab();
Collection<BinaryInteraction> binaryInteractionsConverted = xml2tab.convert(entrySet);
// Serialize both collections back to MITAB so they can be compared
String mitabOriginal = createMitabResults((List<BinaryInteraction>) binaryInteractions);
String mitabConverted = createMitabResults((List<BinaryInteraction>) binaryInteractionsConverted);
System.out.println("MITAB original : " + mitabOriginal);
System.out.println("MITAB converted: " + mitabConverted);

For completeness, here is createMitabResults:

// NEW_LINE is a line-separator constant defined elsewhere in the calling class (not shown).
protected String createMitabResults(List<BinaryInteraction> binaryInteractions) {
    MitabDocumentDefinition docDef = new MitabDocumentDefinition();
    StringBuilder sb = new StringBuilder(binaryInteractions.size() * 512);
    // One MITAB line per binary interaction
    for (BinaryInteraction binaryInteraction : binaryInteractions) {
        String binaryInteractionString = docDef.interactionToString(binaryInteraction);
        sb.append(binaryInteractionString);
        sb.append(NEW_LINE);
    }
    return sb.toString();
}

What is the expected output? What do you see instead?

Here is an example with a MITAB line from MINT:

MITAB original:
uniprotkb:O60828       uniprotkb:Q9Y2W2        -       -       uniprotkb:38
kDa nuclear protein containing a WW domain(gene name
synonym)|uniprotkb:Polyglutamine tract-binding
protein 1(gene name synonym)|uniprotkb:Npw38(gene name
synonym)|uniprotkb:PQBP-1(gene name synonym)|uniprotkb:JM26(orf
name)|uniprotkb:NPW38(gene name synonym)|uniprotkb:PQBP1(gene name)     
uniprotkb:SH3 domain-binding protein SNP70(gene name
synonym)|uniprotkb:Npw38-binding protein(gene name
synonym)|uniprotkb:SNP70(gene name synonym)|uniprotkb:NPWBP(gene name
synonym)|uniprotkb:WBP11(gene name) psi-mi:"MI:0018"(two hybrid)    -     
 pubmed:16713569 taxid:9606(Homo sapiens)    taxid:9606(Homo sapiens)   
psi-mi:"MI:0915"(physical association)  psi-mi:"MI:0471"(mint) 
mint:MINT-2873564        mint-score:0.36|homomint-score:0.375

MITAB converted: 
uniprotkb:O60828       uniprotkb:Q9Y2W2        uniprotkb:PQBP1(gene name) 
    uniprotkb:WBP11(gene name)      uniprotkb:38 kDa nuclear protein
containing a WW domain(gene name synonym)|uniprotkb:Polyglutamine
tract-binding protein 1(gene name synonym)|uniprotkb:Npw38(gene name
synonym)|uniprotkb:PQBP-1(gene name synonym)|intact:JM26(orf
name)|uniprotkb:NPW38(gene name synonym)|intact:O60828(shortLabel)  
uniprotkb:SH3 domain-binding protein SNP70(gene name
synonym)|uniprotkb:Npw38-binding protein(gene name
synonym)|uniprotkb:SNP70(gene name synonym)|uniprotkb:NPWBP(gene name
synonym)|intact:Q9Y2W2(shortLabel) MI:0018(two hybrid)     -      
pubmed:16713569 taxid:9606(Homo sapiens)        taxid:9606(Homo sapiens)  
     MI:0915(physical association)    unknown:European Bioinformatics
Institute(European Bioinformatics Institute)    -       -

In the 13th column (source database), the EBI is now listed instead of MINT.
The original interaction id (column 14) and the confidence scores (column 15)
are missing.
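
A column-by-column comparison makes differences like these easier to spot than reading the raw lines. The following is a minimal sketch in plain Java that does not use the PSI-MI library; the class name `MitabColumnDiff`, the method `differingColumns` and the `firstColumnNumber` parameter are made up for illustration. It assumes the columns are tab-separated, which the line wrapping in this report no longer shows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the psidev.psi.mi code base.
class MitabColumnDiff {

    /**
     * Compares two tab-separated MITAB lines (or excerpts of them) and returns
     * the column numbers whose content differs. firstColumnNumber is the MITAB
     * column number of the first column passed in (1 for a full line).
     */
    static List<Integer> differingColumns(String originalLine, String convertedLine,
                                          int firstColumnNumber) {
        // A limit of -1 keeps trailing empty columns instead of dropping them.
        String[] original = originalLine.split("\t", -1);
        String[] converted = convertedLine.split("\t", -1);
        int columnCount = Math.max(original.length, converted.length);

        List<Integer> differing = new ArrayList<>();
        for (int i = 0; i < columnCount; i++) {
            // A column missing on one side is treated as empty.
            String before = i < original.length ? original[i].trim() : "";
            String after = i < converted.length ? converted[i].trim() : "";
            if (!before.equals(after)) {
                differing.add(firstColumnNumber + i);
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        // Columns 9 to 15 of the MINT example above.
        String original = String.join("\t",
                "pubmed:16713569",
                "taxid:9606(Homo sapiens)",
                "taxid:9606(Homo sapiens)",
                "psi-mi:\"MI:0915\"(physical association)",
                "psi-mi:\"MI:0471\"(mint)",
                "mint:MINT-2873564",
                "mint-score:0.36|homomint-score:0.375");
        String converted = String.join("\t",
                "pubmed:16713569",
                "taxid:9606(Homo sapiens)",
                "taxid:9606(Homo sapiens)",
                "MI:0915(physical association)",
                "unknown:European Bioinformatics Institute(European Bioinformatics Institute)",
                "-",
                "-");
        System.out.println("Differing columns: " + differingColumns(original, converted, 9));
    }
}
```

On this excerpt the helper reports columns 12 to 15. Column 12 is flagged as well because the converted line drops the `psi-mi:` prefix and the quotes around the term id, which is visible in the output pasted above.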

Using a slightly tweaked example from BioGRID (I added something in column 14,
as the parser will fail otherwise), things get a little worse:

MITAB original : 
entrez gene/locuslink:3069     entrez gene/locuslink:11260     entrez
gene/locuslink:HDLBP     entrez gene/locuslink:XPOT      entrez
gene/locuslink:FLJ16432|entrez gene/locuslink:HBP|entrez
gene/locuslink:PRO2900|entrez gene/locuslink:VGL entrez gene/locuslink:XPO3
     psi-mi:"MI:0401"(biochemical)   Kruse C (2000)  pubmed:10657246
taxid:9606      taxid:9606      psi-mi:"MI:0914"(association)   
psi-mi:"MI:0463"(GRID)  db:id   -

MITAB converted: 
entrez gene/locuslink:3069     entrez gene/locuslink:11260     -       -  

intact:FLJ16432|intact:HBP|intact:PRO2900|intact:VGL|intact:3069(shortLabel) 
  intact:XPO3|intact:11260(shortLabel)     MI:0401(biochemical)    Kruse et
al     pubmed:10657246 taxid:9606(9606)    taxid:9606(9606)   
MI:0914(association)    unknown:European Bioinformatics Institute(European
Bioinformatics Institute)     -       -

In addition to the above-mentioned changes in columns 13, 14 and 15,
several identifiers are now marked as intact, although in the input they were not.
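
One way to detect this kind of relabelling automatically is to compare the database prefixes used in a column before and after the round trip. The sketch below is plain Java and independent of the PSI-MI library; `MitabDatabasePrefixes`, `databasesOf` and `introducedDatabases` are hypothetical names. It assumes fields have the form `db:value(text)` separated by `|`, and it does not handle a `|` inside a quoted value.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of the psidev.psi.mi code base.
class MitabDatabasePrefixes {

    /** Collects the database prefixes ("db" in "db:value(text)") used in one MITAB column. */
    static Set<String> databasesOf(String column) {
        Set<String> databases = new LinkedHashSet<>();
        if (column == null) {
            return databases;
        }
        String trimmed = column.trim();
        // "-" marks an empty MITAB column.
        if (trimmed.isEmpty() || trimmed.equals("-")) {
            return databases;
        }
        for (String field : trimmed.split("\\|")) {
            // The database name is everything before the first colon.
            int colon = field.indexOf(':');
            if (colon > 0) {
                databases.add(field.substring(0, colon).trim());
            }
        }
        return databases;
    }

    /** Returns the databases that appear in the converted column but not in the original one. */
    static Set<String> introducedDatabases(String originalColumn, String convertedColumn) {
        Set<String> introduced = databasesOf(convertedColumn);
        introduced.removeAll(databasesOf(originalColumn));
        return introduced;
    }

    public static void main(String[] args) {
        // Alias column of interactor A from the BioGRID example above.
        String original = "entrez gene/locuslink:FLJ16432|entrez gene/locuslink:HBP"
                + "|entrez gene/locuslink:PRO2900|entrez gene/locuslink:VGL";
        String converted = "intact:FLJ16432|intact:HBP|intact:PRO2900|intact:VGL"
                + "|intact:3069(shortLabel)";
        System.out.println("Introduced databases: " + introducedDatabases(original, converted));
    }
}
```

For the BioGRID alias column this reports `intact` as the only database that was not present in the input.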

What version of the product are you using? On what operating system?

Windows XP Pro SP3, with the following PSI-MI dependency:
<dependency>
<groupId>psidev.psi.mi</groupId>
<artifactId>psimitab-search</artifactId>
<version>1.7.7-SNAPSHOT</version>
</dependency>

Please provide any additional information below.

Original issue reported on code.google.com by hagen.bl...@googlemail.com on 6 Oct 2009 at 8:32

GoogleCodeExporter commented 9 years ago

Original comment by brunoaranda on 6 Oct 2009 at 10:50

GoogleCodeExporter commented 9 years ago

Original comment by brunoaranda on 6 Oct 2009 at 12:00

GoogleCodeExporter commented 9 years ago
Hi,

I have made quite a few changes in the code to deal with the problems described.
However, not everything could be addressed, as some limitations of the XML model
make a lossless TAB->XML->TAB roundtrip impossible. Issues addressed:

- Authors are converted properly, including the publication year.
- Aliases no longer get the db "intact" by default.
- The source database is converted properly. A "source reference" xref is created in the XML.
- The interaction AC can point to any database.
- Confidence values are no longer ignored.

However, there is a limitation when converting aliases and alternative
identifiers to XML and back. In the XML, the alias fields do not contain the
database they refer to, so we cannot know which db field to use when creating
the alias fields from the XML. For now, we set it to "unknown" by default. This
will stay that way unless the XML model changes or we add specific cases to the
conversions (we could store the database as attributes with some specific
labels). TAB and XML were not designed with such a roundtrip in mind, and some
data is expected to be lost in the process.
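
To illustrate the limitation and the workaround mentioned above, here is a minimal sketch using simplified stand-in types. `TabAlias`, `XmlAlias`, the `alias-db:` label prefix and the attribute map are all hypothetical and are not the classes or conventions used by the converters; the map only stands in for labelled attributes stored alongside the XML.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model; not the psidev.psi.mi classes.
class AliasRoundTripSketch {

    /** A MITAB alias: db:name(type). */
    static final class TabAlias {
        final String db;
        final String name;
        final String type;

        TabAlias(String db, String name, String type) {
            this.db = db;
            this.name = name;
            this.type = type;
        }

        String toMitab() {
            return db + ":" + name + (type == null ? "" : "(" + type + ")");
        }
    }

    /** An XML alias in this sketch: a name and a type, but no database. */
    static final class XmlAlias {
        final String name;
        final String type;

        XmlAlias(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    // Assumed label convention for the workaround; invented for this example.
    static final String LABEL_PREFIX = "alias-db:";

    /** TAB -> XML: the database has nowhere to go and is dropped. */
    static XmlAlias toXml(TabAlias alias) {
        return new XmlAlias(alias.name, alias.type);
    }

    /** XML -> TAB: the database cannot be recovered, so "unknown" is used. */
    static TabAlias toTab(XmlAlias alias) {
        return new TabAlias("unknown", alias.name, alias.type);
    }

    /** TAB -> XML with the workaround: remember the database under a labelled attribute. */
    static XmlAlias toXml(TabAlias alias, Map<String, String> attributes) {
        attributes.put(LABEL_PREFIX + alias.name, alias.db);
        return toXml(alias);
    }

    /** XML -> TAB with the workaround: restore the database if a labelled attribute exists. */
    static TabAlias toTab(XmlAlias alias, Map<String, String> attributes) {
        String db = attributes.get(LABEL_PREFIX + alias.name);
        return new TabAlias(db != null ? db : "unknown", alias.name, alias.type);
    }

    public static void main(String[] args) {
        TabAlias original = new TabAlias("uniprotkb", "PQBP1", "gene name");

        // Plain roundtrip: the database is lost.
        System.out.println(toTab(toXml(original)).toMitab());

        // Roundtrip with labelled attributes: the database survives.
        Map<String, String> attributes = new LinkedHashMap<>();
        System.out.println(toTab(toXml(original, attributes), attributes).toMitab());
    }
}
```

Keying the attribute by alias name is the simplest choice for a sketch, but it would collide if the same alias name occurred under two different databases; a real implementation would need a less ambiguous key.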

I hope this addresses most of the issues in a satisfactory way.

The changes are now part of 1.7.7-SNAPSHOT.

Original comment by brunoaranda on 7 Oct 2009 at 9:07

GoogleCodeExporter commented 9 years ago

Original comment by brunoaranda on 7 Oct 2009 at 9:08

GoogleCodeExporter commented 9 years ago

Original comment by brunoaranda on 7 Oct 2009 at 9:08