ContentMine / phylotree

A repository for ami-phylotree development

NEXML OTU format #23

Open petermr opened 9 years ago

petermr commented 9 years ago

The primary output of ami-phylo is NEXML and the current output validates against nexml.org.

I propose we use NEXML to aggregate per-OTU logging information.

   Read the tree format (e.g. ijsem.xml)
     and generate regexes (level0 = detect, level1 = correct) and actions (abort, record error, etc.)
   Run HOCR
   Run diagramanalyzer
   Merge the two outputs to identify tips (otherwise we would also analyze non-tip text)
   foreach tip {
      check text against ijsem.xml (detect)
      if (ok) {
         create tip label in extended NEXML
      } else {
         correct text against level1
         if (ok) create tip, with edit record
      }
      if (!ok(tip)) {
         action(tip)
      }
   }
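
The per-tip loop above could be sketched in Python roughly as follows. This is only an illustration, not the ami-phylo implementation: the `LEVEL0` pattern, the `LEVEL1` rules and the function name `process_tip` are all hypothetical, and the real regexes would be generated from a format file such as ijsem.xml.

```python
import re

# Hypothetical level0 (detect) pattern: "Genus species strain (ACCESSION)".
# In practice this would be generated from a format file such as ijsem.xml.
LEVEL0 = re.compile(r"^[A-Z][a-z]+ [a-z]+ \S+ \([A-Z]{2}_?\d+\)$")

# Hypothetical level1 (correct) rules: (edit name, pattern, replacement).
LEVEL1 = [
    ("slash2l", re.compile(r"/"), "l"),            # "/" misread for "l"
    ("o2O", re.compile(r"O(?=[O\d]*\))"), "0"),    # letter O misread for zero in the accession
]

def process_tip(text):
    """Return (status, text, edits) for one tip label."""
    if LEVEL0.match(text):
        return "ok", text, []
    corrected, edits = text, []
    for name, pattern, replacement in LEVEL1:
        corrected, n = pattern.subn(replacement, corrected)
        edits.extend([name] * n)          # one edit record per substitution
    if LEVEL0.match(corrected):
        return "corrected", corrected, edits
    return "failed", text, []             # caller applies the configured action

print(process_tip("Jonquetella anthropi E3_33 (EU840722)"))
print(process_tip("Ch/orobium tepidum TLST (NC_OO2932)"))
```

The status tells the caller whether to create a plain tip, create a tip with an edit record, or fall through to the action (abort, record error, etc.).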

The proposed extension will be something like

original:
<nexml xmlns="http://www.nexml.org/2009" xmlns:nex="http://www.nexml.org/2009" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <otus label="RootTaxaBlock">

   <otu id="otu7">Jonquetella anthropi E3_33 (EU840722)</otu>
...
</otus>
</nexml>
...
new: 
<nexml xmlns="http://www.nexml.org/2009" xmlns:nex="http://www.nexml.org/2009" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:cm="http://www.contentmine.org/ami-phylo">
 <otus label="RootTaxaBlock">

<otu id="otu7" cm:genus="Jonquetella" cm:species="anthropi" cm:strain="E3_33" cm:ena="EU840722">Jonquetella anthropi E3_33 (EU840722)</otu>
...
</otus>
</nexml>

This introduces a new namespace (for ContentMine) and allows us to annotate without crashing existing tools: normal NEXML parsers will ignore our new attributes. It also makes it easy to extract information using XPath, e.g. to search for all genera except Homo:

//nex:otu/@cm:genus[not(.='Homo')]

(The nex and cm prefixes have to be bound to their namespaces in the XPath engine.) This makes the search more precise than, for example, a contextless grep. (More on garbles follows.)
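
As a rough sketch of what the namespace handling might look like, here is the same search using Python's standard `xml.etree.ElementTree`. ElementTree only supports a subset of XPath and cannot return attribute nodes directly, so the "not Homo" filter is done in Python. The sample document (including the Homo entry) and the function name `genera_except` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace bindings, as declared in the proposed NEXML.
NS = {
    "nex": "http://www.nexml.org/2009",
    "cm": "http://www.contentmine.org/ami-phylo",
}

# Hypothetical sample input, for illustration only.
SAMPLE = """<nexml xmlns="http://www.nexml.org/2009"
  xmlns:cm="http://www.contentmine.org/ami-phylo">
 <otus label="RootTaxaBlock">
  <otu id="otu1" cm:genus="Homo">Homo sapiens</otu>
  <otu id="otu7" cm:genus="Jonquetella">Jonquetella anthropi E3_33 (EU840722)</otu>
 </otus>
</nexml>"""

def genera_except(root, excluded="Homo"):
    """Return cm:genus values of all otu elements, skipping `excluded`."""
    key = "{%s}genus" % NS["cm"]      # ElementTree stores attributes as {uri}name
    return [
        otu.get(key)
        for otu in root.iterfind(".//nex:otu", NS)
        if otu.get(key) not in (None, excluded)
    ]

print(genera_except(ET.fromstring(SAMPLE)))
```

An engine with full XPath 1.0 support could evaluate the attribute expression directly once the two prefixes are bound.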

petermr commented 9 years ago

Garbles and their corrections can be treated as follows:

original:

<otu id="otu17">Ch/orobium tepidum TLST (NC_OO2932)</otu>

Note that this has three errors (a "/" read for "l", and the letter "O" read for zero twice). We can correct it to:

<otu id="otu17"
  cm:genus="Chlorobium"
  cm:genusEdit="slash2l"
  cm:species="tepidum"
  cm:strain="TLST"
  cm:ncid="NC_002932"
  cm:ncidEdit="o2O o2O"
>Ch/orobium tepidum TLST (NC_OO2932)</otu>

Here the original (the "surface") is preserved and the corrected values are available in accessible form. We can count the edits (1 genus and 2 ncid). This allows us to revise our strategy later (e.g. if we don't trust one of the edits, or want to add new ones).

Assuming nexml.org is happy with the NEXML then I shall go with this.
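
Counting the edits from the annotated otu above might look something like this in Python. It is a sketch under one assumption that is not stated in the proposal: each whitespace-separated token in a `cm:...Edit` attribute is taken to be one edit. The function name `count_edits` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CM = "http://www.contentmine.org/ami-phylo"

# The corrected otu from the comment above, with the cm namespace declared
# on the element itself so that the fragment parses on its own.
OTU_XML = (
    '<otu xmlns:cm="http://www.contentmine.org/ami-phylo" id="otu17" '
    'cm:genus="Chlorobium" cm:genusEdit="slash2l" cm:species="tepidum" '
    'cm:strain="TLST" cm:ncid="NC_002932" cm:ncidEdit="o2O o2O"'
    '>Ch/orobium tepidum TLST (NC_OO2932)</otu>'
)

def count_edits(otu):
    """Count edits per field, assuming one whitespace-separated token per edit."""
    prefix = "{%s}" % CM
    counts = Counter()
    for name, value in otu.attrib.items():
        if name.startswith(prefix) and name.endswith("Edit"):
            field = name[len(prefix):-len("Edit")]   # e.g. "genus", "ncid"
            counts[field] += len(value.split())
    return counts

otu = ET.fromstring(OTU_XML)
print(count_edits(otu))   # edit counts per field
print(otu.text)           # the uncorrected surface text is still there
```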

petermr commented 9 years ago

I have submitted a simple annotated NEXML file to nexml.org and got this JSON back:

"nexml": {
    "otus": {
        "otu": [
            {
                "$t": "Bacillus subtilis 168 (NC_00964)",
                "@cm:species": "subtilis",
                "@id": "otu1"
            },

The tool has passed the attribute through without complaining, so we'll go with that. I also submitted an invalid XML file (without the xmlns:cm declaration) and got:

:4: namespace error : Namespace prefix cm for species on otu is not defined
<otu id="otu1" cm:species="subtilis">Bacillus subtilis 168 (NC_00964)</otu>
^

which is exactly how it should be.
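
The same two checks can be approximated locally before submitting anything. The sketch below uses Python's standard `xml.etree.ElementTree`, which rejects an undeclared prefix at parse time. It only tests namespace well-formedness and is not a substitute for validation against nexml.org; the function name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET

GOOD = (
    '<nexml xmlns="http://www.nexml.org/2009" '
    'xmlns:cm="http://www.contentmine.org/ami-phylo"><otus>'
    '<otu id="otu1" cm:species="subtilis">Bacillus subtilis 168 (NC_00964)</otu>'
    '</otus></nexml>'
)
# Same document with the xmlns:cm declaration removed.
BAD = GOOD.replace(' xmlns:cm="http://www.contentmine.org/ami-phylo"', "")

def is_well_formed(xml_text):
    """Return (ok, message); an undeclared prefix such as cm: makes parsing fail."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```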