Open petermr opened 9 years ago
Garbles and their corrections can be treated as follows:
original:
<otu id="otu17">Ch/orobium tepidum TLST (NC_OO2932)</otu>
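Note this has three errors (a "/" in place of "l" in the genus, and two letter "O"s in place of zeros in the accession number). We can correct to: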
<otu id="otu17"
cm:genus="Chlorobium"
cm:genusEdit="slash2l"
cm:species="tepidum"
cm:strain="TLST"
cm:ncid="NC_002932"
cm:ncidEdit="o2O o2O"
>Ch/orobium tepidum TLST (NC_OO2932)</otu>
In this the original (the "surface") is preserved and the corrected values are in accessible form. We can count the edits (1 genus and 2 ncid). This allows us to revise our strategy later (e.g. if we don't trust one of the edits, or want to add new ones).
Assuming nexml.org is happy with the NEXML then I shall go with this.
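A minimal sketch of how the edits could be counted from the annotated element, using only the Python standard library. The namespace URI `http://example.org/contentmine` is a placeholder (the real contentmine URI is not given in this thread), the `<otus>` wrapper is cut down to the bare minimum, and `count_edits` is a hypothetical helper. It also assumes that multiple edits in one `cm:*Edit` attribute are whitespace-separated, as in `cm:ncidEdit` above.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI; the real contentmine URI is an open question.
CM_NS = "http://example.org/contentmine"

ANNOTATED = f"""<otus xmlns:cm="{CM_NS}">
  <otu id="otu17"
       cm:genus="Chlorobium"
       cm:genusEdit="slash2l"
       cm:species="tepidum"
       cm:strain="TLST"
       cm:ncid="NC_002932"
       cm:ncidEdit="o2O o2O"
  >Ch/orobium tepidum TLST (NC_OO2932)</otu>
</otus>"""


def count_edits(otu):
    """Return {field: number of edits} for every cm:*Edit attribute on an otu."""
    prefix = "{" + CM_NS + "}"
    counts = {}
    for name, value in otu.attrib.items():
        if name.startswith(prefix) and name.endswith("Edit"):
            field = name[len(prefix):-len("Edit")]
            # Assumption: edits within one attribute are whitespace-separated.
            counts[field] = len(value.split())
    return counts


root = ET.fromstring(ANNOTATED)
otu = root.find("otu")
edit_counts = count_edits(otu)

# The surface text is untouched; the corrections live in the attributes.
print(otu.text)
print(edit_counts)
```

Because the edit codes sit next to the corrected values, dropping an untrusted edit later only means deleting one attribute pair, not re-parsing the garbled label.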
Have submitted simple annotated NEXML to nexml.org and got JSON back:
"nexml": {
"otus":
{
"otu":
[
{
"$t": "Bacillus subtilis 168 (NC_00964)",
"@cm:species": "subtilis",
"@id": "otu1"
},
The tool has added the attribute without complaining, so we'll go with that.
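As a rough check that the annotation survives the round trip, the returned JSON could be scanned for `@cm:` keys. This is only a sketch: the fragment above is truncated, so the version below is closed off by hand around the single `otu` shown, and `annotations` is a hypothetical helper.

```python
import json

# The fragment shown above, completed by hand so that it parses;
# the real response contains more than this.
RESPONSE = """
{
  "nexml": {
    "otus": {
      "otu": [
        {
          "$t": "Bacillus subtilis 168 (NC_00964)",
          "@cm:species": "subtilis",
          "@id": "otu1"
        }
      ]
    }
  }
}
"""


def annotations(otu):
    """Collect the cm: attributes of one otu, dropping the leading '@'."""
    return {key[1:]: value for key, value in otu.items() if key.startswith("@cm:")}


otus = json.loads(RESPONSE)["nexml"]["otus"]["otu"]
found = {otu["@id"]: annotations(otu) for otu in otus}
print(found)
```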
Also submitted an invalid XML file (without the xmlns:cm declaration) and got:
:4: namespace error : Namespace prefix cm for species on otu is not defined
<otu id="otu1" cm:species="subtilis">Bacillus subtilis 168 (NC_00964)</otu>
^
which is exactly how it should be.
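The same behaviour can be reproduced locally before submitting anything. A small sketch with Python's standard-library parser, which is namespace-aware and rejects an unbound prefix; the namespace URI is again a placeholder and `is_well_formed` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The otu from the error message above, with no xmlns:cm declaration in scope.
UNDECLARED = (
    '<otu id="otu1" cm:species="subtilis">'
    "Bacillus subtilis 168 (NC_00964)</otu>"
)

# Same element with the prefix bound to a placeholder namespace URI.
DECLARED = UNDECLARED.replace(
    "<otu ", '<otu xmlns:cm="http://example.org/contentmine" ', 1
)


def is_well_formed(xml_text):
    """Return True if the text parses, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


undeclared_ok = is_well_formed(UNDECLARED)
declared_ok = is_well_formed(DECLARED)
print(undeclared_ok, declared_ok)
```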
The primary output of ami-phylo is NEXML and the current output validates against nexml.org. I propose we use NEXML to aggregate per-OTU logging information.
The proposed extension will be something like
This introduces a new namespace (for contentmine) and allows us to annotate without crashing. Normal NEXML parsers will ignore our new attributes. It makes it easy to extract information using XPath, e.g. to search for all genera except Homo (there's a bit of XML namespace stuff to be added). This makes the search more precise than a contextless grep, for example. (more on garbles follows)
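A sketch of the "all genera except Homo" search, standard library only. The namespace URI is a placeholder, the four `otu` entries are made-up illustration data (not output from ami-phylo), and `genera_except` is a hypothetical helper. With a full XPath 1.0 engine the same query could probably be written as `//otu[@cm:genus != 'Homo']` with the `cm` prefix bound; the version below filters in Python instead, to avoid relying on the limited XPath subset in `ElementTree`.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI and made-up sample data, for illustration only.
CM_NS = "http://example.org/contentmine"

DOC = f"""<otus xmlns:cm="{CM_NS}">
  <otu id="otu1" cm:genus="Bacillus" cm:species="subtilis">Bacillus subtilis 168</otu>
  <otu id="otu2" cm:genus="Homo" cm:species="sapiens">Homo sapiens</otu>
  <otu id="otu3" cm:genus="Chlorobium" cm:species="tepidum">Ch/orobium tepidum TLST</otu>
  <otu id="otu4">Homology group (no annotation)</otu>
</otus>"""


def genera_except(root, excluded):
    """Return cm:genus values of annotated otus whose genus is not `excluded`."""
    genus_attr = "{" + CM_NS + "}genus"
    return [
        otu.get(genus_attr)
        for otu in root.iter("otu")
        if otu.get(genus_attr) not in (None, excluded)
    ]


non_homo = genera_except(ET.fromstring(DOC), "Homo")
print(non_homo)
```

Note that `otu4` contains the string "Homo" in its label but has no genus annotation, so a plain grep would match it while the attribute-based search ignores it.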