Open petermr opened 9 years ago
Latest output:
Bacillus subtilis 168 (NC_00964) => Bacillus subtilis 168 (NC_ 00964)
Lactoba => null
Desulfo => null
Proprinogenum modestus DSM 2376T (AJ307978) => Proprinogenum modestus DSM 2376T (AJ 307978)
Clostridium botulinum serotype e (M94261) => Clostridium botulinum serotype e (M 94261)
Streptococcus gordonii CH1 (NC_OO9785) => Streptococcus gordonii CH1 (NC_ 009785)
Jonquetella anthropi E3_33 (EU840722) => Jonquetella anthropi E3_33 (EU 840722)
Pseudomonas aeruginosa PAO1 (NC_OO2516) => Pseudomonas aeruginosa PAO1 (NC_ 002516)
Thermotoga maritime MSBBT (NC_O00853) => Thermotoga maritime MSBBT (NC_ 000853)
?Synechococcus elongatus? PCC 6301 (NC_0O6576) => ?Synechococcus elongatus? PCC 6301 (NC_ 006576)
Mycobacterium tuberculosis H37Ra (NC_0O9525) => Mycobacterium tuberculosis H37Ra (NC_ 009525)
Ochrobactrum anthropi ATCC 49188T (NC_0O9667) => Ochrobactrum anthropi ATCC 49188T (NC_ 009667)
Fusobacterium nucleatum DSM 20482 (AJ307974) => Fusobacterium nucleatum DSM 20482 (AJ 307974)
Caulobacter crescentus CB15 (NC_0O2696) => Caulobacter crescentus CB15 (NC_ 002696)
Es => null
Borrelia burgdorferi B31T (NC_O01218) => Borrelia burgdorferi B31T (NC_ 001218)
Chlorobium tepidum TLST (NC_OO2932) => Chlorobium tepidum TLST (NC_ 002932)
Finegoldia magna ATCC 29328 (NC_010376) => Finegoldia magna ATCC 29328 (NC_ 010376)
Bordetella pertussis Tohama (NC_0O2929) => Bordetella pertussis Tohama (NC_ 002929)
Neisseria gonorrhoeae FA1090 (NC_002946) => Neisseria gonorrhoeae FA1090 (NC_ 002946)
Pyramidobacter piscolens W5455T (EU379932) => Pyramidobacter piscolens W5455T (EU 379932)
Haemophilus influenzae RdKW20 (U32697) => Haemophilus influenzae RdKW20 (U 32697)
=> null
Synergistes jonesii ATCC 49833T (EU840723) => Synergistes jonesii ATCC 49833T (EU 840723)
Optiutus terrae PBQO-1T (NC_010571) => Optiutus terrae PBQO-1T (NC_ 010571)
Porphyromonas gingivalis W83 (AEO15924) => Porphyromonas gingivalis W83 (AE 015924)
Bacteroides fragi/is ATCC 252857 (NC_OO3228) => Bacteroides fragilis ATCC 252857 (NC_ 003228)
?Aquifex aeolicus? VF5 (NC_000918) => ?Aquifex aeolicus? VF5 (NC_ 000918)
Bifidobacterium longum NCC2705 (NC_0O4307) => Bifidobacterium longum NCC2705 (NC_ 004307)
Rhodopirellula baltica SH 1T (NC_005027) => Rhodopirellula baltica SH 1T (NC_ 005027)
Mycoplasma pneumoniae M129 (NC_00O912) => Mycoplasma pneumoniae M129 (NC_ 000912)
I'll fix the spurious space in the ID in the result.
The substitutions should be relatively easy to customize and are then automatic.
The substitution script:
<editor>
<patternList>
<pattern level="0">
<!-- this regex enforces the ID patterns strictly. It will only fail when
characters are garbled to the same type, e.g. M->N, i->l, 3->8. These are
undetectable at this stage. -->
<space count="*" />
<!-- genus starts with an uppercase letter followed by either several
lowercase letters or one/two lowercase letters followed by a period (abbreviation) -->
<field name="genus" pattern="(?:\u2018?)[A-Z](?:[a-z]{2,}|[a-z]{1,2}\.)">
</field>
<space count="+"/>
<!-- species should be only 2 or more lowercase characters -->
<field name="species" pattern="[a-z]{2,}(?:\u2019?)">
</field>
<space count="+"/>
<field name="strain" pattern="[^\s\(]+">
</field>
<space count="+"/>
<!-- ID has an alpha and numeric part EU840723 or AJ307974 or NC_002967 -->
<!-- require but strip left bracket -->
<field name="id0" pattern="(?:\()(?:[A-Z]{1,2}|NC_)">
</field>
<!-- and right bracket -->
<field name="id1" pattern="[0-9]{5,6}(?:\))">
</field>
<space count="*" />
</pattern>
<pattern level="1">
<!-- this regex allows for common garbles (detected as an error in 0)
and corrects them by "safe" substitution. The correction will generate a
conformant field, but it may not be "correct". Each substitution counts as an
error and can be logged. -->
<space count="*" />
<field name="genus" pattern="(?:\u2018?)[A-Z](?:[a-z]{2,}|[a-z02S/]?\.)">
<substitution name="zero2little_o_or_big_o" original="0" edited="[oO]" />
<substitution name="two2little_z_or_big_z" original="2" edited="[zZ]" />
<substitution name="slash2little_l_or_big_i" original="/" edited="[lI]" />
<!-- edit more as we find them -->
</field>
<space count="+"/>
<field name="species" pattern="[a-z/]+(?:\u2019?)">
<substitution name="s_slash_c2lower_sic" original="s/c" edited="sic" />
<substitution name="c_slash_l2lower_cil" original="d/" edited="cil" />
<substitution name="k_slash_n2lower_kin" original="k/n" edited="kin" />
<substitution name="r_slash_o2lower_rio" original="r/o" edited="rio" />
<substitution name="zero2little_o" original="0" edited="o" />
<substitution name="big_s2little_s" original="S" edited="s" />
<substitution name="slash2lower_l" original="/" edited="l" />
</field>
<space count="+"/>
<field name="strain" pattern="[^\s\(]+(?:\s+[^\s\(]+)?"/>
<space count="+"/>
<field name="id0" pattern="(?:\()(?:[A-Z123580]{1,2}|NC_)">
<!-- big letters may be garbled to numbers -->
<substitution name="zero2big_o" original="0" edited="O" />
<substitution name="one2big_i" original="1" edited="I" />
<substitution name="two2big_z" original="2" edited="Z" />
<substitution name="three2big_b" original="3" edited="B" />
<substitution name="five2big_s" original="5" edited="S" />
<substitution name="eight2big_b" original="8" edited="B" />
</field>
<field name="id1" pattern="[0-9BIOSZ]{5,6}(?:\))">
<!-- numbers may be garbled to big letters -->
<substitution name="big_o2zero" original="O" edited="0" />
<substitution name="big_b2eight" original="B" edited="8" />
<substitution name="big_i2one" original="I" edited="1" />
<substitution name="big_s2five" original="S" edited="5" />
<substitution name="big_z2two" original="Z" edited="2" />
</field>
<space count="*" />
</pattern>
</patternList>
</editor>
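To make the level-0 behaviour concrete, here is a minimal Python sketch that joins the level-0 field patterns into one regular expression. It is not the ami-phylo implementation: the names `LEVEL0` and `parse_strict` are hypothetical, `<space count="*"/>` and `<space count="+"/>` are assumed to mean `\s*` and `\s+`, and the brackets are matched outside the named groups on the assumption that the non-capturing parts are what gets stripped.

```python
import re

# Level-0 field patterns from the script above, joined in document order.
# Assumption: <space count="*"/> is \s* and <space count="+"/> is \s+.
LEVEL0 = re.compile(
    r"\s*"
    r"(?P<genus>(?:\u2018?)[A-Z](?:[a-z]{2,}|[a-z]{1,2}\.))"
    r"\s+"
    r"(?P<species>[a-z]{2,}(?:\u2019?))"
    r"\s+"
    r"(?P<strain>[^\s\(]+)"
    r"\s+"
    # Brackets are required but kept out of the captured ID parts.
    r"\((?P<id0>[A-Z]{1,2}|NC_)"
    r"(?P<id1>[0-9]{5,6})\)"
    r"\s*"
)


def parse_strict(label):
    """Return the fields as a dict, or None if the label is not conformant."""
    match = LEVEL0.fullmatch(label)
    return match.groupdict() if match else None


print(parse_strict("Jonquetella anthropi E3_33 (EU840722)"))
print(parse_strict("Streptococcus gordonii CH1 (NC_OO9785)"))
```

In this sketch the second label is rejected because the letter O appears in the numeric part of the ID. A label with a two-token strain such as "ATCC 29328" is also rejected here, since only the level-1 strain pattern allows an optional second token.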
Latest output. It includes records of which strings were edited (these are accessible as a list). Each substitution is of the form old=>new (the strings can be any length and contain any characters). A result of null means that a valid set of fields could not be created.
Bacillus subtilis 168 (NC_00964) => Bacillus subtilis 168(NC_00964)
Lactoba => null
Desulfo => null
Proprinogenum modestus DSM 2376T (AJ307978) => Proprinogenum modestus DSM 2376T(AJ307978)
Clostridium botulinum serotype e (M94261) => Clostridium botulinum serotype e(M94261)
Streptococcus gordonii CH1 (NC_OO9785) => Streptococcus gordonii CH1(NC_009785); [O=>0, O=>0]
Jonquetella anthropi E3_33 (EU840722) => Jonquetella anthropi E3_33(EU840722)
Pseudomonas aeruginosa PAO1 (NC_OO2516) => Pseudomonas aeruginosa PAO1(NC_002516); [O=>0, O=>0]
Thermotoga maritime MSBBT (NC_O00853) => Thermotoga maritime MSBBT(NC_000853); [O=>0]
?Synechococcus elongatus? PCC 6301 (NC_0O6576) => ?Synechococcus elongatus? PCC 6301(NC_006576); [O=>0]
Mycobacterium tuberculosis H37Ra (NC_0O9525) => Mycobacterium tuberculosis H37Ra(NC_009525); [O=>0]
Ochrobactrum anthropi ATCC 49188T (NC_0O9667) => Ochrobactrum anthropi ATCC 49188T(NC_009667); [O=>0]
Fusobacterium nucleatum DSM 20482 (AJ307974) => Fusobacterium nucleatum DSM 20482(AJ307974)
Caulobacter crescentus CB15 (NC_0O2696) => Caulobacter crescentus CB15(NC_002696); [O=>0]
Es => null
Borrelia burgdorferi B31T (NC_O01218) => Borrelia burgdorferi B31T(NC_001218); [O=>0]
Chlorobium tepidum TLST (NC_OO2932) => Chlorobium tepidum TLST(NC_002932); [O=>0, O=>0]
Finegoldia magna ATCC 29328 (NC_010376) => Finegoldia magna ATCC 29328(NC_010376)
Bordetella pertussis Tohama (NC_0O2929) => Bordetella pertussis Tohama(NC_002929); [O=>0]
Neisseria gonorrhoeae FA1090 (NC_002946) => Neisseria gonorrhoeae FA1090(NC_002946)
Pyramidobacter piscolens W5455T (EU379932) => Pyramidobacter piscolens W5455T(EU379932)
Haemophilus influenzae RdKW20 (U32697) => Haemophilus influenzae RdKW20(U32697)
=> null
Synergistes jonesii ATCC 49833T (EU840723) => Synergistes jonesii ATCC 49833T(EU840723)
Optiutus terrae PBQO-1T (NC_010571) => Optiutus terrae PBQO-1T(NC_010571)
Porphyromonas gingivalis W83 (AEO15924) => Porphyromonas gingivalis W83(AE015924); [O=>0]
Bacteroides fragi/is ATCC 252857 (NC_OO3228) => Bacteroides fragilis ATCC 252857(NC_003228); [/=>l, O=>0, O=>0]
?Aquifex aeolicus? VF5 (NC_000918) => ?Aquifex aeolicus? VF5(NC_000918)
Bifidobacterium longum NCC2705 (NC_0O4307) => Bifidobacterium longum NCC2705(NC_004307); [O=>0]
Rhodopirellula baltica SH 1T (NC_005027) => Rhodopirellula baltica SH 1T(NC_005027)
Mycoplasma pneumoniae M129 (NC_00O912) => Mycoplasma pneumoniae M129(NC_000912); [O=>0]
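The edit records in the output above can be reproduced for the numeric part of the ID with a short Python sketch. This is an illustration only, not the ami-phylo code: `ID_RE`, `ID1_FIXES` and `correct_id` are hypothetical names, and the mapping assumes that B and I are meant to become the digits 8 and 1, in line with the other id1 substitutions.

```python
import re

# Assumed letter-to-digit corrections for the numeric ID part (id1).
ID1_FIXES = {"O": "0", "B": "8", "I": "1", "S": "5", "Z": "2"}

# Bracketed ID: alpha prefix (id0), then 5-6 digits or digit-like capitals (id1).
ID_RE = re.compile(r"\((?P<id0>[A-Z]{1,2}|NC_)(?P<id1>[0-9BIOSZ]{5,6})\)")


def correct_id(label):
    """Return (corrected_id, edits), or None if no bracketed ID is found."""
    match = ID_RE.search(label)
    if match is None:
        return None
    edits = []
    fixed = []
    for char in match.group("id1"):
        new = ID1_FIXES.get(char, char)
        if new != char:
            edits.append(f"{char}=>{new}")  # same old=>new form as the log
        fixed.append(new)
    return match.group("id0") + "".join(fixed), edits


corrected, edits = correct_id("Streptococcus gordonii CH1 (NC_OO9785)")
print(f"{corrected}; [{', '.join(edits)}]")
```

Keeping the edits in a list next to the corrected value means each "safe" correction stays visible and can be reviewed later, since a conformant ID is not necessarily the correct one.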
The following regular expression(s) should allow (a) validation and (b) controlled correction of errors in the creation of tip labels in ami-phylo.