ContentMine / phylotree

A repository for ami-phylotree development

Detecting/correcting garbles in OCR for ami-phylo #20

Open petermr opened 9 years ago

petermr commented 9 years ago

The following regular expressions should allow (a) validation and (b) controlled correction of errors in the tip labels created by ami-phylo:

<patternList>
  <!-- 
  Pattern for extracting species and ID from Int. J. Syst. Evol. Microbiol. (IJSEM) publications
  ideal pattern is 
          genus   species  strain    id
  of form
          Abcdia foobarius AS013T (EO740822)
  there should be exactly 4 words (space-separated).
  Any target *without* a single pair of balanced brackets is an absolute fail
  We expect single letter garbles (B->8, 0->O, 1->I, etc.) and unexpected whitespace 
  insertion or deletion (indel)

  Note:
    <space/> translates to \s+
     all fields are wrapped in capture brackets (...)
     and concatenated to a single regex  
   -->
  <pattern level="0">
    <!--  this regex enforces the ID patterns strictly.
    It will only fail when characters are garbled to the same type, e.g.
    M->N , i->l, 3->8
    These are undetectable at this stage
     -->
    <possibleSpace/>
    <!--  genus starts with an uppercase letter followed by either several lowercase letters
    or one/two lowercase letters followed by a period (abbreviation) -->
    <field name="genus" pattern="(?:\u2018?)[A-Z](?:[a-z]{2,}|[a-z]{1,2}\.)">
    </field>
    <space/>
    <!-- species should be only 2 or more lowercase characters -->
    <field name="species" pattern="[a-z]{2,}(?:\u2019?)">
    </field>
    <space/>
    <field name="strain" pattern="[^\s\(]+">
    </field>
    <space/>
    <!-- ID has an alpha and numeric part EU840723 or AJ307974 or NC_002967 -->
    <!-- require but strip left bracket -->
    <field name="id0" pattern="(?:\()(?:[A-Z]{1,2}|NC_)">
    </field>
    <!--  and right bracket -->
    <field name="id1" pattern="[0-9]{5,6}(?:\))">
    </field>
    <possibleSpace/>
  </pattern>

  <pattern level="1">
    <!-- this regex allows for common garbles (detected as an error in 0) 
    and error correction by "safe" correction. The correction will generate a conformant 
    field, but it may not be "correct". Each substitution corresponds to an error and can be logged.  -->
    <possibleSpace/>
    <field name="genus" pattern="(?:\u2018?)[A-Z](?:[a-z]{2,}|[a-z02S/]?\.)">
      <substitution name="zero2little_o_or_big_o" original="0" edited="[oO]"/>
      <substitution name="two2little_z_or_big_z" original="2" edited="[zZ]"/>
      <substitution name="big_s2little_s" original="S" edited="s"/>
      <substitution name="slash2little_l_or_big_i" original="/" edited="[lI]"/>
      <!-- edit more as we find them -->
    </field>
    <space/>
    <field name="species" pattern="[a-z/]+(?:\u2019?)">
      <substitution name="s_slash_c2lower_sic" original="s/c" edited="sic"/>
      <substitution name="c_slash_l2lower_cil" original="d/" edited="cil"/>
      <substitution name="k_slash_n2lower_kin" original="k/n" edited="kin"/>
      <substitution name="r_slash_o2lower_rio" original="r/o" edited="rio"/>
      <substitution name="zero2little_o" original="0" edited="o"/>
      <substitution name="big_s2little_s" original="S" edited="s"/>
      <substitution name="slash2lower_l" original="/" edited="l"/>
    </field>
    <space/>
    <field name="strain" pattern="[^\s\(]+">
    </field>
    <space/>
    <field name="id0" pattern="(?:\()(?:[A-Z123580]{1,2}|NC_)">
      <!-- big letters may be garbled to numbers -->
      <substitution name="zero2big_o" original="0" edited="O"/>
      <substitution name="one2big_i" original="1" edited="I"/>
      <substitution name="two2big_z" original="2" edited="Z"/>
      <substitution name="three2big_b" original="3" edited="B"/>
      <substitution name="five2big_s" original="5" edited="S"/>
      <substitution name="eight2big_b" original="8" edited="B"/>
    </field>
    <field name="id1" pattern="[0-9BIOSZ]{5,6}(?:\))">
      <!--  numbers may be garbled to big letters -->
      <substitution name="big_o2zero" original="O" edited="0"/>
      <substitution name="big_b2eight" original="B" edited="8"/>
      <substitution name="big_i2one" original="I" edited="1"/>
      <substitution name="big_s2five" original="S" edited="5"/>
      <substitution name="big_z2two" original="Z" edited="2"/>
    </field>
    <possibleSpace/>
  </pattern>
</patternList>
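
The note at the top of the pattern list (each field wrapped in capture brackets, `<space/>` translated to `\s+`, everything concatenated into a single regex) can be sketched roughly as follows. This is only an illustration in Python, not the ami-phylo code: `build_regex` and `validate` are hypothetical names, `<possibleSpace/>` is assumed to mean `\s*`, named groups are used instead of plain capture brackets, and the ID patterns are written with the alternation grouped and the right bracket escaped, which is assumed to be the intent.

```python
import re

SPACE = r"\s+"           # <space/> translates to \s+
POSSIBLE_SPACE = r"\s*"  # assumption: <possibleSpace/> means optional whitespace

# Level-0 fields in document order. The id0/id1 patterns are written with the
# alternation grouped and the right bracket escaped (assumed intent).
LEVEL0 = [
    POSSIBLE_SPACE,
    ("genus", r"(?:\u2018?)[A-Z](?:[a-z]{2,}|[a-z]{1,2}\.)"),
    SPACE,
    ("species", r"[a-z]{2,}(?:\u2019?)"),
    SPACE,
    ("strain", r"[^\s\(]+"),
    SPACE,
    ("id0", r"(?:\()(?:[A-Z]{1,2}|NC_)"),
    ("id1", r"[0-9]{5,6}(?:\))"),
    POSSIBLE_SPACE,
]


def build_regex(parts):
    """Wrap each field in a named capture group and concatenate everything."""
    pieces = []
    for part in parts:
        if isinstance(part, tuple):
            name, pattern = part
            pieces.append(f"(?P<{name}>{pattern})")
        else:
            pieces.append(part)
    return re.compile("".join(pieces))


LEVEL0_REGEX = build_regex(LEVEL0)


def validate(label):
    """Return the captured fields for a conformant label, else None."""
    match = LEVEL0_REGEX.fullmatch(label)
    if match is None:
        return None
    fields = match.groupdict()
    # The brackets sit inside the capture groups, so strip them explicitly.
    fields["id0"] = fields["id0"].lstrip("(")
    fields["id1"] = fields["id1"].rstrip(")")
    return fields
```

With this sketch, the ideal label `Abcdia foobarius AS013T (EO740822)` yields the four fields, while a garbled ID such as `(NC_OO9785)` is rejected at level 0 and would be passed on to level 1.
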
petermr commented 9 years ago

Latest output:

Bacillus subtilis 168 (NC_00964) => Bacillus subtilis 168 (NC_ 00964)
Lactoba => null
Desulfo => null
Proprinogenum modestus DSM 2376T (AJ307978) => Proprinogenum modestus DSM 2376T (AJ 307978)
Clostridium botulinum serotype e (M94261) => Clostridium botulinum serotype e (M 94261)
Streptococcus gordonii CH1 (NC_OO9785) => Streptococcus gordonii CH1 (NC_ 009785)
Jonquetella anthropi E3_33 (EU840722) => Jonquetella anthropi E3_33 (EU 840722)
Pseudomonas aeruginosa PAO1 (NC_OO2516) => Pseudomonas aeruginosa PAO1 (NC_ 002516)
Thermotoga maritime MSBBT (NC_O00853) => Thermotoga maritime MSBBT (NC_ 000853)
?Synechococcus elongatus? PCC 6301 (NC_0O6576) => ?Synechococcus elongatus? PCC 6301 (NC_ 006576)
Mycobacterium tuberculosis H37Ra (NC_0O9525) => Mycobacterium tuberculosis H37Ra (NC_ 009525)
Ochrobactrum anthropi ATCC 49188T (NC_0O9667) => Ochrobactrum anthropi ATCC 49188T (NC_ 009667)
Fusobacterium nucleatum DSM 20482 (AJ307974) => Fusobacterium nucleatum DSM 20482 (AJ 307974)
Caulobacter crescentus CB15 (NC_0O2696) => Caulobacter crescentus CB15 (NC_ 002696)
Es => null
Borrelia burgdorferi B31T (NC_O01218) => Borrelia burgdorferi B31T (NC_ 001218)
Chlorobium tepidum TLST (NC_OO2932) => Chlorobium tepidum TLST (NC_ 002932)
Finegoldia magna ATCC 29328 (NC_010376) => Finegoldia magna ATCC 29328 (NC_ 010376)
Bordetella pertussis Tohama (NC_0O2929) => Bordetella pertussis Tohama (NC_ 002929)
Neisseria gonorrhoeae FA1090 (NC_002946) => Neisseria gonorrhoeae FA1090 (NC_ 002946)
Pyramidobacter piscolens W5455T (EU379932) => Pyramidobacter piscolens W5455T (EU 379932)
Haemophilus influenzae RdKW20 (U32697) => Haemophilus influenzae RdKW20 (U 32697)
 => null
Synergistes jonesii ATCC 49833T (EU840723) => Synergistes jonesii ATCC 49833T (EU 840723)
Optiutus terrae PBQO-1T (NC_010571) => Optiutus terrae PBQO-1T (NC_ 010571)
Porphyromonas gingivalis W83 (AEO15924) => Porphyromonas gingivalis W83 (AE 015924)
Bacteroides fragi/is ATCC 252857 (NC_OO3228) => Bacteroides fragilis ATCC 252857 (NC_ 003228)
?Aquifex aeolicus? VF5 (NC_000918) => ?Aquifex aeolicus? VF5 (NC_ 000918)
Bifidobacterium longum NCC2705 (NC_0O4307) => Bifidobacterium longum NCC2705 (NC_ 004307)
Rhodopirellula baltica SH 1T (NC_005027) => Rhodopirellula baltica SH 1T (NC_ 005027)
Mycoplasma pneumoniae M129 (NC_00O912) => Mycoplasma pneumoniae M129 (NC_ 000912)

I'll fix the spurious space in the ID in the result.

The substitutions should be relatively easy to customize and are then automatic.

petermr commented 9 years ago

The substitution script:

<editor>
   <patternList>
      <pattern level="0">
         <!-- this regex enforces the ID patterns strictly. It will only fail when 
            characters are garbled to the same type, e.g. M->N , i->l, 3->8 These are 
            undetectable at this stage -->
         <space count="*" />
         <!-- genus starts with an uppercase letter followed by either several 
            lowercase letters or one/two lowercase letters followed by a period (abbreviation) -->
         <field name="genus" pattern="(?:\u2018?)[A-Z](?:[a-z]{2,}|[a-z]{1,2}\.)">
         </field>
         <space count="+"/>
         <!-- species should be only 2 or more lowercase characters -->
         <field name="species" pattern="[a-z]{2,}(?:\u2019?)">
         </field>
         <space count="+"/>
         <field name="strain" pattern="[^\s\(]+">
         </field>
         <space count="+"/>
         <!-- ID has an alpha and numeric part EU840723 or AJ307974 or NC_002967 -->
         <!-- require but strip left bracket -->
         <field name="id0" pattern="(?:\()(?:[A-Z]{1,2}|NC_)">
         </field>
         <!-- and right bracket -->
         <field name="id1" pattern="[0-9]{5,6}(?:\))">
         </field>
         <space count="*" />
      </pattern>

      <pattern level="1">
         <!-- this regex allows for common garbles (detected as an error in 0) 
            and error correction by "safe" correction. The correction will generate a 
            conformant field, but it may not be "correct". Each substitution corresponds 
            to an error and can be logged. -->
         <space count="*" />
         <field name="genus" pattern="(?:\u2018?)[A-Z](?:[a-z]{2,}|[a-z02S/]?\.)">
            <substitution name="zero2little_o_or_big_o" original="0" edited="[oO]" />
            <substitution name="two2little_z_or_big_z" original="2" edited="[zZ]" />
            <substitution name="slash2little_l_or_big_i" original="/" edited="[lI]" />
            <!-- edit more as we find them -->
         </field>
         <space count="+"/>
         <field name="species" pattern="[a-z/]+(?:\u2019?)">
            <substitution name="s_slash_c2lower_sic" original="s/c" edited="sic" />
            <substitution name="c_slash_l2lower_cil" original="d/" edited="cil" />
            <substitution name="k_slash_n2lower_kin" original="k/n" edited="kin" />
            <substitution name="r_slash_o2lower_rio" original="r/o" edited="rio" />
            <substitution name="zero2little_o" original="0" edited="o" />
            <substitution name="big_s2little_s" original="S" edited="s" />
            <substitution name="slash2lower_l" original="/" edited="l" />
         </field>
         <space count="+"/>
         <field name="strain" pattern="[^\s\(]+(?:\s+[^\s\(]+)?"/>
         <space count="+"/>
         <field name="id0" pattern="(?:\()(?:[A-Z123580]{1,2}|NC_)">
            <!-- big letters may be garbled to numbers -->
            <substitution name="zero2big_o" original="0" edited="O" />
            <substitution name="one2big_i" original="1" edited="I" />
            <substitution name="two2big_z" original="2" edited="Z" />
            <substitution name="three2big_b" original="3" edited="B" />
            <substitution name="five2big_s" original="5" edited="S" />
            <substitution name="eight2big_b" original="8" edited="B" />
         </field>
         <field name="id1" pattern="[0-9BIOSZ]{5,6}(?:\))">
            <!-- numbers may be garbled to big letters -->
            <substitution name="big_o2zero" original="O" edited="0" />
            <substitution name="big_b2eight" original="B" edited="8" />
            <substitution name="big_i2one" original="I" edited="1" />
            <substitution name="big_s2five" original="S" edited="5" />
            <substitution name="big_z2two" original="Z" edited="2" />
         </field>
         <space count="*" />
      </pattern>
   </patternList>
</editor>
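
For anyone wanting to experiment with a script like this outside ami-phylo, one way it might be loaded is sketched below in Python with the standard `xml.etree.ElementTree` module. The function name `load_pattern` is hypothetical, the embedded XML is an abridged copy of the level-1 ID fields only, and treating a `<space>` without a `count` attribute as `\s+` is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Abridged copy of the level-1 pattern (ID fields only), for illustration.
SCRIPT = r"""
<editor>
  <patternList>
    <pattern level="1">
      <space count="*"/>
      <field name="id0" pattern="(?:\()(?:[A-Z123580]{1,2}|NC_)">
        <substitution name="zero2big_o" original="0" edited="O"/>
      </field>
      <field name="id1" pattern="[0-9BIOSZ]{5,6}(?:\))">
        <substitution name="big_o2zero" original="O" edited="0"/>
        <substitution name="big_s2five" original="S" edited="5"/>
      </field>
      <space count="*"/>
    </pattern>
  </patternList>
</editor>
"""


def load_pattern(xml_text, level):
    """Return (compiled regex, {field name: [(original, edited), ...]})."""
    root = ET.fromstring(xml_text)
    pattern = root.find(f"./patternList/pattern[@level='{level}']")
    if pattern is None:
        raise ValueError(f"no pattern with level {level}")
    regex_parts = []
    substitutions = {}
    for child in pattern:
        if child.tag == "space":
            # count="*" gives \s*, count="+" gives \s+ (default assumed to be +)
            regex_parts.append(r"\s" + child.get("count", "+"))
        elif child.tag == "field":
            name = child.get("name")
            regex_parts.append(f"(?P<{name}>{child.get('pattern')})")
            substitutions[name] = [
                (sub.get("original"), sub.get("edited"))
                for sub in child.findall("substitution")
            ]
    return re.compile("".join(regex_parts)), substitutions
```

Keeping the substitutions in a per-field list preserves their document order, which matters when a multi-character original such as `s/c` has to be tried before the single `/`.
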
petermr commented 9 years ago

Latest output, including records of which strings were edited (these are accessible as a list). Each substitution is of the form old=>new (the strings can be any length and contain any characters). null means that a valid set of fields could not be created.

Bacillus subtilis 168 (NC_00964) =>  Bacillus subtilis 168(NC_00964)
Lactoba => null
Desulfo => null
Proprinogenum modestus DSM 2376T (AJ307978) =>  Proprinogenum modestus DSM 2376T(AJ307978)
Clostridium botulinum serotype e (M94261) =>  Clostridium botulinum serotype e(M94261)
Streptococcus gordonii CH1 (NC_OO9785) =>  Streptococcus gordonii CH1(NC_009785); [O=>0, O=>0]
Jonquetella anthropi E3_33 (EU840722) =>  Jonquetella anthropi E3_33(EU840722)
Pseudomonas aeruginosa PAO1 (NC_OO2516) =>  Pseudomonas aeruginosa PAO1(NC_002516); [O=>0, O=>0]
Thermotoga maritime MSBBT (NC_O00853) =>  Thermotoga maritime MSBBT(NC_000853); [O=>0]
?Synechococcus elongatus? PCC 6301 (NC_0O6576) =>  ?Synechococcus elongatus? PCC 6301(NC_006576); [O=>0]
Mycobacterium tuberculosis H37Ra (NC_0O9525) =>  Mycobacterium tuberculosis H37Ra(NC_009525); [O=>0]
Ochrobactrum anthropi ATCC 49188T (NC_0O9667) =>  Ochrobactrum anthropi ATCC 49188T(NC_009667); [O=>0]
Fusobacterium nucleatum DSM 20482 (AJ307974) =>  Fusobacterium nucleatum DSM 20482(AJ307974)
Caulobacter crescentus CB15 (NC_0O2696) =>  Caulobacter crescentus CB15(NC_002696); [O=>0]
Es => null
Borrelia burgdorferi B31T (NC_O01218) =>  Borrelia burgdorferi B31T(NC_001218); [O=>0]
Chlorobium tepidum TLST (NC_OO2932) =>  Chlorobium tepidum TLST(NC_002932); [O=>0, O=>0]
Finegoldia magna ATCC 29328 (NC_010376) =>  Finegoldia magna ATCC 29328(NC_010376)
Bordetella pertussis Tohama (NC_0O2929) =>  Bordetella pertussis Tohama(NC_002929); [O=>0]
Neisseria gonorrhoeae FA1090 (NC_002946) =>  Neisseria gonorrhoeae FA1090(NC_002946)
Pyramidobacter piscolens W5455T (EU379932) =>  Pyramidobacter piscolens W5455T(EU379932)
Haemophilus influenzae RdKW20 (U32697) =>  Haemophilus influenzae RdKW20(U32697)
 => null
Synergistes jonesii ATCC 49833T (EU840723) =>  Synergistes jonesii ATCC 49833T(EU840723)
Optiutus terrae PBQO-1T (NC_010571) =>  Optiutus terrae PBQO-1T(NC_010571)
Porphyromonas gingivalis W83 (AEO15924) =>  Porphyromonas gingivalis W83(AE015924); [O=>0]
Bacteroides fragi/is ATCC 252857 (NC_OO3228) =>  Bacteroides fragilis ATCC 252857(NC_003228); [/=>l, O=>0, O=>0]
?Aquifex aeolicus? VF5 (NC_000918) =>  ?Aquifex aeolicus? VF5(NC_000918)
Bifidobacterium longum NCC2705 (NC_0O4307) =>  Bifidobacterium longum NCC2705(NC_004307); [O=>0]
Rhodopirellula baltica SH 1T (NC_005027) =>  Rhodopirellula baltica SH 1T(NC_005027)
Mycoplasma pneumoniae M129 (NC_00O912) =>  Mycoplasma pneumoniae M129(NC_000912); [O=>0]
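
The edit records in the output above could be produced along these lines. This is a minimal Python sketch rather than the actual ami-phylo code: `apply_substitutions` and `format_result` are hypothetical names, the input is assumed to be the already-captured fields with brackets stripped, `B` and `I` in the numeric part are assumed to map to the digits `8` and `1`, and the genus substitutions are left out because their `edited` values are character classes such as `[oO]` rather than a single replacement.

```python
# Level-1 substitutions per field, in the order they are tried.
# Multi-character originals come first so that "s/c" wins over "/".
SUBSTITUTIONS = {
    "species": [("s/c", "sic"), ("d/", "cil"), ("k/n", "kin"), ("r/o", "rio"),
                ("0", "o"), ("S", "s"), ("/", "l")],
    "id0": [("0", "O"), ("1", "I"), ("2", "Z"), ("3", "B"), ("5", "S"), ("8", "B")],
    # Assumption: B and I are corrected to the digits 8 and 1.
    "id1": [("O", "0"), ("B", "8"), ("I", "1"), ("S", "5"), ("Z", "2")],
}


def apply_substitutions(fields):
    """Return (edited fields, list of "old=>new" records, one per replacement)."""
    edited = {}
    log = []
    for name, value in fields.items():
        for old, new in SUBSTITUTIONS.get(name, []):
            count = value.count(old)
            if count:
                value = value.replace(old, new)
                log.extend([f"{old}=>{new}"] * count)
        edited[name] = value
    return edited, log


def format_result(fields, log):
    """Render the fields in the same layout as the output listing."""
    text = "{genus} {species} {strain}({id0}{id1})".format(**fields)
    if log:
        text += "; [" + ", ".join(log) + "]"
    return text


fields = {"genus": "Bacteroides", "species": "fragi/is",
          "strain": "ATCC 252857", "id0": "NC_", "id1": "OO3228"}
edited, log = apply_substitutions(fields)
print(format_result(edited, log))
# Bacteroides fragilis ATCC 252857(NC_003228); [/=>l, O=>0, O=>0]
```

Logging one record per replaced occurrence is what gives two `O=>0` entries for an ID such as `NC_OO3228`.
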