dictyBase / Modware-Loader

Various data munging and loading scripts for genome database

Line break for only first few words (xml -> mediawiki) #104

Closed ypandit closed 10 years ago

ypandit commented 10 years ago

Input: a summary paragraph (XML) like the one below

<summary paragraph_no="3862">DNA-dependent RNA polymerase catalyzes the transcription of DNA into RNA using the four ribonucleoside triphosphates as substrates (<a href="/ontology/go/0003899/annotation/page/1">DNA-directed RNA polymerase activity</a>). Three distinct zinc-containing RNA polymerases are found in eukaryotic nuclei: <a href="/gene/DDB_G0275759">rpa1</a> for the ribosomal RNA precursor (<a href="/ontology/go/0006360/annotation/page/1">transcription from RNA polymerase I promoter</a>), (<a href="/ontology/go/0005736/annotation/page/1">DNA-directed RNA polymerase I complex</a>), <a href="/gene/DDB_G0279193">rpb1</a> for the mRNA precursor (<a href="/ontology/go/0006366/annotation/page/1">transcription from RNA polymerase II promoter</a>), (<a href="/ontology/go/0005665/annotation/page/1">DNA-directed RNA polymerase II, core complex</a>), and <a href="/gene/DDB_G0277199">rpc1</a> for 5S and tRNA genes (<a href="/ontology/go/0006383/annotation/page/1">transcription from RNA polymerase III promoter</a>), (<a href="/ontology/go/0005666/annotation/page/1">DNA-directed RNA polymerase III complex</a>).<br/>RNA polymerase II is composed of 12 subunits: <a href="/gene/DDB_G0279193">rpb1</a>, <a href="/gene/DDB_G0288257">rpb2</a>, <a href="/gene/DDB_G0292244">rpb3</a>, <a href="/gene/DDB_G0282739">rpb4</a>, <a href="/gene/DDB_G0291636">rpb5</a>, <a href="/gene/DDB_G0291037">rpb6</a>, <a href="/gene/DDB_G0284891">rpb7</a>, <a href="/gene/DDB_G0278039">rpb8</a>, <a href="/gene/DDB_G0268306">rpb9</a>, <a href="/gene/DDB_G0272036">rpb10</a>, <a href="/gene/DDB_G0277677">rpb11</a>, and<a href="/gene/DDB_G0283365">rpb12</a>. Subunits 5, 6, 8, 10, and 12 are common to all threeRNA polymerases (reviewed by <a href="http://www.ncbi.nlm.nih.gov/pubmed/9618449">Hampsey</a> in 1998). <br/><curation_status>Gene has been comprehensively annotated, 09-APR-2004 PG</curation_status></summary>

Output: MediaWiki (converted using HTML::WikiConverter)

DDB_G0268306    Pascale Gaudet  [/ontology/go/0003899/annotation/page/1 DNA-directed RNA polymerase activity]). Three distinct zinc-containing RNA polymerases are found in eukaryotic nuclei: [/gene/DDB_G0275759 rpa1] for the ribosomal RNA precursor ([/ontology/go/0006360/annotation/page/1 transcription from RNA polymerase I promoter]), ([/ontology/go/0005736/annotation/page/1 DNA-directed RNA polymerase I complex]), [/gene/DDB_G0279193 rpb1] for the mRNA precursor ([/ontology/go/0006366/annotation/page/1 transcription from RNA polymerase II promoter]), ([/ontology/go/0005665/annotation/page/1 DNA-directed RNA polymerase II, core complex]), and [/gene/DDB_G0277199 rpc1] for 5S and tRNA genes ([/ontology/go/0006383/annotation/page/1 transcription from RNA polymerase III promoter]), ([/ontology/go/0005666/annotation/page/1 DNA-directed RNA polymerase III complex]).<br />RNA polymerase II is composed of 12 subunits: [/gene/DDB_G0279193 rpb1], [/gene/DDB_G0288257 rpb2], [/gene/DDB_G0292244 rpb3], [/gene/DDB_G0282739 rpb4], [/gene/DDB_G0291636 rpb5], [/gene/DDB_G0291037 rpb6], [/gene/DDB_G0284891 rpb7], [/gene/DDB_G0278039 rpb8], [/gene/DDB_G0268306 rpb9], [/gene/DDB_G0272036 rpb10], [/gene/DDB_G0277677 rpb11], and[/gene/DDB_G0283365 rpb12]. Subunits 5, 6, 8, 10, and 12 are common to all threeRNA polymerases (reviewed by [http://www.ncbi.nlm.nih.gov/pubmed/9618449 Hampsey] in 1998). <br />Gene has been comprehensively annotated, 09-APR-2004 PG

DNA-dependent RNA polymerase catalyzes the transcription of DNA into RNA using the four ribonucleoside triphosphates as substrates (

Note: Ignore the first two columns of the output (DDBG ID and author name).

ypandit commented 10 years ago

Input:

<summary paragraph_no="3782">The <i>abpA</i> gene encodes alpha-actinin, a 95 kD protein originally isolated from the cell cortex. It forms a dimer, crosslinks actin filaments (<a href="/ontology/go/0051017/annotation/page/1">actin filament bundle formation</a>) into lateral arrays, and increases the actin-stimulated Mg ATPase of myosin. Both activities are regulated by Ca2+ <a href="http://www.ncbi.nlm.nih.gov/pubmed/6746725"> (Condeelis et al. 1984)</a>. The alpha-actinin molecule carries two characteristic EF-hand structures. The calcium-binding loops form the structural basis for the calcium sensitivity (<a href="/ontology/go/0005509/annotation/page/1">calcium ion binding</a>) <a href="http://www.ncbi.nlm.nih.gov/pubmed/3622778"> (Noegel et al. 1987)</a>. It was shown that EF hand I has a low affinity for Ca2+ and EF hand II a high affinity, implying a regulatory function of EF hand I in the inhibition of F-actin cross-linking activity  <a href="http://www.ncbi.nlm.nih.gov/pubmed/8486739"> (Witke et al. 1993)</a>. This was confirmed when the viscoelastic properties of F-actin solutions in the presence of mutated EF hand structures was tested, and it was concluded that the first EF hand of alpha-actinin is crucial for its crosslinking function <a href="http://www.ncbi.nlm.nih.gov/pubmed/8561496"> (Janssen et al. 1996)</a>.<br/>The phenotype of an alpha-actinin mutant, HG1130, which retained only trace amounts of alpha-actinin, appeared to be normal <a href="http://www.ncbi.nlm.nih.gov/pubmed/3956480"> (Wallraff et al. 1986)</a>. The lack of alpha-actinin was shown to only slightly affect cell substrate adhesion <a href="http://www.ncbi.nlm.nih.gov/pubmed/10411959"> (Weber 1996)</a>, growth and pinocytosis <a href="http://www.ncbi.nlm.nih.gov/pubmed/10413681"> (Rivero et al. 1999)</a>, and development <a href="http://www.ncbi.nlm.nih.gov/pubmed/10704840"> (Ponte et al. 2000)</a>. 
A redundancy in the microfilament system was suggested when two cytoskeletal genes, <i>abpA</i> and <a href="/gene/DDB_G0269100">abpC</a>, were disrupted. The loss of both proteins severely affected growth and resistance to osmotic stress. It also resulted in reduced motility and phagocytosis rates, and heavily impaired development. These changes were reversed by expressing either <i>abpA</i> or <i>abpC</i>, showing the defects to be due to the absence of the two F-actin cross-linking proteins <a href="http://www.ncbi.nlm.nih.gov/pubmed/1732064"> (Witke et al. 1992)</a>, <a href="http://www.ncbi.nlm.nih.gov/pubmed/8937986"> (Rivero et al. 1996)</a>.<curation_status>Gene has been comprehensively annotated, 19-SEP-2003 PF</curation_status></summary>

Output:

DDB_G0268632    Petra Fey   ''abpA'' gene encodes alpha-actinin, a 95 kD protein originally isolated from the cell cortex. It forms a dimer, crosslinks actin filaments ([/ontology/go/0051017/annotation/page/1 actin filament bundle formation]) into lateral arrays, and increases the actin-stimulated Mg ATPase of myosin. Both activities are regulated by Ca2+ [http://www.ncbi.nlm.nih.gov/pubmed/6746725  (Condeelis et al. 1984)]. The alpha-actinin molecule carries two characteristic EF-hand structures. The calcium-binding loops form the structural basis for the calcium sensitivity ([/ontology/go/0005509/annotation/page/1 calcium ion binding]) [http://www.ncbi.nlm.nih.gov/pubmed/3622778  (Noegel et al. 1987)]. It was shown that EF hand I has a low affinity for Ca2+ and EF hand II a high affinity, implying a regulatory function of EF hand I in the inhibition of F-actin cross-linking activity [http://www.ncbi.nlm.nih.gov/pubmed/8486739  (Witke et al. 1993)]. This was confirmed when the viscoelastic properties of F-actin solutions in the presence of mutated EF hand structures was tested, and it was concluded that the first EF hand of alpha-actinin is crucial for its crosslinking function [http://www.ncbi.nlm.nih.gov/pubmed/8561496  (Janssen et al. 1996)].<br />The phenotype of an alpha-actinin mutant, HG1130, which retained only trace amounts of alpha-actinin, appeared to be normal [http://www.ncbi.nlm.nih.gov/pubmed/3956480  (Wallraff et al. 1986)]. The lack of alpha-actinin was shown to only slightly affect cell substrate adhesion [http://www.ncbi.nlm.nih.gov/pubmed/10411959  (Weber 1996)], growth and pinocytosis [http://www.ncbi.nlm.nih.gov/pubmed/10413681  (Rivero et al. 1999)], and development [http://www.ncbi.nlm.nih.gov/pubmed/10704840  (Ponte et al. 2000)]. A redundancy in the microfilament system was suggested when two cytoskeletal genes, ''abpA'' and [/gene/DDB_G0269100 abpC], were disrupted. 
The loss of both proteins severely affected growth and resistance to osmotic stress. It also resulted in reduced motility and phagocytosis rates, and heavily impaired development. These changes were reversed by expressing either ''abpA'' or ''abpC'', showing the defects to be due to the absence of the two F-actin cross-linking proteins [http://www.ncbi.nlm.nih.gov/pubmed/1732064  (Witke et al. 1992)], [http://www.ncbi.nlm.nih.gov/pubmed/8937986  (Rivero et al. 1996)].Gene has been comprehensively annotated, 19-SEP-2003 PF

The

ypandit commented 10 years ago

Input:

<summary paragraph_no="4162">The Roco family consists of multi-domain proteins that share three domains in common: the Roc domain (<u>R</u>as <u>o</u>f <u>c</u>omplex proteins), COR (<u>C</u>-terminal <u>o</u>f <u>R</u>oc), and a kinasedomain. Additionally, all Roco family members contain a leucine-rich repeat (LRR), with the exception of<a href="/gene/DDB_G0267472">roco7</a>.  Other domains found in Roco proteins include WD40 repeats, cNB/CNMP (cyclicnucleotide binding), PH (pleckstrin homology), and RGS (regulator of G protein signaling) domains (<a href="http://www.ncbi.nlm.nih.gov/pubmed/14654223">Bosgraaf and Van Haastert 2003</a>).  Eleven <i>Dictyostelium</i> proteins belong to the Roco family: <a href="/gene/DDB_G0291079">gbpC</a>, <a href="/gene/DDB_G0273259">qkgA-1</a>, <a href="/gene/DDB_G0269250">pats1</a>, <a href="/gene/DDB_G0288251">roco4</a>, <a href="/gene/DDB_G0294533">roco5</a>, <a href="/gene/DDB_G0279417">roco6</a>, <a href="/gene/DDB_G0267472">roco7</a>, <a href="/gene/DDB_G0286127">roco8</a>, <a href="/gene/DDB_G0288183">roco9</a>, <a href="/gene/DDB_G0291710">roco10</a>, and <a href="/gene/DDB_G0268636">roco11</a>.<br/>(<a href="http://www.ncbi.nlm.nih.gov/pubmed/20348387">van Egmond and van Haastert 2010</a>) identified developmental defects in roco4- cells  during the transition from mound to fruiting body; prestalk cells produce reduced levels of cellulose, leading to unstable stalks that are unable to properly lift the spore head. (<a href="http://www.ncbi.nlm.nih.gov/pubmed/22689969">Gilsbach, et al. 2013</a>) solved the structure of Roco4 kinase wild-type, Parkinson disease-related mutants G1179S and L1180T and the structure of Roco4 kinase in complex with the LRRK2 inhibitor H1152. Serine 1187 and serine 1189 were shown to be essential for kinase activity.<br/><curation_status>Gene has been comprehensively annotated, 15-SEP-2004 KP</curation_status></summary>

Output:

DDB_G0279417    Robert Dodson   <u>R</u>as <u>o</u>f <u>c</u>omplex proteins), COR (<u>C</u>-terminal <u>o</u>f <u>R</u>oc), and a kinasedomain. Additionally, all Roco family members contain a leucine-rich repeat (LRR), with the exception of[/gene/DDB_G0267472 roco7]. Other domains found in Roco proteins include WD40 repeats, cNB/CNMP (cyclicnucleotide binding), PH (pleckstrin homology), and RGS (regulator of G protein signaling) domains ([http://www.ncbi.nlm.nih.gov/pubmed/14654223 Bosgraaf and Van Haastert 2003]). Eleven ''Dictyostelium'' proteins belong to the Roco family: [/gene/DDB_G0291079 gbpC], [/gene/DDB_G0273259 qkgA-1], [/gene/DDB_G0269250 pats1], [/gene/DDB_G0288251 roco4], [/gene/DDB_G0294533 roco5], [/gene/DDB_G0279417 roco6], [/gene/DDB_G0267472 roco7], [/gene/DDB_G0286127 roco8], [/gene/DDB_G0288183 roco9], [/gene/DDB_G0291710 roco10], and [/gene/DDB_G0268636 roco11].<br />([http://www.ncbi.nlm.nih.gov/pubmed/20348387 van Egmond and van Haastert 2010]) identified developmental defects in roco4- cells during the transition from mound to fruiting body; prestalk cells produce reduced levels of cellulose, leading to unstable stalks that are unable to properly lift the spore head. ([http://www.ncbi.nlm.nih.gov/pubmed/22689969 Gilsbach, et al. 2013]) solved the structure of Roco4 kinase wild-type, Parkinson disease-related mutants G1179S and L1180T and the structure of Roco4 kinase in complex with the LRRK2 inhibitor H1152. Serine 1187 and serine 1189 were shown to be essential for kinase activity.<br />Gene has been comprehensively annotated, 15-SEP-2004 KP

The Roco family consists of multi-domain proteins that share three domains in common: the Roc domain (
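
In all three examples above, the text that precedes the first inline element (`<a>`, `<i>`, or `<u>`) ends up on a line of its own, after the rest of the converted paragraph. For comparison, here is a minimal sketch of the expected result, where that leading text stays at the start of the same line. It is not HTML::WikiConverter and makes no claim about where the bug lives; it is a stand-in written in Python with the standard `html.parser` module, and the names `SummaryToWiki` and `to_wiki` are made up for illustration. It only handles the tags that appear in these summaries.

```python
from html.parser import HTMLParser


class SummaryToWiki(HTMLParser):
    """Convert the small tag subset used in summary paragraphs to MediaWiki.

    Hypothetical helper for illustration only; unknown tags such as
    <summary> and <curation_status> are dropped and their text is kept.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []  # output fragments, kept in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # MediaWiki external link syntax: [url label]
            self.parts.append("[" + (dict(attrs).get("href") or "") + " ")
        elif tag == "i":
            self.parts.append("''")
        elif tag == "u":
            self.parts.append("<u>")
        elif tag == "br":
            self.parts.append("<br />")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("]")
        elif tag == "i":
            self.parts.append("''")
        elif tag == "u":
            self.parts.append("</u>")

    def handle_data(self, data):
        # Text is appended where it occurs, so leading words are not moved.
        self.parts.append(data)


def to_wiki(xml):
    """Return the MediaWiki text for one summary paragraph."""
    parser = SummaryToWiki()
    parser.feed(xml)
    parser.close()
    return "".join(parser.parts)


sample = (
    '<summary paragraph_no="1">The <i>abpA</i> gene encodes alpha-actinin '
    '(<a href="/gene/DDB_G0269100">abpC</a>).<br/>Next.</summary>'
)
result = to_wiki(sample)
print(result)
```

The shortened `sample` is modelled on the second input above. The point to check is that the result begins with `The ''abpA'' gene` on one line, instead of starting at `''abpA''` with `The` split off.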