ashishbaghudana / mthesis-ashish

MIT License

Verify GENIA corpus is good for training #1

Closed juanmirocks closed 9 years ago

juanmirocks commented 9 years ago
ashishbaghudana commented 9 years ago

Identify correct GENIA version

Distinction between protein and gene-name

Current Software Tools

ashishbaghudana commented 9 years ago
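Statistics with Gimli

| Metric | Protein | DNA | RNA | Cell type | Cell line | Overall |
|---|---|---|---|---|---|---|
| Precision | 72.52 | 74.95 | 70.34 | 82.08 | 63.91 | 73.96 |
| Recall | 78.33 | 65.44 | 70.34 | 63.20 | 58.80 | 72.17 |
| F-measure | 75.31 | 69.87 | 70.34 | 71.41 | 61.25 | 73.05 |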
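
As a quick sanity check, the F-measure row is consistent with the balanced F-measure, i.e. the harmonic mean of precision and recall, F = 2PR / (P + R). A minimal sketch in Python; the function name `f_measure` is only illustrative and is not part of Gimli, and the numbers are the ones reported above:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs in percent, copied from the Gimli statistics above.
reported = {
    "Protein": (72.52, 78.33),
    "DNA": (74.95, 65.44),
    "RNA": (70.34, 70.34),
    "Cell type": (82.08, 63.20),
    "Cell line": (63.91, 58.80),
    "Overall": (73.96, 72.17),
}

for name, (p, r) in reported.items():
    print(f"{name}: {f_measure(p, r):.2f}")
```

Rounded to two decimals, the computed values match the reported F-measure row.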
juanmirocks commented 9 years ago

Seems promising. Please, list a few sample annotations of protein & DNA to make sure we understand what's annotated.

ashishbaghudana commented 9 years ago

Example Recognition of Protein and DNA in the same sentence.

For reference, the sentence is:

IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.

Word Tag
IL-2 B-DNA
gene I-DNA
expression O
and O
NF-kappa B-protein
B I-protein
activation O
through O
CD28 B-protein
requires O
reactive O
oxygen O
production O
by O
5-lipoxygenase B-protein
. O
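
To read entity mentions off a tagging like the one above, the B-/I-/O tags can be grouped into phrases. The following is a minimal sketch, not the script used in this thread; the function name `bio_to_entities` is hypothetical, and it assumes a stray `I-` tag (one that does not continue the current entity type) should open a new entity:

```python
from typing import List, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, phrase) pairs."""
    entities: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # Start a new entity (a stray I- tag is treated like a B- tag).
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continue the current entity.
            current_tokens.append(token)
        else:
            # "O": close any open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


# The tagged sentence from the table above.
pairs = [
    ("IL-2", "B-DNA"), ("gene", "I-DNA"), ("expression", "O"), ("and", "O"),
    ("NF-kappa", "B-protein"), ("B", "I-protein"), ("activation", "O"),
    ("through", "O"), ("CD28", "B-protein"), ("requires", "O"),
    ("reactive", "O"), ("oxygen", "O"), ("production", "O"), ("by", "O"),
    ("5-lipoxygenase", "B-protein"), (".", "O"),
]
tokens = [word for word, _ in pairs]
tags = [tag for _, tag in pairs]
print(bio_to_entities(tokens, tags))
```

For this sentence the sketch yields one DNA mention (`IL-2 gene`) and three protein mentions (`NF-kappa B`, `CD28`, `5-lipoxygenase`).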
ashishbaghudana commented 9 years ago

Example of Protein DNA Interaction

For reference, the sentence is:

IL-2 responsiveness, on the other hand, depends on a 78-nucleotide segment 1.3 kilobases upstream of the major transcription start site.

Word Tag
IL-2 B-protein
responsiveness O
, O
on O
the O
other O
hand O
, O
depends O
on O
a O
78-nucleotide B-DNA
segment I-DNA
1.3 O
kilobases O
upstream O
of O
the O
major B-DNA
transcription I-DNA
start I-DNA
site I-DNA
. O
ashishbaghudana commented 9 years ago

Example of Protein RNA Interaction

For reference, the sentence is:

IL-1 induces a rapid, protein synthesis-independent appearance of IL-2R alpha mRNA that is blocked by inhibitors of NF-kappa B activation.

Word Tag
IL-1 B-protein
induces O
a O
rapid O
, O
protein O
synthesis-independent O
appearance O
of O
IL-2R B-RNA
alpha I-RNA
mRNA I-RNA
that O
is O
blocked O
by O
inhibitors O
of O
NF-kappa B-protein
B I-protein
activation O
. O
juanmirocks commented 9 years ago

Great,

please compile & summarize the samples.

  • A list with just protein names, as in: A, B, C, D, …
  • A list with just DNA names.
  • A list with just RNA names.
  • A list with protein <-> DNA interactions.
  • A list with protein <-> RNA interactions.

As for the interactions lists, I suggest you markdown-style the mentioned entities (with strong or code) to better visualize them

Let’s aim for 10+ items for the names lists and ~5 for the interactions.

On Mon, Jun 22, 2015 at 12:02 PM Ashish Baghudana notifications@github.com wrote:

Example of Protein RNA Interaction

  • IL-2R alpha mRNA annotated as RNA.
  • IL-1 and NF-kappa B annotated as protein.


ashishbaghudana commented 9 years ago

Question 1: Do you wish that I manually extract names of proteins, DNA and RNA from the JNLPBA dataset?

In the meantime, I wrote a script to extract all protein, DNA and RNA phrases from the training data. I've put them up here. Let me know if that's what you're looking for.

Question 2: What do you mean by a list of Protein <=> DNA / RNA interactions? This data is not part of the JNLPBA dataset. Like the previous one, should I manually annotate interactions from these sentences?

ashishbaghudana commented 9 years ago
DNA:
  1. The purified NF-GM2 consists of 50 (p50) and 65 (p65) kDa polypeptides and has a binding activity specific for both the GM-CSF and immunoglobulin kappa (GGAAAGTCCC) enhancers.
  2. Identification of a human LIM-Hox gene, hLH-2, aberrantly expressed in chronic myelogenous leukaemia and located on 9q33-34.1.
  3. By screening a T cell cDNA library, we identified a novel ets transcription factor that binds RBTN-2.
  4. We find that mutation of these elements, and particularly the GATA elements, results in a decrease or complete loss of DNase I hypersensitivity.
  5. Subsequently, a 1.8 kilobase (kb) fragment of the CIITA promoter was isolated and sequenced.
  6. Although p45 mRNA is transcribed from two different promoters, aNF-E2 promoter and fNF-E2 promoter, in erythroid and megakaryocytic lineage cells, p45 mRNA is transcribed only from aNF-E2 promoter.
RNA:
  1. Abundant expression of erythroid transcription factor P45 NF-E2 mRNA in human peripheral granurocytes.
  2. Although p45 mRNA is transcribed from two different promoters, aNF-E2 promoter and fNF-E2 promoter, in erythroid and megakaryocytic lineage cells, p45 mRNA is transcribed only from aNF-E2 promoter.
  3. HL-60 cells were found to express mafK mRNA, indicating the presence of genuine NF-E2 complex in the cells.
  4. The helper activity for IgE synthesis by the CD27/CD70 interaction did not contribute to the enhancement of germline epsilon transcripts.
  5. The tissular patterns of CIITA and MHC class II gene expression are tightly correlated: CIITA mRNA is highly expressed in B cells, and is induced by interferon gamma (IFN-gamma) in macrophage and epithelial cell lines.
  6. CNI-1493 blocked neither the lipopolysaccharide (LPS) -induced increases in the expression of the TNF mRNA nor the translocation of nuclear factor NF-kappa B to the nucleus in macrophages activated by 15 min of LPS stimulation, indicating that CNI-1493 does not interfere with early NF-kappa B-mediated transcriptional regulation of TNF.
Proteins:
  1. Despite these findings, the molecular mechanisms by which Ets and NF-KappaB/NFAT proteins cooperatively regulate inducible T-cell gene expression remained unknown.
  2. Moreover, the ACH-2 cells treated with HOCl or H2O2 released tumor necrosis factor-alpha (TNF-alpha) in the supernatants.
  3. Intrafollicular CD57+ cells did not stain for Bcl-6, and were also depleted in AITL/GC.
  4. First, we showed that viral binding induced a number of immunoregulatory genes (IL-1beta, A20, NF-kappaB-p105/p50, and IkappaBalpha) in unactivated monocytes and that neutralizing Abs to the major HCMV glycoproteins, gB(UL75), inhibited the induction of these genes.
  5. Electrophoretically purified p50 alone can form a protein-DNA complex, but in the mixture, p50 associated preferentially with p65 to form the NF-GM2 complex.

Summary: Tabulated Names of Proteins, DNA and RNA


| Proteins | DNA | RNA |
|---|---|---|
| Ets and NF-KappaB/NFAT proteins | immunoglobulin kappa (GGAAAGTCCC) enhancers | erythroid transcription factor P45 NF-E2 mRNA |
| tumor necrosis factor-alpha (TNF-alpha) | human LIM-Hox gene | p45 mRNA |
| Bcl-6 | hLH-2 | mafK mRNA |
| IL-1beta | 9q33-34.1 | germline epsilon transcripts |
| A20 | cDNA library | CIITA mRNA |
| NF-kappaB-p105/p50 | GATA elements | TNF mRNA |
| IkappaBalpha | 1.8 kilobase (kb) fragment | |
| HCMV glycoproteins | CIITA promoter | |
| gB | aNF-E2 promoter | |
| UL75 | fNF-E2 promoter | |

Observations


  1. RNA names seem to follow the pattern [\w]+ mRNA
  2. More often than not, DNA names tend to end with
    • gene
    • fragment
    • enhancers
    • promoters
  3. These two rules might simplify and increase accuracy of recognition of both DNA and RNA fragments.
  4. Proteins on the other hand are very diverse, and recognizing protein entities would need more training.
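
The suffix observations above can be tried out as a small rule-based check. This is only a sketch under the assumption that matching on the last word of a phrase is enough; the names `guess_type`, `RNA_PATTERN` and `DNA_PATTERN` are made up for illustration and do not come from Gimli or the JNLPBA tooling:

```python
import re
from typing import Optional

# Rule 1: RNA names end in "mRNA".
RNA_PATTERN = re.compile(r"\bmRNA$")
# Rule 2: DNA names end in gene / fragment / enhancer(s) / promoter(s).
DNA_PATTERN = re.compile(r"\b(?:gene|fragment|enhancer|promoter)s?$")


def guess_type(phrase: str) -> Optional[str]:
    """Return "RNA" or "DNA" if a suffix rule fires, otherwise None."""
    phrase = phrase.strip()
    if RNA_PATTERN.search(phrase):
        return "RNA"
    if DNA_PATTERN.search(phrase):
        return "DNA"
    return None


# A few names from the summary table above.
for name in [
    "p45 mRNA",
    "CIITA promoter",
    "human LIM-Hox gene",
    "germline epsilon transcripts",
    "hLH-2",
    "Bcl-6",
]:
    print(f"{name!r}: {guess_type(name)}")
```

The rules miss names such as `germline epsilon transcripts` (RNA) and `hLH-2` (DNA), so they are probably more useful as extra features for a trained tagger than as a replacement for one.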
juanmirocks commented 9 years ago

Seems promising. OK, I misunderstood at the beginning and thought that the relation annotations were also marked. So the samples you put here are enough. Very good compilation.

I don't think we should focus on the recognition of proteins; that has been done in the past. We can use existing methods, even if we have to retrain them with this data.

Therefore, it seems that our focus will be:

ashishbaghudana commented 9 years ago

Yes, agreed. So, I've been trying to get Gimli to work and I'm currently experiencing some problems with it. Hopefully I should be able to get that to run. Gimli helps us identify proteins, DNA and RNA fragments.

However, we will need to work with annotations of an existing corpus to mark relations. One possible dataset is http://genome.jouy.inra.fr/texte/LLLchallenge/#training_download, but I am getting a 503 when I attempt to download it. The dataset description is as follows:

EDIT: The datasets are uploaded in resources/corpora/interaction

The LLL05 challenge task corpus consists of a training and test set (55+25/86 sentences, respectively) with annotations for protein/gene interactions. There are two different versions of the corpus, a basic and an enriched data set. The enriched data set contains further linguistic information (lemmas and syntactic dependencies). The training corpus is organized in two separate parts, one containing only 'simple' sentences, the other including coreferences and ellipsis. The LLL05 corpus distinguishes between agents and targets for each relation. In addition, the different types of relations are grouped (explicit action, protein-gene promoter binding, regulon family membership).

Furthermore, this is an example from their dataset:

Example
ID 11011148-1
sentence ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
words word(0,'ykuD',0,3) word(1,'was',5,7) word(2,'transcribed',9,19) word(3,'by',21,22) word(4,'SigK',24,27) word(5,'RNA',29,31) word(6,'polymerase',33,42) word(7,'from',44,47) word(8,'T4',49,50) word(9,'of',52,53) word(10,'sporulation',55,65)
agents agent(4)
targets target(0)
genic_interactions genic_interaction(4,0)
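
To see what such a record encodes, the `words` and `genic_interactions` fields can be parsed with two regular expressions. This is a rough sketch based only on the single example above; it assumes that `genic_interaction(a,t)` lists the agent index first and the target index second (which matches `agent(4)` and `target(0)` here), that end offsets are inclusive, and that word text never contains a single quote. The function name `parse_interactions` is hypothetical:

```python
import re
from typing import List, Tuple

WORD_RE = re.compile(r"word\((\d+),'([^']*)',(\d+),(\d+)\)")
INTERACTION_RE = re.compile(r"genic_interaction\((\d+),(\d+)\)")


def parse_interactions(words_field: str, interactions_field: str) -> List[Tuple[str, str]]:
    """Resolve genic_interaction(agent, target) indices to (agent, target) word pairs."""
    # Map word index -> word text; the character offsets are not needed here.
    words = {int(idx): text for idx, text, _start, _end in WORD_RE.findall(words_field)}
    return [
        (words[int(agent)], words[int(target)])
        for agent, target in INTERACTION_RE.findall(interactions_field)
    ]


# Fields copied from the example record above.
sentence = "ykuD was transcribed by SigK RNA polymerase from T4 of sporulation."
words_field = (
    "word(0,'ykuD',0,3) word(1,'was',5,7) word(2,'transcribed',9,19) "
    "word(3,'by',21,22) word(4,'SigK',24,27) word(5,'RNA',29,31) "
    "word(6,'polymerase',33,42) word(7,'from',44,47) word(8,'T4',49,50) "
    "word(9,'of',52,53) word(10,'sporulation',55,65)"
)
interactions_field = "genic_interaction(4,0)"

print(parse_interactions(words_field, interactions_field))
```

For the example record this resolves to a single interaction with `SigK` as the agent and `ykuD` as the target.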