Verify GENIA corpus is good for training

juanmirocks commented 9 years ago

[x] Identify correct GENIA version --> We use the JNLPBA corpus
[x] Identify whether it treats protein and gene names as different classes
[x] Identify what DNA really means and what protein really means --> (See tables below). There is a clear distinction between the two

ashishbaghudana commented 9 years ago

Identify correct GENIA version

Intending to use JNLPBA Corpus.
Used for the Bio-Entity Recognition Task at BioNLP/NLPBA 2004

Distinction between protein and gene-name

The training set has 2000 MEDLINE abstracts annotated with five classes: protein, DNA, RNA, cell line and cell type.
The test set has 404 MEDLINE abstracts.
Annotations are represented using the IOB2 notation.

Current Software Tools

Gimli, an open-source tool (http://bioinformatics.ua.pt/software/gimli/), achieved an overall F-measure of 73.05% on the JNLPBA corpus.

ashishbaghudana commented 9 years ago

Statistics with Gimli

Metric (%)	Protein	DNA	RNA	Cell type	Cell line	Overall
Precision	72.52	74.95	70.34	82.08	63.91	73.96
Recall	78.33	65.44	70.34	63.20	58.80	72.17
F-measure	75.31	69.87	70.34	71.41	61.25	73.05
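
As a sanity check on the table above, the F-measure row can be recomputed from the precision and recall rows. This is a minimal sketch that assumes Gimli reports the balanced F-measure (F1, the harmonic mean of precision and recall); the thread does not state this explicitly. The function name `f_measure` is illustrative, and the values are transcribed from the table with decimal points.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) in percent, transcribed from the table above.
scores = {
    "Protein": (72.52, 78.33),
    "DNA": (74.95, 65.44),
    "RNA": (70.34, 70.34),
    "Cell type": (82.08, 63.20),
    "Cell line": (63.91, 58.80),
    "Overall": (73.96, 72.17),
}

for name, (p, r) in scores.items():
    print(f"{name}: {f_measure(p, r):.2f}")
```

Under that assumption, the recomputed values agree with the reported F-measure row to two decimals.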

juanmirocks commented 9 years ago

Seems promising. Please, list a few sample annotations of protein & DNA to make sure we understand what's annotated.

ashishbaghudana commented 9 years ago

Example Recognition of Protein and DNA in the same sentence.

IL-2 gene is annotated as a DNA fragment (in some sense, equivalent to a gene name); the word "expression" is left outside the entity.
NF-kappa B, CD28 and 5-lipoxygenase are annotated as proteins.

For reference, the sentence is:

IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.

Word	Tag
IL-2	B-DNA
gene	I-DNA
expression	O
and	O
NF-kappa	B-protein
B	I-protein
activation	O
through	O
CD28	B-protein
requires	O
reactive	O
oxygen	O
production	O
by	O
5-lipoxygenase	B-protein
.	O
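
To make the IOB2 notation concrete, here is a minimal sketch that collapses the word/tag pairs above into entity phrases. The function name `iob2_to_spans` is hypothetical and is not part of Gimli or the JNLPBA distribution. A `B-` tag opens an entity, a matching `I-` tag continues it, and anything else closes it.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse IOB2 tags into (entity text, entity class) pairs."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_class: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current:
                spans.append((" ".join(current), current_class))
            current, current_class = [token], tag[2:]
        elif tag.startswith("I-") and current_class == tag[2:]:
            current.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current:
                spans.append((" ".join(current), current_class))
            current, current_class = [], None

    if current:
        spans.append((" ".join(current), current_class))
    return spans


tokens = ["IL-2", "gene", "expression", "and", "NF-kappa", "B", "activation",
          "through", "CD28", "requires", "reactive", "oxygen", "production",
          "by", "5-lipoxygenase", "."]
tags = ["B-DNA", "I-DNA", "O", "O", "B-protein", "I-protein", "O",
        "O", "B-protein", "O", "O", "O", "O",
        "O", "B-protein", "O"]

spans = iob2_to_spans(tokens, tags)
for text, label in spans:
    print(f"{label}\t{text}")
```

For this sentence the sketch yields one DNA phrase (IL-2 gene) and three protein phrases (NF-kappa B, CD28, 5-lipoxygenase).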

ashishbaghudana commented 9 years ago

Example of Protein DNA Interaction

IL-2 is tagged as protein
The 78-nucleotide segment and the major transcription start site are annotated as separate DNA fragments. (In my opinion, the relationship is between IL-2 and a 78-nt segment 1.3 kilobases upstream of the transcription start site, i.e. there should be just one fragment, not two.)

For reference, the sentence is:

IL-2 responsiveness, on the other hand, depends on a 78-nucleotide segment 1.3 kilobases upstream of the major transcription start site.

Word	Tag
IL-2	B-protein
responsiveness	O
,	O
on	O
the	O
other	O
hand	O
,	O
depends	O
on	O
a	O
78-nucleotide	B-DNA
segment	I-DNA
1.3	O
kilobases	O
upstream	O
of	O
the	O
major	B-DNA
transcription	I-DNA
start	I-DNA
site	I-DNA
.	O

ashishbaghudana commented 9 years ago

Example of Protein RNA Interaction

IL-2R alpha mRNA annotated as RNA.
IL-1 and NF-kappa B annotated as protein.

For reference, the sentence is:

IL-1 induces a rapid, protein synthesis-independent appearance of IL-2R alpha mRNA that is blocked by inhibitors of NF-kappa B activation.

Word	Tag
IL-1	B-protein
induces	O
a	O
rapid	O
,	O
protein	O
synthesis-independent	O
appearance	O
of	O
IL-2R	B-RNA
alpha	I-RNA
mRNA	I-RNA
that	O
is	O
blocked	O
by	O
inhibitors	O
of	O
NF-kappa	B-protein
B	I-protein
activation	O
.	O

juanmirocks commented 9 years ago

Great,

please compile & summarize the samples.

A list with just protein names, as in: A, B, C, D, …
A list with just DNA names.
A list with just RNA names.
A list with protein <-> DNA interactions.
A list with protein <-> RNA interactions.

As for the interactions lists, I suggest you markdown-style the mentioned entities (with strong or code) to better visualize them

Let’s aim for 10+ items for the name lists and ~5 for the interactions.


ashishbaghudana commented 9 years ago

Question 1: Do you wish that I manually extract names of proteins, DNA and RNA from the JNLPBA dataset?

In the meantime, I wrote a script to extract all protein, DNA and RNA phrases from the training data. I've put them up here. Let me know if that's what you're looking for.

Question 2: What do you mean by a list of Protein <=> DNA / RNA interactions? This data is not part of the JNLPBA dataset. Like the previous one, should I manually annotate interactions from these sentences?

ashishbaghudana commented 9 years ago

DNA:

The purified NF-GM2 consists of 50 (p50) and 65 (p65) kDa polypeptides and has a binding activity specific for both the GM-CSF and immunoglobulin kappa (GGAAAGTCCC) enhancers.
Identification of a human LIM-Hox gene, hLH-2, aberrantly expressed in chronic myelogenous leukaemia and located on 9q33-34.1.
By screening a T cell cDNA library, we identified a novel ets transcription factor that binds RBTN-2.
We find that mutation of these elements, and particularly the GATA elements, results in a decrease or complete loss of DNase I hypersensitivity.
Subsequently, a 1.8 kilobase (kb) fragment of the CIITA promoter was isolated and sequenced.
Although p45 mRNA is transcribed from two different promoters, aNF-E2 promoter and fNF-E2 promoter, in erythroid and megakaryocytic lineage cells, p45 mRNA is transcribed only from aNF-E2 promoter.

RNA:

Abundant expression of erythroid transcription factor P45 NF-E2 mRNA in human peripheral granurocytes.
Although p45 mRNA is transcribed from two different promoters, aNF-E2 promoter and fNF-E2 promoter, in erythroid and megakaryocytic lineage cells, p45 mRNA is transcribed only from aNF-E2 promoter.
HL-60 cells were found to express mafK mRNA, indicating the presence of genuine NF-E2 complex in the cells.
The helper activity for IgE synthesis by the CD27/CD70 interaction did not contribute to the enhancement of germline epsilon transcripts.
The tissular patterns of CIITA and MHC class II gene expression are tightly correlated: CIITA mRNA is highly expressed in B cells, and is induced by interferon gamma (IFN-gamma) in macrophage and epithelial cell lines.
CNI-1493 blocked neither the lipopolysaccharide (LPS) -induced increases in the expression of the TNF mRNA nor the translocation of nuclear factor NF-kappa B to the nucleus in macrophages activated by 15 min of LPS stimulation, indicating that CNI-1493 does not interfere with early NF-kappa B-mediated transcriptional regulation of TNF.

Proteins:

Despite these findings, the molecular mechanisms by which Ets and NF-KappaB/NFAT proteins cooperatively regulate inducible T-cell gene expression remained unknown.
Moreover, the ACH-2 cells treated with HOCl or H2O2 released tumor necrosis factor-alpha (TNF-alpha) in the supernatants.
Intrafollicular CD57+ cells did not stain for Bcl-6, and were also depleted in AITL/GC.
First, we showed that viral binding induced a number of immunoregulatory genes (IL-1beta, A20, NF-kappaB-p105/p50, and IkappaBalpha) in unactivated monocytes and that neutralizing Abs to the major HCMV glycoproteins, gB(UL75), inhibited the induction of these genes.
Electrophoretically purified p50 alone can form a protein-DNA complex, but in the mixture, p50 associated preferentially with p65 to form the NF-GM2 complex.

Summary: Tabulated Names of Proteins, DNA and RNA

Proteins	DNA	RNA
Ets and NF-KappaB/NFAT proteins	immunoglobulin kappa (GGAAAGTCCC) enhancers	erythroid transcription factor P45 NF-E2 mRNA
tumor necrosis factor-alpha (TNF-alpha)	human LIM-Hox gene	p45 mRNA
Bcl-6	hLH-2	mafK mRNA
IL-1beta	9q33-34.1	germline epsilon transcripts
A20	cDNA library	CIITA mRNA
NF-kappaB-p105/p50	GATA elements	TNF mRNA
IkappaBalpha	1.8 kilobase (kb) fragment
HCMV glycoproteins	CIITA promoter
gB	aNF-E2 promoter
UL75	fNF-E2 promoter

Observations

RNA names mostly follow the pattern \w+ mRNA (germline epsilon transcripts is the one exception in the table above).
More often than not, DNA names end with one of:
- gene
- fragment
- enhancer(s)
- promoter(s)
These two rules might simplify the recognition of both DNA and RNA fragments and increase its accuracy.
Proteins, on the other hand, are very diverse, and recognizing protein entities would need more training.
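
The two suffix rules above can be sketched as regular expressions. This is only an illustration of the heuristic, not a tested recognizer: the pattern names and the `guess_class` helper are hypothetical, and the rules are derived solely from the handful of names in the summary table.

```python
import re
from typing import Optional

# Hypothetical patterns derived from the observations above.
RNA_RE = re.compile(r"\w+ mRNA$")
DNA_RE = re.compile(r"\b(?:gene|fragment|enhancer|promoter)s?$")


def guess_class(phrase: str) -> Optional[str]:
    """Guess 'RNA' or 'DNA' from the phrase ending; None if no rule applies."""
    phrase = phrase.strip()
    if RNA_RE.search(phrase):
        return "RNA"
    if DNA_RE.search(phrase):
        return "DNA"
    return None


samples = [
    "p45 mRNA",
    "erythroid transcription factor P45 NF-E2 mRNA",
    "germline epsilon transcripts",
    "human LIM-Hox gene",
    "immunoglobulin kappa (GGAAAGTCCC) enhancers",
    "CIITA promoter",
    "hLH-2",
    "Bcl-6",
]
for phrase in samples:
    print(f"{guess_class(phrase)}\t{phrase}")
```

Names such as hLH-2, 9q33-34.1 and germline epsilon transcripts match neither rule, so suffix matching alone would miss them. The rules are probably more useful as features for a trained tagger than as a standalone recognizer.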

juanmirocks commented 9 years ago

Seems promising. OK, I misunderstood at the beginning and thought that the relation annotations were also marked. So the samples you put here are enough. Very good compilation.

I don't think we should focus on the recognition of proteins; that has been done in the past. We can use existing methods, even if we have to retrain them with this data.

Therefore, it seems that our focus will be:

Recognition of DNA (fragments)
Recognition of mRNA (fragments)
Manual annotation of an existing corpus to mark the relations. Do you agree on this one?
Method for the relationship extraction

ashishbaghudana commented 9 years ago

Yes, agreed. So, I've been trying to get Gimli to work and I'm currently experiencing some problems with it. Hopefully I should be able to get that to run. Gimli helps us identify proteins, DNA and RNA fragments.

However, we will also need to annotate an existing corpus to mark relations. One possible dataset is http://genome.jouy.inra.fr/texte/LLLchallenge/#training_download, although I get an HTTP 503 error when I attempt to download it. The dataset description is as follows:

EDIT: The datasets are uploaded in resources/corpora/interaction

The LLL05 challenge task corpus consists of a training and test set (55+25/86 sentences, respectively) with annotations for protein/gene interactions. There are two different versions of the corpus, a basic and an enriched data set. The enriched data set contains further linguistic information (lemmas and syntactic dependencies). The training corpus is organized in two separate parts, one containing only 'simple' sentences, the other including coreferences and ellipsis. The LLL05 corpus distinguishes between agents and targets for each relation. In addition, the different types of relations are grouped (explicit action, protein-gene promoter binding, regulon family membership).

Furthermore, this is an example from their dataset:

Example
ID 11011148-1
sentence ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
words word(0,'ykuD',0,3) word(1,'was',5,7) word(2,'transcribed',9,19) word(3,'by',21,22) word(4,'SigK',24,27) word(5,'RNA',29,31) word(6,'polymerase',33,42) word(7,'from',44,47) word(8,'T4',49,50) word(9,'of',52,53) word(10,'sporulation',55,65)
agents agent(4)
targets target(0)
genic_interactions genic_interaction(4,0)
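
To show how this record format could be read, here is a minimal sketch of a parser for one example. It assumes each line is a field name followed by whitespace and a value, exactly as shown above; the actual LLL05 files may use different separators. The function `parse_example` and the regular expressions are hypothetical helpers, not part of any LLL05 tooling.

```python
import re
from typing import Dict

WORD_RE = re.compile(r"word\((\d+),'(.*?)',(\d+),(\d+)\)")
INTERACTION_RE = re.compile(r"genic_interaction\((\d+),(\d+)\)")

EXAMPLE = """\
ID 11011148-1
sentence ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
words word(0,'ykuD',0,3) word(1,'was',5,7) word(2,'transcribed',9,19) word(3,'by',21,22) word(4,'SigK',24,27) word(5,'RNA',29,31) word(6,'polymerase',33,42) word(7,'from',44,47) word(8,'T4',49,50) word(9,'of',52,53) word(10,'sporulation',55,65)
agents agent(4)
targets target(0)
genic_interactions genic_interaction(4,0)
"""


def parse_example(block: str) -> Dict[str, object]:
    """Parse one record into its ID, (agent, target) word pairs and an offset check."""
    fields: Dict[str, str] = {}
    for line in block.strip().splitlines():
        parts = line.split(None, 1)  # field name, then the rest of the line
        if parts:
            fields[parts[0]] = parts[1] if len(parts) > 1 else ""

    sentence = fields.get("sentence", "")
    words: Dict[int, str] = {}
    offsets_ok = True
    for idx, text, start, end in WORD_RE.findall(fields.get("words", "")):
        words[int(idx)] = text
        # In this example the offsets are inclusive character positions.
        offsets_ok = offsets_ok and sentence[int(start):int(end) + 1] == text

    pairs = [
        (words[int(agent)], words[int(target)])
        for agent, target in INTERACTION_RE.findall(fields.get("genic_interactions", ""))
    ]
    return {"id": fields.get("ID"), "pairs": pairs, "offsets_ok": offsets_ok}


result = parse_example(EXAMPLE)
print(result)
```

For the record above, the single interaction resolves to SigK as the agent and ykuD as the target, matching the `agents` and `targets` fields.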
