Metric | Protein | DNA | RNA | Cell type | Cell line | Overall |
---|---|---|---|---|---|---|
Precision | 72.52 | 74.95 | 70.34 | 82.08 | 63.91 | 73.96 |
Recall | 78.33 | 65.44 | 70.34 | 63.20 | 58.80 | 72.17 |
F-measure | 75.31 | 69.87 | 70.34 | 71.41 | 61.25 | 73.05 |
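As a quick sanity check on the table, here is a minimal sketch that recomputes the F-measure row from the precision and recall rows. It assumes the reported F-measure is the balanced one (harmonic mean of precision and recall); the thread does not say so explicitly, but the numbers are consistent with it. The function and variable names are illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table, in percent.
scores = {
    "Protein": (72.52, 78.33),
    "DNA": (74.95, 65.44),
    "RNA": (70.34, 70.34),
    "Cell type": (82.08, 63.20),
    "Cell line": (63.91, 58.80),
    "Overall": (73.96, 72.17),
}

# Recomputed F-measure per entity type, rounded like the table.
f_scores = {name: round(f_measure(p, r), 2) for name, (p, r) in scores.items()}

for name, value in f_scores.items():
    print(f"{name}: {value:.2f}")
```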
Seems promising. Please list a few sample annotations of protein and DNA so we can make sure we understand what is annotated.
IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.
Word | Tag |
---|---|
IL-2 | B-DNA |
gene | I-DNA |
expression | O |
and | O |
NF-kappa | B-protein |
B | I-protein |
activation | O |
through | O |
CD28 | B-protein |
requires | O |
reactive | O |
oxygen | O |
production | O |
by | O |
5-lipoxygenase | B-protein |
. | O |
IL-2 responsiveness, on the other hand, depends on a 78-nucleotide segment 1.3 kilobases upstream of the major transcription start site.
Word | Tag |
---|---|
IL-2 | B-protein |
responsiveness | O |
, | O |
on | O |
the | O |
other | O |
hand | O |
, | O |
depends | O |
on | O |
a | O |
78-nucleotide | B-DNA |
segment | I-DNA |
1.3 | O |
kilobases | O |
upstream | O |
of | O |
the | O |
major | B-DNA |
transcription | I-DNA |
start | I-DNA |
site | I-DNA |
. | O |
IL-1 induces a rapid, protein synthesis-independent appearance of IL-2R alpha mRNA that is blocked by inhibitors of NF-kappa B activation.
Word | Tag |
---|---|
IL-1 | B-protein |
induces | O |
a | O |
rapid | O |
, | O |
protein | O |
synthesis-independent | O |
appearance | O |
of | O |
IL-2R | B-RNA |
alpha | I-RNA |
mRNA | I-RNA |
that | O |
is | O |
blocked | O |
by | O |
inhibitors | O |
of | O |
NF-kappa | B-protein |
B | I-protein |
activation | O |
. | O |
Great, please compile and summarize the samples:
- A list with just protein names, as in: A, B, C, D, …
- A list with just DNA names.
- A list with just RNA names.
- A list with protein <-> DNA interactions.
- A list with protein <-> RNA interactions.
For the interaction lists, I suggest you markdown-style the mentioned entities (with **strong** or `code`) to better visualize them.
Let's aim for 10+ items for the name lists and ~5 for the interaction lists.
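One way to do the markdown styling automatically is sketched below. It assumes the sentence is available as parallel lists of tokens and BIO tags, as in the sample tables earlier in the thread; the `highlight` helper and its `style` parameter are hypothetical names for illustration, not part of any existing tool.

```python
def highlight(tokens, tags, style="**"):
    """Join tokens into one string, wrapping every BIO-tagged entity in `style`."""
    pieces = []
    current = []  # tokens of the entity currently being built

    def flush():
        # Emit the pending entity (if any) wrapped in the markdown marker.
        if current:
            pieces.append(style + " ".join(current) + style)
            current.clear()

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current.append(token)
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            flush()
            pieces.append(token)
    flush()
    return " ".join(pieces)


tokens = ["IL-2", "gene", "expression", "and", "NF-kappa", "B", "activation"]
tags = ["B-DNA", "I-DNA", "O", "O", "B-protein", "I-protein", "O"]
styled = highlight(tokens, tags)
print(styled)  # **IL-2 gene** expression and **NF-kappa B** activation
```

Passing ``style="`"`` instead renders the entities as inline code.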
On Mon, Jun 22, 2015 at 12:02 PM Ashish Baghudana notifications@github.com wrote:
Example of Protein RNA Interaction
- IL-2R alpha mRNA annotated as RNA.
- IL-1 and NF-kappa B annotated as protein.
For reference, the sentence is:
IL-1 induces a rapid, protein synthesis-independent appearance of IL-2R alpha mRNA that is blocked by inhibitors of NF-kappa B activation.
Word | Tag |
---|---|
IL-1 | B-protein |
induces | O |
a | O |
rapid | O |
, | O |
protein | O |
synthesis-independent | O |
appearance | O |
of | O |
IL-2R | B-RNA |
alpha | I-RNA |
mRNA | I-RNA |
that | O |
is | O |
blocked | O |
by | O |
inhibitors | O |
of | O |
NF-kappa | B-protein |
B | I-protein |
activation | O |
. | O |
Question 1: Do you wish that I manually extract names of proteins, DNA and RNA from the JNLPBA dataset?
That said, I wrote a script that extracts all protein, DNA and RNA phrases from the training data. I've put them up here. Let me know if that's what you're looking for.
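The script itself is not shown in this thread. The following is only a minimal sketch of how such an extraction could look, assuming the training data is plain text with one `token<TAB>tag` pair per line and a blank line between sentences; `extract_phrases` is a hypothetical name.

```python
from collections import defaultdict


def extract_phrases(lines):
    """Collect unique entity phrases per type from 'token<TAB>tag' lines."""
    phrases = defaultdict(set)
    tokens, label = [], None

    def flush():
        # Store the entity collected so far, then reset the buffer.
        nonlocal tokens, label
        if tokens:
            phrases[label].add(" ".join(tokens))
        tokens, label = [], None

    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():  # a blank line ends the sentence
            flush()
            continue
        parts = line.split("\t")
        token, tag = parts[0], parts[-1]
        if tag.startswith("B-"):
            flush()
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            tokens.append(token)
        else:
            flush()
    flush()
    return phrases


sample = [
    "IL-1\tB-protein", "induces\tO",
    "IL-2R\tB-RNA", "alpha\tI-RNA", "mRNA\tI-RNA",
    "",
    "NF-kappa\tB-protein", "B\tI-protein",
]
result = extract_phrases(sample)
print(sorted(result["protein"]), sorted(result["RNA"]))
```

Because the function only iterates over lines, an open file handle for the training file can be passed in place of the `sample` list.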
Question 2: What do you mean by a list of Protein <=> DNA / RNA interactions? This data is not part of the JNLPBA dataset. Like the previous one, should I manually annotate interactions from these sentences?
Proteins | DNA | RNA |
---|---|---|
Ets and NF-KappaB/NFAT proteins | immunoglobulin kappa (GGAAAGTCCC) enhancers | erythroid transcription factor P45 NF-E2 mRNA |
tumor necrosis factor-alpha (TNF-alpha) | human LIM-Hox gene | p45 mRNA |
Bcl-6 | hLH-2 | mafK mRNA |
IL-1beta | 9q33-34.1 | germline epsilon transcripts |
A20 | cDNA library | CIITA mRNA |
NF-kappaB-p105/p50 | GATA elements | TNF mRNA |
IkappaBalpha | 1.8 kilobase (kb) fragment | |
HCMV glycproteins | CIITA promoter | |
gB | aNF-E2 promoter | |
UL75 | fNF-E2 promoter |
[w]+ mRNA
Seems promising. OK, I misunderstood at the beginning and thought that the relation annotations were also marked. So the samples you put here are enough. Very good compilation.
I don't think we should focus on the recognition of proteins; that has been done in the past. We can use existing methods, even if we have to retrain them with this data.
Therefore, it seems that our focus will be:
Yes, agreed. I've been trying to get Gimli to work and I'm currently running into some problems with it, but hopefully I will get it running. Gimli helps us identify protein, DNA and RNA fragments.
However, we will need an existing corpus that is annotated with relations. One possible dataset is http://genome.jouy.inra.fr/texte/LLLchallenge/#training_download, but I am getting a 503 when I attempt to download it. The dataset description is as follows:
EDIT: The datasets are uploaded in resources/corpora/interaction
The LLL05 challenge task corpus consists of a training and test set (55+25/86 sentences, respectively) with annotations for protein/gene interactions. There are two different versions of the corpus, a basic and an enriched data set. The enriched data set contains further linguistic information (lemmas and syntactic dependencies). The training corpus is organized in two separate parts, one containing only 'simple' sentences, the other including coreferences and ellipsis. The LLL05 corpus distinguishes between agents and targets for each relation. In addition, the different types of relations are grouped (explicit action, protein-gene promoter binding, regulon family membership).
Furthermore, this is an example from their dataset:
Example
ID 11011148-1
sentence ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
words word(0,'ykuD',0,3) word(1,'was',5,7) word(2,'transcribed',9,19) word(3,'by',21,22) word(4,'SigK',24,27) word(5,'RNA',29,31) word(6,'polymerase',33,42) word(7,'from',44,47) word(8,'T4',49,50) word(9,'of',52,53) word(10,'sporulation',55,65)
agents agent(4)
targets target(0)
genic_interactions genic_interaction(4,0)
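To show how a record like this could be turned into (agent, target) word pairs, here is a rough sketch. It assumes each record follows exactly the layout of the example above and that word forms contain no single quotes; `parse_record` and the two regular expressions are illustrative names, not part of the LLL distribution.

```python
import re

# word(index,'form',start,end) entries on the "words" line.
WORD_RE = re.compile(r"word\((\d+),'(.*?)',(\d+),(\d+)\)")
# genic_interaction(agent_index,target_index) entries.
INTERACTION_RE = re.compile(r"genic_interaction\((\d+),(\d+)\)")


def parse_record(text):
    """Return (words by index, list of (agent word, target word)) for one record."""
    words = {int(i): form for i, form, _start, _end in WORD_RE.findall(text)}
    pairs = [(int(a), int(t)) for a, t in INTERACTION_RE.findall(text)]
    return words, [(words[a], words[t]) for a, t in pairs]


record = """ID 11011148-1
sentence ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
words word(0,'ykuD',0,3) word(1,'was',5,7) word(2,'transcribed',9,19) word(3,'by',21,22) word(4,'SigK',24,27) word(5,'RNA',29,31) word(6,'polymerase',33,42) word(7,'from',44,47) word(8,'T4',49,50) word(9,'of',52,53) word(10,'sporulation',55,65)
agents agent(4)
targets target(0)
genic_interactions genic_interaction(4,0)"""

words, interactions = parse_record(record)
print(interactions)  # [('SigK', 'ykuD')]
```

Here SigK is the agent and ykuD the target, matching the `agents` and `targets` lines of the record.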