clulab / bioresources

Data resources from the biomedical domain
Apache License 2.0

Added ATCC and Xia's entries #14

Closed: enoriega closed this 8 years ago

enoriega commented 8 years ago

Added two groups of items to the NER manual list:

Xia’s annotations: Annotations that we need in order to use our context data set for training. These entries have been checked and have no duplicates in any existing context KB file.

ATCC cell lines: I added these to the same file, although I feel they should be in their own standalone file. The entries in this dictionary haven’t been diffed against Cellosaurus yet.

Let’s adjust this as necessary in the pull request.
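For reference, the two checks mentioned above (no duplicates in the existing KB files, and a diff against Cellosaurus) could be sketched roughly as follows. This is only an illustration, not the procedure actually used for this PR: it assumes tab-separated KB files with the entity name in the first column, and the helper names `read_kb_names` and `names_already_in_kb` are hypothetical.

```python
import csv
import gzip
from pathlib import Path


def read_kb_names(path, name_column=0):
    """Return the set of lower-cased names found in one column of a TSV KB file.

    Assumption: the file is tab-separated, optionally gzipped, and lines
    starting with '#' are comments.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {
            row[name_column].strip().lower()
            for row in reader
            if len(row) > name_column and not row[0].startswith("#")
        }


def names_already_in_kb(new_names, kb_names):
    """Return the new names that already occur in the KB, ignoring case."""
    return sorted(n for n in new_names if n.strip().lower() in kb_names)
```

Running `names_already_in_kb` with the names from atcc.tsv and the name set read from Cellosaurus.tsv.gz would list the ATCC entries that still need to be reconciled. Exact, case-insensitive matching is a deliberate simplification; it will not catch spelling variants.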

MihaiSurdeanu commented 8 years ago

@hickst: can you please check this PR?

enoriega commented 8 years ago

What's the problem with: # HDFs UA-CLine-100087 uaz CellLine?

enoriega commented 8 years ago

The atcc.tsv file contains a new dictionary of cell lines. I feel it should be added as a standalone file instead of being appended to the NER Override file. All the entries in the file have external "first degree" context such as species, disease, cell type, etc. Once we are done extending the NER Override file, let's analyze this one.

hickst commented 8 years ago

Identified the following issues in the NER file additions:

1) Should HDFs refer to one of these:

Cellosaurus.tsv.gz:HDF-FOP      CVCL_W541       Homo sapiens
Cellosaurus.tsv.gz:HDF-FOP      CVCL_W542       Homo sapiens
Cellosaurus.tsv.gz:HDF/TERT1    CVCL_9Q55       Homo sapiens

2) Should HUVECs refer to one of these:

Cellosaurus.tsv.gz:HUVEC-C      CVCL_2959       Homo sapiens
Cellosaurus.tsv.gz:HUVEC-CS     CVCL_0F27       Homo sapiens
Cellosaurus.tsv.gz:HUVEC/TERT2  CVCL_9Q53       Homo sapiens

3) Should MCF10As refer to one of these:

Cellosaurus.tsv.gz:MCF10A       CVCL_0598       Homo sapiens
Cellosaurus.tsv.gz:MCF10A       CVCL_5555       Homo sapiens
Cellosaurus.tsv.gz:MCF10A-Er-Src        CVCL_N805       Homo sapiens
Cellosaurus.tsv.gz:MCF10A-Myc   CVCL_0411       Homo sapiens
Cellosaurus.tsv.gz:MCF10A-neo   CVCL_6C54       Homo sapiens
Cellosaurus.tsv.gz:MCF10AMy     CVCL_0411       Homo sapiens
Cellosaurus.tsv.gz:MCF10Ane     CVCL_6C54       Homo sapiens
Cellosaurus.tsv.gz:MCF10Aneo    CVCL_6C55       Homo sapiens
Cellosaurus.tsv.gz:MCF10AneoT   CVCL_5554       Homo sapiens

4) Should MECs refer to one of these:

Cellosaurus.tsv.gz:MEC  CVCL_1870       Homo sapiens
Cellosaurus.tsv.gz:MEC  CVCL_1871       Homo sapiens
Cellosaurus.tsv.gz:MEC  CVCL_B270       Homo sapiens
Cellosaurus.tsv.gz:MEC- CVCL_F938       Mus musculus
Cellosaurus.tsv.gz:MEC- CVCL_F939       Mus musculus
Cellosaurus.tsv.gz:MEC- CVCL_F940       Mus musculus
Cellosaurus.tsv.gz:MEC- CVCL_F941       Mus musculus

5) Should MEFs refer to one of these:

Cellosaurus.tsv.gz:MEF (C57BL/6)        CVCL_9115       Mus musculus
Cellosaurus.tsv.gz:MEF (C57BL/6) IRR    CVCL_9117       Mus musculus
Cellosaurus.tsv.gz:MEF (C57BL/6) MITC   CVCL_9118       Mus musculus
Cellosaurus.tsv.gz:MEF (CF-1    CVCL_5251       Mus musculus
Cellosaurus.tsv.gz:MEF (CF-1) IR        CVCL_K232       Mus musculus
Cellosaurus.tsv.gz:MEF (CF-1) MIT       CVCL_K233       Mus musculus
Cellosaurus.tsv.gz:MEF (DR4     CVCL_5277       Mus musculus
Cellosaurus.tsv.gz:MEF (DR4) MIT        CVCL_Y468       Mus musculus
Cellosaurus.tsv.gz:MEF PKCe KO  CVCL_AS81       Mus musculus
Cellosaurus.tsv.gz:MEF PKCe KO KI       CVCL_AS82       Mus musculus
Cellosaurus.tsv.gz:MEF Ulk1 -/- Ulk2 -/- (DKO) (SIM     CVCL_5A56       Mus musculus
Cellosaurus.tsv.gz:MEF Ulk1 -/- Ulk2 -/- (DKO) (SV40    CVCL_5A57       Mus musculus
Cellosaurus.tsv.gz:MEF-1 [Human myeloma]        CVCL_M515       Homo sapiens
Cellosaurus.tsv.gz:MEF-1 [Mouse fibroblast]     CVCL_4240       Mus musculus
Cellosaurus.tsv.gz:MEF-BL/6-    CVCL_9115       Mus musculus
Cellosaurus.tsv.gz:MEF1 CVCL_4240       Mus musculus

6) The entry le UA-CT-30007 conflicts with a chemical in PubChem and ChEBI.

7) The entry tcp-1 UA-CLine-100117 conflicts with a UniProt protein.
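A rough sketch of how candidate listings like the ones in items 1 through 5 could be produced: strip a trailing plural "s" from the new name and look for KB names that start with the resulting stem. The function names are hypothetical, the plural handling is a naive heuristic, and the sample rows are copied from the listings above rather than read from the real file.

```python
def singular_stem(name):
    """Drop a trailing plural 's' (e.g. 'HUVECs' -> 'HUVEC'); a naive heuristic."""
    return name[:-1] if len(name) > 1 and name.endswith("s") else name


def candidate_matches(name, kb_rows):
    """Return KB rows whose name starts with the stem of `name`, ignoring case."""
    stem = singular_stem(name).lower()
    return [row for row in kb_rows if row[0].lower().startswith(stem)]


# Rows copied from the listings above: (name, Cellosaurus ID, species).
KB_ROWS = [
    ("HDF-FOP", "CVCL_W541", "Homo sapiens"),
    ("HDF/TERT1", "CVCL_9Q55", "Homo sapiens"),
    ("HUVEC-C", "CVCL_2959", "Homo sapiens"),
    ("MCF10A", "CVCL_0598", "Homo sapiens"),
]

# Print the candidate rows for one of the new plural-form entries.
for row in candidate_matches("HDFs", KB_ROWS):
    print("\t".join(row))
```

Prefix matching over-generates on purpose: the point is to give a reviewer a short list to verify, not to decide automatically which ID a new entry should map to.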

hickst commented 8 years ago

Sorry, I forgot to push the previous comment before I left for the meeting location crisis. 1 through 6 are not necessarily problems -- I just want Xia to verify that we're not creating duplicate IDs.

enoriega commented 8 years ago

Right, thanks!

On Aug 25, 2016, at 10:36 PM, Tom Hicks notifications@github.com wrote:

Sorry, I forgot to push the previous comment before I left for the meeting location crisis. 1 through 6 are not necessarily problems -- I just want Xia to verify that we're not creating duplicate IDs.

enoriega commented 8 years ago

@hickst I updated the atcc.tsv file; it now has ~120 entries to be used as a CellLine dictionary.

MihaiSurdeanu commented 8 years ago

@hickst: can you please double check and merge?

hickst commented 8 years ago

Not merging yet: we're testing Processors and need to integrate this new KB into Reach and test before merging, so we'll be adding files and changes to this PR.