allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

How to visualize named entities in custom colors #141

Closed phosseini closed 5 years ago

phosseini commented 5 years ago

spaCy has an option that lets us use custom colors for named entity visualization. I'm trying to use the same option in scispacy for its named entities. I created two lists, one of entities and one of randomly generated colors, and put them in an options dictionary like this:

options = {"ents": entities, "colors": colors}

Here, entities is a list of the NE labels in the scispacy NER models and colors is a list of the same size. But passing these options to either displacy.serve or displacy.render (for Jupyter) does not work. I'm passing the options like this:

displacy.serve(doc, style="ent", options=options)

I wonder whether the colors option only works for spaCy's predefined named entities, or whether there's something wrong with the way I'm using it?

DeNeutoy commented 5 years ago

colors should be a dictionary mapping {"TAG": "color"}, not a list, see here:

https://spacy.io/usage/visualizers#ent
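
If the labels and colors start out as two parallel lists, as in the question above, they can be zipped into the mapping that colors expects. This is only a minimal sketch; the label names and hex values are placeholders, not the output of any model.

```python
# Placeholder labels and colors, for illustration only.
entities = ["DISEASE", "CHEMICAL"]
color_list = ["#D7BDE2", "#D2B4DE"]

# "colors" should map each label to a color string.
colors = dict(zip(entities, color_list))

# "ents" stays a list of labels; "colors" is the dict built above.
options = {"ents": entities, "colors": colors}
print(options)
```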

phosseini commented 5 years ago

Sorry, I should correct myself: colors is a dictionary whose keys are the same names as in my entities list. It still does not work.

DeNeutoy commented 5 years ago

Can you provide an actual code snippet? Doing this with custom colours has worked for me before, so I know that it works :)

The entities that scispacy detects are called ENT, do you have that in your colour dict?
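
One way to check this is to list the labels the loaded model actually produces and compare them with the keys of the colour dict. A rough sketch: label_counts and missing_colors are hypothetical helper names, and the only thing assumed about doc is that it exposes spaCy's doc.ents with a label_ on each entity. The stand-in objects and their labels exist only so the snippet runs without a model.

```python
from collections import Counter
from types import SimpleNamespace


def label_counts(doc):
    """Count how often each entity label occurs in doc.ents."""
    return Counter(ent.label_ for ent in doc.ents)


def missing_colors(doc, colors):
    """Return labels found in the doc that have no entry in the colors dict."""
    return sorted(set(label_counts(doc)) - set(colors))


# Stand-in for a processed spaCy Doc (hypothetical labels).
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="CD44", label_="LABEL_A"),
    SimpleNamespace(text="CD24", label_="LABEL_A"),
    SimpleNamespace(text="breast cancer", label_="LABEL_B"),
])

print(label_counts(fake_doc))
print(missing_colors(fake_doc, {"LABEL_A": "#E8DAEF"}))
```

With a real model, pass nlp(text) in place of fake_doc.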

phosseini commented 5 years ago

I just realized what the problem was. When I load the full spaCy pipeline model like the following:

nlp = spacy.load("en_core_sci_md")

Named entities won't be visualized, but when I load a specific NER model, like the following:

nlp = spacy.load("en_ner_bionlp13cg_md")

I do see the visualized named entities.

Then I think my question is: if I want to visualize all the available named entities in scispacy (27 named entities), should I call the NER models separately? Isn't there a way to use the full pipeline model and visualize all the named entities at once?

DeNeutoy commented 5 years ago

Again, if you provide me with a code snippet of what you did for the en_core_sci_md model, I can help - I know for a fact that it is possible to visualize entities using this model.

In terms of combining the other models, this is a bit tricky in spaCy because the NER model is stateful (meaning that if you run more than one NER model in a pipeline, it doesn't work properly). So unfortunately there's no easy way to do that, sorry!
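
A possible workaround, sketched here as an assumption rather than something confirmed in this thread: run each NER model separately on the same text, collect the spans as (start_char, end_char, label) tuples, and combine them into the dict that displaCy's manual mode accepts. merge_entities is a hypothetical helper, and the two span lists are made-up stand-ins for model output. On overlap it simply keeps the longer span, which may not be the right policy for every use.

```python
def merge_entities(text, span_lists):
    """Combine (start_char, end_char, label) spans from several models.

    Overlapping spans are resolved by keeping the longer one; ties go to
    the span that starts first, then to the earlier list.
    """
    candidates = [span for spans in span_lists for span in spans]
    # Longest spans first, then by start offset (sort is stable).
    candidates.sort(key=lambda s: (-(s[1] - s[0]), s[0]))

    kept = []
    for start, end, label in candidates:
        overlaps = any(start < k_end and k_start < end
                       for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, label))

    kept.sort()  # put entities back in document order
    return {
        "text": text,
        "ents": [{"start": s, "end": e, "label": lab} for s, e, lab in kept],
        "title": None,
    }


text = "BRCA1 is expressed in breast cancer cells."
# Hypothetical outputs of two separately loaded models.
model_a = [(0, 5, "GENE_OR_GENE_PRODUCT"), (22, 35, "CANCER")]
model_b = [(0, 5, "PROTEIN"), (22, 41, "CELL_TYPE")]

merged = merge_entities(text, [model_a, model_b])
print(merged["ents"])
```

The result can then be passed as displacy.render(merged, style="ent", manual=True, options=options).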

phosseini commented 5 years ago

@DeNeutoy This is the exact code I'm using:

import scispacy
import spacy
from spacy import displacy

# nlp = spacy.load("en_ner_jnlpba_md")
nlp = spacy.load("en_core_sci_md")

text = """The purpose of our study was to learn the distribution characteristics of cancer stem cell markers (CD24, CD44) in invasive carcinomas with different grade and molecular subtype. For research was used 1324 postoperative breast cancer samples, from which were selected 393 patient with invasive ductal carcinoma samples examined 2008-2012 in Laboratory of "Pathgeo Union of Pathologist" is and N.Kipshidze Central University Hospital. The age range is between 23-73 year. For all cases were performed immunohistochemical study using ER, PR, Her2, Ki67, CK5- molecular markers (Leica Microsystems). For identify cancer stem cells mononuclear antibodies CD24 (BIOCARE MEDICAL, CD44 - Clone 156-3C11; CD24 - Clone SN3b) were used. Association of CD44/CD24 expression in different subtypes of cells, between clinicopathological parameters and different biological characteristics were performed by Pearson correlation and usind X2 tests. Obtained quantitative statistical analyses were performed by using SPSS V.19.0 program. Statistically significant were considered 95% of confidence interval. The data shows, that towards G1-G3, amount of CD44 positive cases increased twice. CD44 positive cases are evenly distributed between Luminal A, Luminal B, HER2+, triple negative basal like cell subtypes and in significantly less (4,8 times) in Her2+ cases. Maximum amount of CD44 negative cases is shown in Luminal A subtype, which could be possible cause of better prognosis and high sensitivity for chemotherapy. For one's part such aggressive subtypes of breast cancer as Luminal B and basal like cell type, are characterized by CD44 positive and antigen high expression, which can be reason of aggressive nature of this types and also reason of chemotherapy resistance. 
As well as amount of CD24 positive cases according to malignancy degree, also antigen expression features does not show any type of correlation between malignancy degree and CD24 positivity or with CD24 expression features, or presence of stem cells. That can be the reason of tumor aggressivity and chemoresistance. exceptions are Her2 positive tumors because they have different base of carcinogenesis."""

doc = nlp(text)
options = get_entity_options()
displacy.render(doc, style='ent', options=options)

Here get_entity_options() is a function I wrote to build the color options, shown below (everybody, feel free to use it if you find it useful):

import random 

def get_entity_options(random_colors=False):
    """
    generating color options for visualizing the named entities
    """
    def color_generator(number_of_colors):
        color = ["#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
                 for i in range(number_of_colors)]
        return color

    entities = ["GGP", "SO", "TAXON", "CHEBI", "GO", "CL", 
                "DNA", "CELL_TYPE", "CELL_LINE", "RNA", "PROTEIN",
                "DISEASE", "CHEMICAL",
                "CANCER", "ORGAN", "TISSUE", "ORGANISM", "CELL", "AMINO_ACID", "GENE_OR_GENE_PRODUCT", "SIMPLE_CHEMICAL", "ANATOMICAL_SYSTEM", "IMMATERIAL_ANATOMICAL_ENTITY", "MULTI-TISSUE_STRUCTURE", "DEVELOPING_ANATOMICAL_STRUCTURE", "ORGANISM_SUBDIVISION", "CELLULAR_COMPONENT"]

    colors = {"ENT":"#E8DAEF"}

    if random_colors:
        color = color_generator(len(entities))
        for i in range(len(entities)):
            colors[entities[i]] = color[i]
    else:
        entities_cat_1 = {"GGP":"#F9E79F", "SO":"#F7DC6F", "TAXON":"#F4D03F", "CHEBI":"#FAD7A0", "GO":"#F8C471", "CL":"#F5B041"}
        entities_cat_2 = {"DNA":"#82E0AA", "CELL_TYPE":"#AED6F1", "CELL_LINE":"#E8DAEF", "RNA":"#82E0AA", "PROTEIN":"#82E0AA"}
        entities_cat_3 = {"DISEASE":"#D7BDE2", "CHEMICAL":"#D2B4DE"}
        entities_cat_4 = {"CANCER":"#ABEBC6", "ORGAN":"#82E0AA", "TISSUE":"#A9DFBF", "ORGANISM":"#A2D9CE", "CELL":"#76D7C4", "AMINO_ACID":"#85C1E9", "GENE_OR_GENE_PRODUCT":"#AED6F1", "SIMPLE_CHEMICAL":"#76D7C4", "ANATOMICAL_SYSTEM":"#82E0AA", "IMMATERIAL_ANATOMICAL_ENTITY":"#A2D9CE", "MULTI-TISSUE_STRUCTURE":"#85C1E9", "DEVELOPING_ANATOMICAL_STRUCTURE":"#A9DFBF", "ORGANISM_SUBDIVISION":"#58D68D", "CELLULAR_COMPONENT":"#7FB3D5"}

        entities_cats = [entities_cat_1, entities_cat_2, entities_cat_3, entities_cat_4]
        for item in entities_cats:
            colors = {**colors, **item}

    options = {"ents": entities, "colors": colors}

    return options

Using the full model, I can't see any visualization, but when I switch to a specific NER model I do see the visualization.
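
A possible explanation, offered as an assumption rather than a confirmed diagnosis: spaCy's visualizer documentation describes ents as the list of entity types to highlight, so an entity whose label is missing from that list is not highlighted even if a colour is defined for it. The sketch below builds ents from the labels actually present in the doc, so nothing is filtered out by accident. options_for_doc, the default colour, and the stand-in doc with its label are all hypothetical.

```python
from types import SimpleNamespace


def options_for_doc(doc, colors, default_color="#DDDDDD"):
    """Build displaCy options covering every label present in doc.ents."""
    labels = sorted({ent.label_ for ent in doc.ents})
    # Keep known colors and fall back to a default for unknown labels.
    used_colors = {label: colors.get(label, default_color) for label in labels}
    return {"ents": labels, "colors": used_colors}


# Stand-in for a processed Doc; the label name is hypothetical.
fake_doc = SimpleNamespace(ents=[SimpleNamespace(label_="SOME_LABEL")])

opts = options_for_doc(fake_doc, {"DISEASE": "#D7BDE2"})
print(opts)
```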

victoriastuart commented 4 years ago

@phosseini : very cool, thank you! I added your code (for my own use / tests) as a method, giving the following results! :-)


entity_options.py

## Source: https://github.com/allenai/scispacy/issues/141#issuecomment-518274586
## Author: https://github.com/phosseini
##   File: /mnt/Vancouver/apps/spacy/entity_options.py
##    Env: Python 3.7 venv:
##    Use:
##          import entity_options
##          from entity_options import get_entity_options
##          displacy.serve(doc, style="ent", options=get_entity_options(random_colors=True))
##    Ent: https://github.com/allenai/scispacy/issues/79#issuecomment-557766506 ## CRAFT entities

import random 

def get_entity_options(random_colors=False):
    """ generating color options for visualizing the named entities """

    def color_generator(number_of_colors):
        color = ["#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)]) for i in range(number_of_colors)]
        return color

    entities = ["GGP", "SO", "TAXON", "CHEBI", "GO", "CL", "DNA", "CELL_TYPE", "CELL_LINE", "RNA", "PROTEIN",
                "DISEASE", "CHEMICAL", "CANCER", "ORGAN", "TISSUE", "ORGANISM", "CELL", "AMINO_ACID",
                "GENE_OR_GENE_PRODUCT", "SIMPLE_CHEMICAL", "ANATOMICAL_SYSTEM", "IMMATERIAL_ANATOMICAL_ENTITY",
                "MULTI-TISSUE_STRUCTURE", "DEVELOPING_ANATOMICAL_STRUCTURE", "ORGANISM_SUBDIVISION", "CELLULAR_COMPONENT"]

    colors = {"ENT":"#E8DAEF"}

    if random_colors:
        color = color_generator(len(entities))
        for i in range(len(entities)):
            colors[entities[i]] = color[i]
    else:
        entities_cat_1 = {"GGP":"#F9E79F", "SO":"#F7DC6F", "TAXON":"#F4D03F", "CHEBI":"#FAD7A0", "GO":"#F8C471", "CL":"#F5B041"}
        entities_cat_2 = {"DNA":"#82E0AA", "CELL_TYPE":"#AED6F1", "CELL_LINE":"#E8DAEF", "RNA":"#82E0AA", "PROTEIN":"#82E0AA"}
        entities_cat_3 = {"DISEASE":"#D7BDE2", "CHEMICAL":"#D2B4DE"}
        entities_cat_4 = {"CANCER":"#ABEBC6", "ORGAN":"#82E0AA", "TISSUE":"#A9DFBF", "ORGANISM":"#A2D9CE", "CELL":"#76D7C4",
                          "AMINO_ACID":"#85C1E9", "GENE_OR_GENE_PRODUCT":"#AED6F1", "SIMPLE_CHEMICAL":"#76D7C4", "ANATOMICAL_SYSTEM":"#82E0AA",
                          "IMMATERIAL_ANATOMICAL_ENTITY":"#A2D9CE", "MULTI-TISSUE_STRUCTURE":"#85C1E9", "DEVELOPING_ANATOMICAL_STRUCTURE":"#A9DFBF",
                          "ORGANISM_SUBDIVISION":"#58D68D", "CELLULAR_COMPONENT":"#7FB3D5"}

        entities_cats = [entities_cat_1, entities_cat_2, entities_cat_3, entities_cat_4]
        for item in entities_cats:
            colors = {**colors, **item}

    options = {"ents": entities, "colors": colors}
    # print(options)
    return options

Python 3.7 venv

(py3.7) [victoria@victoria spacy]$ date; pwd; ls -l

  Tue 26 Nov 2019 01:45:28 PM PST
  /mnt/Vancouver/apps/spacy
  total 20
  -rw-r--r-- 1 victoria victoria 2287 Nov 26 13:42 entity_options.py
  drwxr-xr-x 2 victoria victoria 4096 Nov 26 13:34 __pycache__
  -rw------- 1 victoria victoria 3560 Nov 26 12:02 readme-victoria-spacy.txt
  drwxr-xr-x 3 victoria victoria 4096 Nov 19 11:41 scispacy
  -rw-r--r-- 1 victoria victoria 2624 Nov 26 11:59 spacy_srl.py

(py3.7) [victoria@victoria spacy]$ pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_craft_md-0.2.4.tar.gz
  ...
  Successfully installed blis-0.4.1 catalogue-0.0.8 en-ner-craft-md-0.2.4 preshed-3.0.2 spacy-2.2.3 thinc-7.3.1

(py3.7) [victoria@victoria spacy]$ env | grep -i virtual
VIRTUAL_ENV=/home/victoria/venv/py3.7

(py3.7) [victoria@victoria spacy]$ python --version
Python 3.7.4

(py3.7) [victoria@victoria spacy]$ python
  Python 3.7.4 (default, Nov 20 2019, 11:36:53) 
  [GCC 9.2.0] on linux
  Type "help", "copyright", "credits" or "license" for more information.

>>> import spacy
>>> from spacy import displacy

>>> text = "26902145. Breast cancer susceptibility gene 1 (BRCA1) is a tumor suppressor protein that functions to maintain genomic stability through critical roles in DNA repair, cell-cycle arrest, and transcriptional control. The androgen receptor (AR) is expressed in more than 70% of breast cancers and has been implicated in breast cancer pathogenesis. However, little is known about the role of BRCA1 in AR-mediated cell proliferation in human breast cancer. Here, we report that a high expression of AR in breast cancer patients was associated with shorter overall survival (OS) using a tissue microarray with 149 non-metastatic breast cancer patient samples. We reveal that overexpression of BRCA1 significantly inhibited expression of AR through activation of SIRT1 in breast cancer cells. Meanwhile, SIRT1 induction or treatment with a SIRT1 agonist, resveratrol, inhibits AR-stimulated proliferation. Importantly, this mechanism is manifested in breast cancer patient samples and TCGA database, which showed that low SIRT1 gene expression in tumor tissues compared with normal adjacent tissues predicts poor prognosis in patients with breast cancer. Taken together, our findings suggest that BRCA1 attenuates AR-stimulated proliferation of breast cancer cells via SIRT1 mediated pathway. | 30714292. Breast cancer susceptibility gene 1 (BRCA1) has been implicated in modulating metabolism via transcriptional regulation. However, direct metabolic targets of BRCA1 and the underlying regulatory mechanisms are still unknown. Here, we identified several metabolic genes, including the gene which encodes glutamate‐oxaloacetate transaminase 2 (GOT2), a key enzyme for aspartate biosynthesis, which are repressed by BRCA1. We report that BRCA1 forms a co‐repressor complex with ZBRK1 that coordinately represses GOT 2 expression via a ZBRK1 recognition element in the promoter of GOT2. 
Impairment of this complex results in upregulation of GOT2, which in turn increases aspartate and alpha ketoglutarate production, leading to rapid cell proliferation of breast cancer cells. Importantly, we found that GOT2 can serve as an independent prognostic factor for overall survival and disease‐free survival of patients with breast cancer, especially triple‐negative breast cancer. Interestingly, we also demonstrated that GOT2 overexpression sensitized breast cancer cells to methotrexate, suggesting a promising precision therapeutic strategy for breast cancer treatment. In summary, our findings reveal that BRCA1 modulates aspartate biosynthesis through transcriptional repression of GOT2, and provides a biological basis for treatment choices in breast cancer. | BRCA1/2. BRCA1 and BRCA2 (BRCA1/2) are human genes that produce tumor suppressor proteins."

>>> nlp = spacy.load("en_ner_craft_md")
>>> doc = nlp(text)

>>> import entity_options
>>> from entity_options import get_entity_options

>>> get_entity_options()                               ## default: (random_colors=False)
{'colors': {'AMINO_ACID': '#85C1E9',
            'ANATOMICAL_SYSTEM': '#82E0AA',
            'CANCER': '#ABEBC6',
            'CELL': '#76D7C4',
            'CELLULAR_COMPONENT': '#7FB3D5',
            'CELL_LINE': '#E8DAEF',
            'CELL_TYPE': '#AED6F1',
            'CHEBI': '#FAD7A0',
            'CHEMICAL': '#D2B4DE',
            'CL': '#F5B041',
            'DEVELOPING_ANATOMICAL_STRUCTURE': '#A9DFBF',
            'DISEASE': '#D7BDE2',
            'DNA': '#82E0AA',
            'ENT': '#E8DAEF',
            'GENE_OR_GENE_PRODUCT': '#AED6F1',
            'GGP': '#F9E79F',
            'GO': '#F8C471',
            'IMMATERIAL_ANATOMICAL_ENTITY': '#A2D9CE',
            'MULTI-TISSUE_STRUCTURE': '#85C1E9',
            'ORGAN': '#82E0AA',
            'ORGANISM': '#A2D9CE',
            'ORGANISM_SUBDIVISION': '#58D68D',
            'PROTEIN': '#82E0AA',
            'RNA': '#82E0AA',
            'SIMPLE_CHEMICAL': '#76D7C4',
            'SO': '#F7DC6F',
            'TAXON': '#F4D03F',
            'TISSUE': '#A9DFBF'},
 'ents': ['GGP',
          'SO',
          'TAXON',
          'CHEBI',
          'GO',
          'CL',
          'DNA',
          'CELL_TYPE',
          'CELL_LINE',
          'RNA',
          'PROTEIN',
          'DISEASE',
          'CHEMICAL',
          'CANCER',
          'ORGAN',
          'TISSUE',
          'ORGANISM',
          'CELL',
          'AMINO_ACID',
          'GENE_OR_GENE_PRODUCT',
          'SIMPLE_CHEMICAL',
          'ANATOMICAL_SYSTEM',
          'IMMATERIAL_ANATOMICAL_ENTITY',
          'MULTI-TISSUE_STRUCTURE',
          'DEVELOPING_ANATOMICAL_STRUCTURE',
          'ORGANISM_SUBDIVISION',
          'CELLULAR_COMPONENT']}

>>> get_entity_options(random_colors=True)
{'colors': {'AMINO_ACID': '#30CBF7',
            'ANATOMICAL_SYSTEM': '#6DF980',
            'CANCER': '#1AE0F9',
            'CELL': '#5813C7',
            'CELLULAR_COMPONENT': '#0D350E',
            'CELL_LINE': '#1AA436',
            'CELL_TYPE': '#F837CC',
            'CHEBI': '#54B69E',
            'CHEMICAL': '#BADCA1',
            'CL': '#D845FB',
            'DEVELOPING_ANATOMICAL_STRUCTURE': '#0D9CB4',
            'DISEASE': '#78A2E5',
            'DNA': '#CAD406',
            'ENT': '#E8DAEF',
            'GENE_OR_GENE_PRODUCT': '#EC2144',
            'GGP': '#A6AA7D',
            'GO': '#8312F0',
            'IMMATERIAL_ANATOMICAL_ENTITY': '#F7E433',
            'MULTI-TISSUE_STRUCTURE': '#221891',
            'ORGAN': '#786BC0',
            'ORGANISM': '#43534C',
            'ORGANISM_SUBDIVISION': '#B6F342',
            'PROTEIN': '#4454D9',
            'RNA': '#64C158',
            'SIMPLE_CHEMICAL': '#F8616A',
            'SO': '#344E4D',
            'TAXON': '#63B69D',
            'TISSUE': '#0DE67C'},
 'ents': ['GGP',
          'SO',
          'TAXON',
          'CHEBI',
          'GO',
          'CL',
          'DNA',
          'CELL_TYPE',
          'CELL_LINE',
          'RNA',
          'PROTEIN',
          'DISEASE',
          'CHEMICAL',
          'CANCER',
          'ORGAN',
          'TISSUE',
          'ORGANISM',
          'CELL',
          'AMINO_ACID',
          'GENE_OR_GENE_PRODUCT',
          'SIMPLE_CHEMICAL',
          'ANATOMICAL_SYSTEM',
          'IMMATERIAL_ANATOMICAL_ENTITY',
          'MULTI-TISSUE_STRUCTURE',
          'DEVELOPING_ANATOMICAL_STRUCTURE',
          'ORGANISM_SUBDIVISION',
          'CELLULAR_COMPONENT']}

## default: get_entity_options(random_colors=False)
## displacy.serve(doc, style="ent", options=get_entity_options())

>>> displacy.serve(doc, style="ent", options=get_entity_options(random_colors=True))

  Using the 'ent' visualizer
  Serving on http://0.0.0.0:5000 ...
  127.0.0.1 -- -- [26/Nov/2019 13:43:47] "GET / HTTP/1.1" 200 20529

Screenshots

random_colors=False:

[Screenshot: spacy_tagged_text_browser-2019-11-26c]

random_colors=True:

[Screenshot: spacy_tagged_text_browser-2019-11-26d]

phosseini commented 4 years ago

@victoriastuart Nice comparison! I tried to define the colors so that different categories are distinguishable and entities in the same category fall in the same color range. It can definitely be improved, but I think it's better than random colors :)
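
The idea of keeping each category in its own colour range can also be generated rather than hand-picked. A sketch using the standard-library colorsys module: each category gets an evenly spaced hue, and labels within a category differ only in lightness. category_palette, the lightness range, and the example grouping are hypothetical choices, not the palette used above.

```python
import colorsys


def category_palette(categories):
    """Map labels to hex colors: one hue per category, varied lightness within it."""
    colors = {}
    for cat_index, labels in enumerate(categories):
        hue = cat_index / len(categories)
        for label_index, label in enumerate(labels):
            # Spread lightness between 0.70 and 0.85 to keep backgrounds pale.
            step = label_index / max(len(labels) - 1, 1)
            lightness = 0.70 + 0.15 * step
            r, g, b = colorsys.hls_to_rgb(hue, lightness, 0.6)
            colors[label] = "#{:02X}{:02X}{:02X}".format(
                round(r * 255), round(g * 255), round(b * 255)
            )
    return colors


palette = category_palette([["DISEASE", "CHEMICAL"], ["DNA", "RNA", "PROTEIN"]])
print(palette)
```

The returned dict can be used directly as the colors entry of the options.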

victoriastuart commented 4 years ago

Again, very appreciative of your code: so much fun to try and also quite useful I think! :-)

kirandiquery commented 3 years ago

How can I save this result into a data frame?
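
One way to do this, sketched under the assumption that "this result" means the entities in doc.ents: collect one row per entity and hand the rows to pandas. ents_to_rows is a hypothetical helper; text, label_, start_char, and end_char are standard spaCy span attributes. The stand-in doc and its labels only keep the snippet runnable without a model installed.

```python
from types import SimpleNamespace


def ents_to_rows(doc):
    """Return one dict per entity, suitable for pandas.DataFrame(rows)."""
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
        }
        for ent in doc.ents
    ]


# Stand-in for a processed spaCy Doc (hypothetical entities).
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="BRCA1", label_="LABEL_A", start_char=0, end_char=5),
    SimpleNamespace(text="SIRT1", label_="LABEL_A", start_char=20, end_char=25),
])

rows = ents_to_rows(fake_doc)
print(rows)
```

With pandas installed and a real model loaded, this becomes df = pandas.DataFrame(ents_to_rows(nlp(text))), which can then be saved with df.to_csv(...).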