In any business, Word documents are a common occurrence. They contain information in the form of raw text, tables, and images, all of which hold important facts. The data used in this code pattern comes from two Wikipedia articles: the first is taken from the Wikipedia page of the oncologist Suresh H. Advani, and the second is from the Wikipedia page about Oncology. These files are zipped up as archive.zip.
In the figure below, there is textual information about the oncologist Suresh H. Advani in a Word document. The table lists the awards he has received from various organisations.
In this code pattern, we address the problem of extracting knowledge from the text and tables in Word documents. A knowledge graph is built from the extracted knowledge, making that knowledge queryable.
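Concretely, "queryable" can be as simple as storing extracted facts as (subject, relation, object) triples and filtering them. The sketch below illustrates the idea; the triples are hand-written examples, not actual output of the notebook.

```python
def query(triples, subject=None, relation=None, obj=None):
    """Return every triple matching the fields that are not None."""
    return [
        (s, r, o)
        for (s, r, o) in triples
        if (subject is None or s == subject)
        and (relation is None or r == relation)
        and (obj is None or o == obj)
    ]

# Illustrative facts of the kind the pattern extracts from the documents.
triples = [
    ("Suresh H. Advani", "profession", "oncologist"),
    ("Suresh H. Advani", "awarded", "Padma Shri"),
    ("Oncology", "is_a", "branch of medicine"),
]

# "Which awards has Suresh H. Advani received?"
print(query(triples, subject="Suresh H. Advani", relation="awarded"))
```

A real knowledge graph adds indexing and a richer query language, but the triple-pattern query above is the core operation.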
Some of the challenges in extracting knowledge from word documents are:
This pattern uses the following methodology to overcome the challenges: the Python package mammoth is used to convert .docx files to HTML (a semi-structured format), and knowledge is extracted from the documents using the best of both worlds, a combined training- and rules-based approach.
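As a sketch of the conversion step: mammoth's convert_to_html call (shown in the comment) produces HTML in which tables survive as markup, so table cells and free-floating paragraphs can be separated by an ordinary HTML parser. The sample HTML string below is a hand-written stand-in for mammoth's output, and the filename in the comment is hypothetical.

```python
from html.parser import HTMLParser

# The actual .docx -> HTML step with the mammoth package looks roughly like:
#   with open("document.docx", "rb") as docx:   # hypothetical filename
#       html = mammoth.convert_to_html(docx).value
# SAMPLE_HTML stands in for that output so this sketch runs standalone.
SAMPLE_HTML = (
    "<p>Suresh H. Advani is an Indian oncologist.</p>"
    "<table><tr><td>1997</td><td>Padma Shri</td></tr></table>"
)

class TextAndTableExtractor(HTMLParser):
    """Collect free-floating paragraph text and table cell text separately."""
    def __init__(self):
        super().__init__()
        self.paragraphs, self.cells = [], []
        self._target = None  # list currently receiving text, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._target = self.paragraphs
        elif tag == "td":
            self._target = self.cells

    def handle_endtag(self, tag):
        if tag in ("p", "td"):
            self._target = None

    def handle_data(self, data):
        if self._target is not None and data.strip():
            self._target.append(data.strip())

parser = TextAndTableExtractor()
parser.feed(SAMPLE_HTML)
print(parser.paragraphs)  # ['Suresh H. Advani is an Indian oncologist.']
print(parser.cells)       # ['1997', 'Padma Shri']
```

Both streams can then be fed to the same downstream entity and relation extraction, which is what makes the semi-structured HTML form useful.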
In this pattern, we will demonstrate:
What makes this Code Pattern valuable:
This code pattern is intended to help developers and data scientists give structure to unstructured data. This can significantly shape their analysis, and the data can be used for further processing to obtain better insights.
Follow these steps to set up and run this code pattern. The steps are described in detail below.
Create the following IBM Cloud service and name it wdc-NLU-service:
Log into IBM's Watson Studio. Once in, you'll land on the dashboard.
Create a new project by clicking + New project and choosing Data Science. Enter a name for the project and click Create.
NOTE: By creating a project in Watson Studio, a free tier Object Storage service and Watson Machine Learning service will be created in your IBM Cloud account. Select the Free storage type to avoid fees.
From the new project Overview panel, click + Add to project on the top right and choose the Notebook asset type.
Fill in the following information:

- Select the From URL tab.
- Enter a Name for the notebook and optionally a description.
- For Notebook URL, provide the following url: https://raw.githubusercontent.com/IBM/build-knowledge-base-with-domain-specific-documents/master/notebooks/knowledge_graph.ipynb
- For Runtime, select the Python 3.5 option.
- Click the Create button.
TIP: Once successfully imported, the notebook should appear in the Notebooks section of the Assets tab.
Use the menu pull-down Cell > Run All to run the notebook, or run the cells one at a time top-down using the play button. As the cells run, watch the output for results or errors. A running cell will have a label like In [*]. A completed cell will have a run sequence number instead of the asterisk.
This notebook uses the files in the data directory. We need to load these assets into our project.
From the new project Overview panel, click + Add to project on the top right and choose the Data asset type.
A panel will appear on the right of the screen to assist you in uploading data. Follow the numbered steps in the image below.
- Select the Load tab.
- Click the browse option. From your machine, browse to the location of the archive.zip, config_relations.txt, and config_classification.txt files in this repository, and upload them.
- The uploaded files will appear under the Files tab.

NOTE: It is possible to use your own data and configuration files. If you use a configuration file from your computer, make sure it conforms to the JSON structure given in data/config_classification.txt.
As we step through the notebook, the following happens:

- The configuration files (config_classification.txt and config_relations.txt) are loaded.
- The .docx files are converted to .html, so that the text in tables is analysed along with the free-floating text.
- The extracted text is classified using config_classification.txt, and the relationships are augmented using config_relations.txt.
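A minimal sketch of the rules-based side of that last step is shown below. The dicts stand in for the real configuration files (whose actual JSON structure is defined in data/config_classification.txt and data/config_relations.txt in this repository), and every name and rule here is illustrative.

```python
# Simplified stand-ins for the real config files; the actual structure
# lives in data/config_classification.txt and data/config_relations.txt.
CLASSIFICATION_RULES = {
    "Suresh H. Advani": "Person",
    "Padma Shri": "Award",
    "chemotherapy": "Treatment",
}
RELATION_RULES = [
    # (subject class, object class) -> relation name
    (("Person", "Award"), "awarded"),
    (("Person", "Treatment"), "practices"),
]

def classify(entity):
    """Assign a class to an extracted entity via the lookup rules."""
    return CLASSIFICATION_RULES.get(entity, "Unknown")

def augment(entity_a, entity_b):
    """Infer a relation between two entities from their classes, if any."""
    pair = (classify(entity_a), classify(entity_b))
    for classes, relation in RELATION_RULES:
        if pair == classes:
            return (entity_a, relation, entity_b)
    return None

print(augment("Suresh H. Advani", "Padma Shri"))
```

The notebook combines this kind of rule lookup with trained extraction, so rules handle the domain-specific vocabulary that a general model would miss.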
This code pattern is licensed under the Apache License, Version 2. Separate third-party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 and the Apache License, Version 2.