clulab / reach

Reach Biomedical Information Extraction

Grounded simple-chemical: Accessing International Chemical Identifier (InChIKey) #579

Open jvwong opened 6 years ago

jvwong commented 6 years ago

Background: We've recently been in contact regarding getting a local instance of REACH up and running to process full-text articles for our project Factoid (see #551). We are using REACH grounding information for simple chemicals. Currently, information from PubChem (#167) is returned, but we are interested in retrieving records from other small-molecule databases, namely ChEBI.

Issue: Is REACH able to expose the International Chemical Identifier hash (InChIKey) for simple chemicals, so that each grounded entity can be looked up directly and unambiguously in other databases?

hickst commented 6 years ago

Grounding via ChEBI is available in Reach. We had to stop using it to meet previous contractual dictates, but it is available if you compile your own version of the Bioresources project. Note: to do so, you will also need to compile the Processors library first, as the code to modify Bioresources is contained in Processors (you do not need to use this version of Processors at runtime; you just need it to recompile Bioresources).

To compile your own version of Bioresources:

1) Edit the ner_kb.config file:
   a) Find the chebi entry, swap it with the PubChem entry, uncomment chebi, and comment out PubChem.
   b) Similarly, move and uncomment hmdb if you would like to supplement the chebi lookup.

2) Next, run the script to regenerate the lexicon files:
   a) Ensure that you have a compiled version of Processors available at the same level as the Bioresources project (i.e. "sister" directories), as the script is hardwired to use this structure (our apologies for that, but the script was really created for our private, internal, infrequent use).
   b) From Bioresources, run ner_kb.sh. The script takes a few minutes to regenerate the files but should not generate any errors.

3) Compile the Bioresources JAR file, in this example to your local repository:
   a) sbt clean publishLocal
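For readers who want to script steps 2 and 3, here is a minimal sketch, not an official build script. It assumes the two checkouts are sister directories named `processors` and `bioresources` (hypothetical names), that `sbt compile` is enough to produce the "compiled version of Processors" mentioned above, and that `ner_kb.sh` is executable. The `rebuild_bioresources` function, the `run` helper, and the `DRY_RUN` variable are invented for illustration. Step 1 (editing ner_kb.config) still has to be done by hand first.

```shell
#!/bin/sh
# Sketch only: directory names and the Processors build task are assumptions.

# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: rebuild_bioresources PARENT_DIR
# PARENT_DIR is assumed to contain both processors/ and bioresources/.
# The body runs in a subshell so the caller's working directory is untouched.
rebuild_bioresources() (
    parent="${1:-.}"

    # Step 2a: have a compiled Processors next to Bioresources.
    # `sbt compile` is an assumption; the thread does not name the task.
    run cd "$parent/processors" &&
    run sbt compile &&

    # Step 2b: regenerate the lexicon files from the Bioresources directory.
    run cd ../bioresources &&
    run ./ner_kb.sh &&

    # Step 3: build the JAR and publish it to the local repository.
    run sbt clean publishLocal
)
```

After editing ner_kb.config, this could be invoked as `rebuild_bioresources /path/to/parent`; setting `DRY_RUN=1` beforehand prints the commands without running them.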

I am attaching an example ner_kb.config file, modified to use ChEBI (note that I've also enabled HMDB as a supplementary KB. If this is not desired, just comment the hmdb line out).

Because of annoying GitHub limitations there is an extra .txt extension on this file name: ner_kb.config.txt

maxkfranz commented 6 years ago

It's great to know that you support ChEBI, but regardless of the underlying system, do you support returning InChIKeys?

hickst commented 6 years ago

Sorry, no.