Add local grounding services

maxkfranz commented 5 years ago

Description

Q: What is the name of the feature?

A: Local grounding services

Q: What does this feature enable the user to do?

A: Ground proteins, genes, chemicals, GO terms

Q: What information must the user provide to use the feature?

A: The name or synonym of the entity

Q: What are the applicable constraints, e.g. compatibility or performance?

A: Must perform at least as quickly as remote services. Ideally, all services should return in less than one second.

Specification

API

[ ] Use the same API as the current element-association services. Each service exports { namespace, search, get, distanceFields }.
[ ] Implement local grounding services (i.e. no external requests) for each of
- [ ] Uniprot
- [ ] PubChem
- [ ] Pfam
- [ ] GO
[ ] Swap out and/or add the new local services to the list of services used by the aggregate search
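
As a rough illustration of the checklist above, a local service with the `{ namespace, search, get, distanceFields }` shape could look like the following sketch. The thread does not spell out the argument or return types, so everything here is an assumption: `search` takes a string and resolves to matching entries, `get` takes an id, and `distanceFields` lists the string fields compared against the search text. The in-memory `ENTRIES` array, its ids, and `aggregateSearch` are hypothetical stand-ins for the real db-backed data and the existing aggregate search.

```javascript
// Hypothetical in-memory data; a real local service would query the local db.
const ENTRIES = [
  { id: 'example-1', name: 'TP53', synonyms: ['P53'] },
  { id: 'example-2', name: 'MDM2', synonyms: ['HDM2'] }
];

const namespace = 'uniprot';

// Assumption: the string fields that are compared against the search text.
const distanceFields = ['name', 'synonyms'];

// Case-insensitive "contains" match over every distance field.
const matches = (entry, searchText) => {
  const needle = searchText.toUpperCase();
  return distanceFields.some(field =>
    [].concat(entry[field]).some(value => value.toUpperCase().includes(needle))
  );
};

// No external requests: both functions only touch local data.
const search = async searchText => ENTRIES.filter(entry => matches(entry, searchText));
const get = async id => ENTRIES.find(entry => entry.id === id) || null;

const uniprotLocal = { namespace, search, get, distanceFields };

// Hypothetical aggregate search: query every service and tag results by namespace.
const aggregateSearch = async (services, searchText) => {
  const perService = await Promise.all(services.map(async service => {
    const entries = await service.search(searchText);
    return entries.map(entry => ({ ...entry, namespace: service.namespace }));
  }));
  return perService.flat();
};
```

Because every service exposes the same four members, swapping a remote service for a local one would only mean changing the array passed to the aggregate search.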

Background

PubChem and Chebi are unreliable.
Pfam doesn't have a usable service.
Overall we can't trust that external services will be stable enough.

maxkfranz commented 5 years ago

Supersedes #228 #180

metincansiper commented 5 years ago

@maxkfranz I made a commit, c973c2aec776b7b7e80486e362ec8b2cb79b0361. Would you review it? The commit only handles uniprot (both writing to the db and querying from it). For writing to the db, it includes a Node.js script that expects an XML file containing the uniprot data to be already available. Later I can add a bash script that downloads the XML file and calls the Node.js script, if that looks fine.

Edit: I have a few questions in mind, but it would be better if I could ask them in the developers' meeting tomorrow.

metincansiper commented 5 years ago

I forgot to add one of the files (the script that writes the uniprot data to the db) to the previous commit. I added it in 42dcfba2190a042d1113851cbb4f800453f143dc.

maxkfranz commented 5 years ago

Step 1:

[x] Automate the population of the db with the uniprot data on server start (npm start or npm run watch)
- [x] Init script made into some reusable function
- [x] In the main server index file (express etc.), express should not be started until the init script is done
- [x] The init script should be lazy -- if the table exists in the db, then skip the script
- [x] npm scripts -- none of these should need to start the factoid server
- [x] npm run grounding:update -- manually update the grounding table
- [x] npm run grounding:delete -- manually delete grounding table
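
The lazy-init behaviour in Step 1 could be sketched roughly as follows. This is not the code from the commits; the `db` object with `tableExists`, `createTable`, and `insert` is a hypothetical wrapper around the real database, `loadEntries` stands in for whatever parses the uniprot XML file, and `listen` stands in for starting express.

```javascript
// Reusable init function: populate the grounding table unless it already exists.
// `db` is a hypothetical wrapper exposing tableExists, createTable and insert.
async function initGrounding(db, loadEntries, { force = false } = {}) {
  const exists = await db.tableExists('uniprot');

  // Lazy: skip the expensive load when the table is already in the db.
  if (exists && !force) {
    return { skipped: true };
  }

  if (!exists) {
    await db.createTable('uniprot');
  }

  const entries = await loadEntries();
  await db.insert('uniprot', entries);

  return { skipped: false, count: entries.length };
}

// The server must not start listening until the init function has finished.
async function startServer(db, loadEntries, listen) {
  await initGrounding(db, loadEntries);
  return listen();
}
```

Under the same assumptions, a manual update script could call `initGrounding` directly with `{ force: true }`, so it would not need to start the factoid server.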

Step 2:

[ ] Optimise
- [ ] Build indices for each of the string fields that get searched. See "Using secondary indexes in RethinkDB": https://rethinkdb.com/docs/secondary-indexes/javascript/
- [ ] Optimise the query to use the index. It shouldn't have to use operators like branch, typeOf, ...

Step 3:

[ ] Enable other organisms -- just the ones in the Organism enum
[ ] Dump the processed table so it can be cached: https://rethinkdb.com/docs/backup/
[x] Test and compare the results to the current live service

maxkfranz commented 5 years ago

@metincansiper Jeff and I have reviewed the service by trying it out locally with the uniprot.xml file. It looks like it's giving great results. The performance could be improved a bit though, and we've updated the checklist with an "Optimise" section. Let us know if you have any questions.

maxkfranz commented 5 years ago

TODO investigate whether git LFS and/or Github releases would be a good fit for storing the database dumps:

https://git-lfs.github.com

https://help.github.com/articles/distributing-large-binaries/

metincansiper commented 5 years ago

@maxkfranz (About the optimization step) I can see that secondary indexes could be very useful for speeding up the queries if we expected them to find exact text matches. However, the current queries look for strings that contain the search text anywhere in the string fields, in a case-insensitive way (I mean if the search text is "tp" then "TP53" counts as a match).

I could not find any way of utilizing rethinkdb secondary indexes while keeping the search queries functioning as they currently do. Do you have any way of achieving this in mind?

metincansiper commented 5 years ago

Actually, creating multi indexes that contain an array of all possible string combinations (for "abc" it would be ["a", "b", "c", "ab", "ac", "abc", "A", "B", "C", "Ab", "aB", "AB", ...]) would work. (For fields like "gene names", where we have string arrays, the arrays created for each string would also have to be combined.) However, I suppose it may not be good in practice.

maxkfranz commented 5 years ago

@metincansiper See "Indexes on arbitrary ReQL expressions". That might be the easiest option.

It requires that you only use the rdb API for the expression though -- not general JS functions.
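
For reference, an index on a ReQL expression might be sketched as below. This only covers a case-insensitive exact match on "name", not the substring matching discussed above. The index name 'nameUpper' is hypothetical, and `r` is the RethinkDB driver passed in by the caller.

```javascript
// Sketch only. `r` is the RethinkDB driver (e.g. require('rethinkdb')), passed in
// by the caller. The index name 'nameUpper' is hypothetical.

// Build the term that creates an index on the upper-cased "name" field.
// The index expression uses only ReQL (`upcase`), not a general JS function.
function upperCaseNameIndex(r) {
  return r.table('uniprot').indexCreate('nameUpper', r.row('name').upcase());
}

// Build the term for a case-insensitive exact match on "name".
function exactNameQuery(r, searchText) {
  return r.table('uniprot').getAll(searchText.toUpperCase(), { index: 'nameUpper' });
}

// Usage (requires a running RethinkDB and an open connection `conn`):
//   await upperCaseNameIndex(r).run(conn);
//   await r.table('uniprot').indexWait('nameUpper').run(conn);
//   const cursor = await exactNameQuery(r, 'tp53').run(conn);
//   const entries = await cursor.toArray();
```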

metincansiper commented 5 years ago

@maxkfranz I think indexes based on arbitrary expressions would be needed anyway. Did you mean using indexes on arbitrary expressions to create something like what I proposed above (for "abc", indexing to ["a", "b", "c", "ab", "ac", "abc", "A", "B", "C", "Ab", "aB", "AB", ...])?

Edit: As I mentioned I am not sure how efficient it would be in practice. If you think it looks okay I can implement it.

maxkfranz commented 5 years ago

I don't think you'll have to mutate the existing string values, but you'll just have to include all of the string fields in each protein entry.

metincansiper commented 5 years ago

@maxkfranz Actually, I am not sure we are on the same page. What I mean is not modifying the existing string values but creating indexes based on substrings of the actual strings. As an example:

// create a secondary index for the "name" field here
r.table('uniprot').indexCreate("subnames", entry => {
    // return an array of all substrings here (use uppercase only, so the search text
    // must be uppercased in the get query too). E.g. if entry('name') is "abc",
    // return ["A", "B", "C", "AB", "BC", "ABC"]
}, {multi: true})

// make a search query for the "name" field here
r.table('uniprot').getAll( searchText.toUpperCase(), {index: "subnames"} )

Do you mean even returning the substrings array, as above, is not needed? If so how can I make the search query without using these substrings in return value of my index function (For example it will need to catch "abc" when "bc" is searched)?
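
To make the substring-index idea concrete without a database, here is a small plain-JavaScript simulation. A `Map` stands in for the multi index, and all function names are hypothetical. It shows what the index function would have to return and how a lookup for "bc" would find "abc".

```javascript
// All distinct contiguous substrings of `text`, upper-cased.
function upperCaseSubstrings(text) {
  const upper = text.toUpperCase();
  const substrings = new Set();
  for (let start = 0; start < upper.length; start++) {
    for (let end = start + 1; end <= upper.length; end++) {
      substrings.add(upper.slice(start, end));
    }
  }
  return [...substrings];
}

// Simulated multi index: each substring key maps to the entries containing it.
function buildSubstringIndex(entries) {
  const index = new Map();
  for (const entry of entries) {
    for (const key of upperCaseSubstrings(entry.name)) {
      if (!index.has(key)) {
        index.set(key, []);
      }
      index.get(key).push(entry);
    }
  }
  return index;
}

// Exact lookup on the upper-cased search text, like getAll on the index.
function searchBySubstring(index, searchText) {
  return index.get(searchText.toUpperCase()) || [];
}

const index = buildSubstringIndex([{ name: 'abc' }, { name: 'TP53' }]);
console.log(searchBySubstring(index, 'bc')); // the "abc" entry
console.log(searchBySubstring(index, 'tp')); // the "TP53" entry
```

A name of n characters yields up to n(n+1)/2 keys (6 for "abc", up to 465 for a 30-character name), which is the cost behind the concern that this may not work well in practice.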

BTW I made a commit for the "Step 1".

maxkfranz commented 5 years ago

https://www.elastic.co/products/elasticsearch

https://www.npmjs.com/package/elasticsearch

maxkfranz commented 5 years ago

The separate project has been set up: https://github.com/PathwayCommons/grounding-search
