Closed: maxkfranz closed this 5 years ago
Supersedes #228 #180
@maxkfranz I made a commit, c973c2aec776b7b7e80486e362ec8b2cb79b0361. Would you review it? The commit only handles uniprot (both writing to the db and querying from it). For writing to the db, it includes a nodejs script that expects an xml file containing the uniprot data to be already available. If that looks fine, I can later add a bash script that downloads the xml file and calls the nodejs script.
Edit: I have a few questions in mind, but it would be better to ask them in the developers' meeting tomorrow.
I forgot to add one of the files (the script that writes uniprot data to the db) to the previous commit. I added it in 42dcfba2190a042d1113851cbb4f800453f143dc.
Step 1:
npm start or npm run watch
npm run grounding:update -- manually update the grounding table
npm run grounding:delete -- manually delete the grounding table
Step 2:
branch, typeOf
Step 3:
@metincansiper Jeff and I have reviewed the service by trying it out locally with the uniprot.xml file. It looks like it's giving great results. The performance could be improved a bit though, and we've updated the checklist with an "Optimise" section. Let us know if you have any questions.
TODO: investigate whether Git LFS and/or GitHub releases would be a good fit for storing the database dumps:
https://help.github.com/articles/distributing-large-binaries/
@maxkfranz (About the optimization step) I can see that secondary indexes could be very useful for speeding up the queries if we expected them to find exact text matches. However, the current queries look for strings that contain the search text anywhere in a string field, case-insensitively (I mean that if the search text is "tp", then "TP53" counts as a match).
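The matching behaviour described here can be sketched in plain JavaScript, independent of the database. The helper names `containsIgnoreCase` and `scanSearch` and the sample entries are hypothetical; the actual service runs its search as a RethinkDB query rather than as an in-memory filter.

```javascript
// Hypothetical helper: true if `text` occurs anywhere in `value`, ignoring case.
function containsIgnoreCase(value, text) {
  return value.toLowerCase().includes(text.toLowerCase());
}

// Hypothetical sample entries; real uniprot entries have more fields.
const entries = [
  { id: 'P1', name: 'TP53' },
  { id: 'P2', name: 'MDM2' }
];

// Every entry has to be inspected, which is why an exact-match
// secondary index cannot serve this kind of query directly.
function scanSearch(list, text) {
  return list.filter(entry => containsIgnoreCase(entry.name, text));
}

console.log(scanSearch(entries, 'tp').map(e => e.id)); // [ 'P1' ]
```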
I could not find any way of utilizing rethinkdb secondary indexes while keeping the search queries working as they currently do. Do you have a way of achieving this in mind?
Actually, creating multi indexes that contain an array of all possible string combinations (for "abc" it would be ["a", "b", "c", "ab", "ac", "abc", "A", "B", "C", "Ab", "aB", "AB", ...]) would work. (Also, for fields like "gene names", where we have string arrays, the arrays created for each string would have to be combined.) However, I suppose it may not be good in practice.
@metincansiper See "Indexes on arbitrary ReQL expressions". That might be the easiest option.
It requires that you only use the rdb API for the expression though -- not general JS functions.
@maxkfranz I think indexes based on arbitrary expressions would be needed anyway. Did you mean using indexes on arbitrary expressions to create something like what I proposed above (for "abc", indexing to ["a", "b", "c", "ab", "ac", "abc", "A", "B", "C", "Ab", "aB", "AB", ...])?
Edit: As I mentioned, I am not sure how efficient it would be in practice. If you think it looks okay, I can implement it.
I don't think you'll have to mutate the existing string values, but you'll just have to include all of the string fields in each protein entry.
@maxkfranz Actually, I am not sure we are on the same page. What I mean is not modifying the existing string values, but creating indexes based on substrings of the actual strings. As an example:
// create a secondary multi index for the "name" field here
r.table('uniprot').indexCreate('subnames', entry => {
  // return an array of all substrings here (uppercase only, so the get query
  // must use uppercase too). E.g. if entry('name') is "abc", return
  // ["A", "B", "C", "AB", "BC", "ABC"]
}, {multi: true})
// make a search query for the "name" field here
r.table('uniprot').getAll(searchText.toUpperCase(), {index: 'subnames'})
Do you mean that even returning the substrings array, as above, is not needed? If so, how can I make the search query without using these substrings in the return value of my index function? (For example, it will need to catch "abc" when "bc" is searched.)
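To make the question concrete, here is a minimal in-memory sketch of what such a substring multi index would hold and how a lookup against it behaves. It is plain JavaScript that only emulates the idea: `substringKeys`, `buildIndex` and `lookup` are hypothetical names, a `Map` stands in for the RethinkDB index, and the sample entries are made up.

```javascript
// Hypothetical: all distinct contiguous substrings of `name`, uppercased.
function substringKeys(name) {
  const upper = name.toUpperCase();
  const keys = new Set();
  for (let start = 0; start < upper.length; start++) {
    for (let end = start + 1; end <= upper.length; end++) {
      keys.add(upper.slice(start, end));
    }
  }
  return [...keys];
}

// Emulates a multi index: each key maps to the ids of the entries that produced it.
function buildIndex(entries) {
  const index = new Map();
  for (const entry of entries) {
    for (const key of substringKeys(entry.name)) {
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(entry.id);
    }
  }
  return index;
}

// Exact-match lookup on the uppercased search text, similar in spirit to getAll on the index.
function lookup(index, searchText) {
  return index.get(searchText.toUpperCase()) || [];
}

const index = buildIndex([
  { id: 1, name: 'abc' },
  { id: 2, name: 'TP53' }
]);

console.log(substringKeys('abc')); // [ 'A', 'AB', 'ABC', 'B', 'BC', 'C' ]
console.log(lookup(index, 'bc'));  // [ 1 ]
console.log(lookup(index, 'tp'));  // [ 2 ]
```

A name of length n yields up to n(n+1)/2 keys, so the number of index entries grows quadratically with name length. That is the practical cost to weigh against the faster exact-match lookup.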
BTW, I made a commit for "Step 1".
The separate project has been set up: https://github.com/PathwayCommons/grounding-search
Description
Q: What is the name of the feature?
A: Local grounding services
Q: What does this feature enable the user to do?
A: Ground proteins, genes, chemicals, GO terms
Q: What information must the user provide to use the feature?
A: The name or synonym of the entity
Q: What are the applicable constraints, e.g. compatibility or performance?
A: Must perform at least as quickly as remote services. Ideally, all services should return in less than one second.
Specification
API
element-association services. Each service exports { namespace, search, get, distanceFields }.
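As a rough illustration of the export shape listed above, a service module might look like the sketch below. Only the four property names come from the specification. The meaning given to each (what `search` and `get` accept and return, what `distanceFields` contains), the `example` namespace and the in-memory data are all assumptions made for the example.

```javascript
// Hypothetical in-memory data standing in for a grounding table.
const table = [
  { id: 'X1', name: 'TP53', geneNames: ['TP53', 'P53'] }
];

// Sketch of a service object with the four exported properties.
const exampleService = {
  // Assumed: a label identifying which data source the service covers.
  namespace: 'example',

  // Assumed: fields whose text is compared with the search string when ranking.
  distanceFields: ['name', 'geneNames'],

  // Assumed: resolves to entries whose name contains the text, ignoring case.
  search: async function (text) {
    const needle = text.toLowerCase();
    return table.filter(entry => entry.name.toLowerCase().includes(needle));
  },

  // Assumed: resolves to the entry with the given id, or null if absent.
  get: async function (id) {
    return table.find(entry => entry.id === id) || null;
  }
};

// In a CommonJS module this object would be exported, e.g.
// module.exports = exampleService;
```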
Background