Are there any inaccuracies in this diagram? I'm not sure how `QA` receives `q_format` and `format_function`. Are they passed from NLP?
Good question @2justinmorgan !
Some inaccuracies of the diagram:

- `flask_api` to `QA`: the `/ask` endpoint should use `QA` to generate the final formatted answer, because it will call `QA.answer(...)`, which in turn will call `self._format_answer(..)`. But before we even attempt to format an answer, we need an answer format string (e.g. `"[PROFESSOR] can be contacted at [PROFESSOR..email]"`).
- `flask_api` will call either `NIMBUS_NLP.predict_question(...)` or `Question_Classifier.??` and then pass the result of the NLP functions into `QA`. This step gets the predicted answer format string that we need, which gets passed into `QA` as:

  ```
  (
      input_question,
      normalized_sentence,
      entity,
      answer
  )
  ```
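To make that flow concrete, here is a rough sketch of what the `/ask` handler could look like once wired up this way. The exact signatures (what `predict_question` returns, what `QA.answer` accepts, how `QA` is constructed) are my assumptions for illustration, not the current code:

```python
from flask import Flask, request, jsonify
# assuming NIMBUS_NLP, QA, and the db object are importable inside flask_api.py

app = Flask(__name__)

@app.route("/ask", methods=["POST"])
def ask():
    question = request.get_json()["question"]

    # NLP predicts the normalized question, the entity, and the answer format string
    input_question, normalized_sentence, entity, answer_format = \
        NIMBUS_NLP.predict_question(question)  # or Question_Classifier.??

    # QA turns the NLP output plus a database lookup into the final formatted answer
    qa = QA(db)
    final_answer = qa.answer(input_question, normalized_sentence, entity, answer_format)
    return jsonify({"answer": final_answer})
```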
What additional functionality is needed for QA.py?
`QA.py` is still missing a few things:

- the database connection:

  ```python
  from database_wrapper import NimbusMySQLAlchemy
  db = NimbusMySQLAlchemy()
  ```

- a way to extract the `tag` from the `question_format` (see the sketch after this list)
  - The variable extractor currently returns a tuple https://github.com/calpoly-csai/api/blob/423c1c2da15670a55524b457366dff17985ad5a5/nimbus-nlp/NIMBUS_NLP.py#L41-L44
  - such that an example tuple might look like

    ```python
    (
        "What is Dr. Foaad Khosmood's email?",
        "What is [PROF] email?",
        "Dr. Foaad Khosmood",  # or it might be "Foaad Khosmood"
        "[PROF] can be contacted at [PROF..email]"
    )
    ```

  - so the `tag` in this case is `[PROF]`
- a way to extract the `prop` from the `answer_format`
  - In this case the `prop` is `email`, extracted out of `[PROF..email]`
- a way to map the question `tags` to NimbusDatabase entities
  - For example, some tags are `[COURSE]`, `[PROF]`, `[SECRET_HIDEOUT]`, `[MMC]`, `[DEPT]`, `[CLUBS]`
  - @chidiewenike We discussed how MMC & DEPT might get consolidated.
  - The NimbusDatabase entities are held within the object itself; notice lines 269 to 276 https://github.com/calpoly-csai/api/blob/423c1c2da15670a55524b457366dff17985ad5a5/database_wrapper.py#L267-L279
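A rough sketch of how the `tag` and `prop` could be pulled out of a format string, plus a possible tag-to-entity mapping. The `[TAG..prop]` pattern and the entity names other than `Professors` are assumptions based on the examples in this thread:

```python
import re

# Assumed mapping from tags to NimbusDatabase entities; only Professors is
# confirmed above, the others are placeholders.
TAG_TO_ENTITY = {
    "PROF": "Professors",   # i.e. db.Professors
    "COURSE": "Courses",
    "CLUBS": "Clubs",
}

def parse_answer_format(answer_format):
    """Yield (tag, prop) pairs found in an answer format string.

    '[PROF] can be contacted at [PROF..email]' yields ('PROF', None), ('PROF', 'email').
    """
    for tag, prop in re.findall(r"\[(\w+)(?:\.\.(\w+))?\]", answer_format):
        yield tag, (prop or None)

print(list(parse_answer_format("[PROF] can be contacted at [PROF..email]")))
# -> [('PROF', None), ('PROF', 'email')]
```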
With an entity and a `prop` in hand, `QA.py` could then call:

```python
db.get_property_from_entity(
    prop="email",
    entity=db.Professors,
    entity_string="Foaad Khosmood"
)
```

But that might return an empty list, because our database fields for Professor are:

```
>>> db.Professors.
db.Professors.email       db.Professors.lastName    db.Professors.phoneNumber
db.Professors.firstName   db.Professors.metadata    db.Professors.researchInterests
db.Professors.id          db.Professors.mro(
```

(there is a `firstName` and a `lastName` column, but no single full-name column for `"Foaad Khosmood"` to match)
However, `QA.py` can be smarter by tokenizing the `entity_string`. For example: `"Foaad Khosmood" => ["Foaad", "Khosmood"]`

Next we can try to iterate over each token and call `db.get_property_from_entity` on each token:

```python
db.get_property_from_entity(
    prop="email",
    entity=db.Professors,
    entity_string="Foaad"
)
```

This could actually return some data because `Foaad` will match `db.Professors.firstName`.

```python
db.get_property_from_entity(
    prop="email",
    entity=db.Professors,
    entity_string="Khosmood"
)
```

This could actually return some data because `Khosmood` will match `db.Professors.lastName`.

**But we might still run into a problem if the `entity_string` was `foaad khosmood`**: we would need to handle Title Case, lowercase, and ALL CAPS consistently (see the sketch below).
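Here's a quick sketch of that token-by-token fallback combined with naive case normalization. It assumes `db.get_property_from_entity` keeps the signature shown above and that the stored names are Title Case; a more robust fix would be case-insensitive matching inside `get_property_from_entity` itself:

```python
def lookup_by_tokens(db, prop, entity, entity_string):
    """Try the full entity_string first, then fall back to one token at a time."""
    tokens = entity_string.split()            # "Foaad Khosmood" => ["Foaad", "Khosmood"]
    for candidate in [entity_string] + tokens:
        result = db.get_property_from_entity(
            prop=prop,
            entity=entity,
            entity_string=candidate.title(),  # "foaad" / "FOAAD" / "Foaad" => "Foaad"
        )
        if result:                            # first non-empty result wins
            return result
    return []

# e.g. lookup_by_tokens(db, "email", db.Professors, "foaad khosmood")
```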
@cameron-toy we would greatly appreciate your insights on these ideas for QA.py
@2justinmorgan any thoughts?
> extract out the actual `prop` from the `answer_format`
> In this case the `prop` is `[PROF..email]`
@mfekadu, we can have a dictionary that maps tags extracted from `answer_format` to a certain property (could do entity as well, depending on the format of the tags). `[PROF.email]` would be the email property from the Professor table, `[Course.term]` would be the terms-offered property from the Course table, etc. Need confirmation that the tags are comprehensive and are in the format `[entity.prop]`. This dictionary can belong in `QA.py` or `database_wrapper.py`.
> - have a way to map the question `tags` to NimbusDatabase entities

Similarly, this could be resolved by a dictionary that maps the tags to certain entities. We can have one big dictionary that maps all possible tags and returns both the entity and the properties, or two dictionaries (one for entities, another for properties) that perform the same functionality.
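A minimal sketch of the one-big-dictionary version of this idea (the `[entity.prop]` key format still needs confirmation, and every table/property name below other than `Professors.email` and `Professors.phoneNumber` is illustrative):

```python
# Hypothetical map: tag -> (NimbusDatabase entity, property)
TAG_MAP = {
    "PROF.email": ("Professors", "email"),
    "PROF.phoneNumber": ("Professors", "phoneNumber"),
    "COURSE.term": ("Courses", "termsOffered"),
}

def resolve_tag(tag):
    """'[PROF.email]' -> ('Professors', 'email')"""
    return TAG_MAP[tag.strip("[]")]
```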
@mfekadu The biggest issue I see right now is using the extracted tokens to access the database if there's not a direct 1-to-1 match. The best way to mitigate this issue would be to improve `get_property_from_entity` using the ideas from Issue #62.
I also think tokenizing the extracted data to find a match could be a great idea, and deserves some (rapid) testing.
Finally, I think having the answer format contain [Prof..email] for example could be a problem because it breaks the separation between the database access and answer formatting functions, and also makes it harder to implement multi-variable questions in the future. I've been working on a few basic functions and I believe a partial application model is best.
For example, let's say we have a function `_get_trait` that, well, gets a single trait from an entity. The signature for such a function might look like:

```python
_get_trait(trait, entity, table, extracted_vars)
```

However, this is just a helper function. The actual way this would be passed in to a QA object is through the `get_trait` function, which uses partial application to create a function.

```python
def get_trait(trait, entity, table):
    return functools.partial(_get_trait, trait, entity, table)
```

So if you wanted to create a QA object that got a professor's phone number, you'd pass in

```python
get_trait('phoneNumber', 'firstName', Professors)
```

as the `db_query` for the object.
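To make the partial-application idea concrete, here's one possible shape for the `_get_trait` helper and how the partially-applied `db_query` would be used. The body of `_get_trait` is a stand-in for the real database lookup, and the key used in `extracted_vars` is my assumption:

```python
import functools

def _get_trait(trait, entity, table, extracted_vars):
    """Hypothetical helper: look up the `trait` column of the `table` row
    whose `entity` column matches the value extracted by the NLP layer."""
    match_value = extracted_vars[entity]   # e.g. extracted_vars["firstName"]
    # ...the real implementation would query `table` here...
    return f"<{trait} of the {table} row whose {entity} is {match_value!r}>"

def get_trait(trait, entity, table):
    # Pre-fill everything known when the QA object is built; only the
    # extracted variables are supplied when a question actually comes in.
    return functools.partial(_get_trait, trait, entity, table)

db_query = get_trait("phoneNumber", "firstName", "Professors")  # "Professors" stands in for the table
print(db_query({"firstName": "Foaad"}))
# -> "<phoneNumber of the Professors row whose firstName is 'Foaad'>"
```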
I also made a pair of functions, `_string_sub` and `string_sub`, that let us have a general way to format retrieved answer data.

```python
import functools

def _string_sub(a_format, extracted_vars, db_data):
    """
    Substitutes values in a string based off the contents of the extracted_vars
    and db_data dictionaries. Keys from the dictionaries in the a_format string
    will be replaced with their associated value.

    Example input/output:
        a_format: "{professor1_ex}'s office is {office1_db}."
        extracted_vars: {"professor1": "Dr. Khosmood"}
        db_data: {"office1": "14-213"}
        => "Dr. Khosmood's office is 14-213"

    Args:
        a_format (str): String to be formatted. Variables to be substituted should
            be in curly braces and end in "_ex" for keys from extracted_vars and
            "_db" for keys from db_data.
        extracted_vars (Extracted_Vars)
        db_data (Db_Data)

    Returns:
        A formatted answer string
    """
    # Adds "_ex" to the end of keys in extracted_vars
    extracted_vars = {
        k + "_ex": v for k, v in extracted_vars.items()
    }
    # Adds "_db" to the end of keys in db_data
    db_data = {
        k + "_db": v for k, v in db_data.items()
    }
    return a_format.format(**extracted_vars, **db_data)

def string_sub(a_format):
    return functools.partial(_string_sub, a_format)
```
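For instance, usage would look like this (values taken from the docstring example above):

```python
format_answer = string_sub("{professor1_ex}'s office is {office1_db}.")

# extracted_vars comes from the NLP layer, db_data from the database lookup
print(format_answer({"professor1": "Dr. Khosmood"}, {"office1": "14-213"}))
# -> "Dr. Khosmood's office is 14-213."
```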
Adding the tags to the return tuple from NIMBUS_NLP.py, and a title remover to handle cases where "Professor", "Dr.", etc. are in the professor's name.

```python
return_tuple = (input_question, normalized_sentence,
                entity, answer)
```

will become

```python
return_tuple = (input_question, normalized_sentence,
                entity, tag, answer)
```

Ex:

```python
return_tuple = (
    "Where is Professor Khosmood's office?",
    "Where is [PROF]'s office?",
    "Khosmood",
    "PROF",
    "[PROF]'s office is in [PROF..office]."
)
```
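A minimal sketch of the title remover (the list of titles is illustrative, not exhaustive):

```python
TITLES = {"dr", "dr.", "professor", "prof", "prof."}

def remove_titles(name):
    """Strip honorifics so the name can be matched against the database."""
    return " ".join(t for t in name.split() if t.lower() not in TITLES)

print(remove_titles("Professor Khosmood"))   # -> "Khosmood"
print(remove_titles("Dr. Foaad Khosmood"))   # -> "Foaad Khosmood"
```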
@cameron-toy
Yes! Tokenizing would be awesome! Let's definitely do some rapid testing. What kind of tokenization do you suggest? Like what `nltk.tokenize` does?

You made a fine point too that there should be some layer of separation between the manual tagging of the answer formats and our database fields/code/etc! I've learned recently that what you've described is a software design principle known as "Separation of Concerns." Smart thinking!

I'm a bit confused on a few points:

- What would a call to `get_trait` look like?
- How do we distinguish between `firstName` and `email` when they are both fields in the table?

```python
{
    "entity": "Dr. Foaad Khosmood",
    "tag": "[PROF]",
    "normalized entity": "Foaad Khosmood",
    "input question": "What is Dr. Foaad Khosmood's email?",
    "normalized question": "What is [PROF]'s email?"
}
```
**Objective**
Integrate `QA.py` with `NimbusDatabase` and `flask_api.py`

**Key Result**
Somehow, the magic of Nimbus will just work. 😄

I'm opening this issue for us to discuss.
Comments are welcome!

**Details**

**Additional context**
- `NIMBUS_NLP` looks like this: https://github.com/calpoly-csai/api/blob/b5782d81beffd682287b926a750602a71e29679b/nimbus-nlp/NIMBUS_NLP.py#L17-L44
- There may soon be a `Question_Classifier`: https://github.com/calpoly-csai/api/blob/b5782d81beffd682287b926a750602a71e29679b/nimbus-nlp/NIMBUS_NLP.py#L115-L117
- `QA.py` looks like this: https://github.com/calpoly-csai/api/blob/58b026869f328cd31a18e9e32ec09b6cebc906b3/QA.py#L9-L53
- `NimbusDatabase` has this function: https://github.com/calpoly-csai/api/blob/58b026869f328cd31a18e9e32ec09b6cebc906b3/database_wrapper.py#L352-L394
- `flask_api.py` has the `/ask` endpoint: https://github.com/calpoly-csai/api/blob/58b026869f328cd31a18e9e32ec09b6cebc906b3/flask_api.py#L43-L71