Are there any inaccuracies in this diagram? I'm not sure how `QA` receives `q_format` and `format_function`. Are they passed from NLP?
Good question @2justinmorgan !
Some inaccuracies of the diagram:

- `flask_api` to `QA`: the `/ask` endpoint should use `QA` to generate the final formatted answer, because it will call `QA.answer(...)`, which in turn will call `self._format_answer(..)`. But before we even attempt to format an answer, we need an answer format string (e.g. `"[PROFESSOR] can be contacted at [PROFESSOR..email]"`).
- `flask_api` will call either `NIMBUS_NLP.predict_question(...)` or `Question_Classifier.??` and then pass the result of the NLP functions into `QA`. This step gets the predicted answer format string that we need, which gets passed into `QA` as:

  ```
  (
      input_question,
      normalized_sentence,
      entity,
      answer
  )
  ```
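To make that flow concrete, here is a rough sketch of what the `/ask` handler could look like once wired up this way. The exact signatures (what `predict_question` returns, what `QA.answer` accepts, how `QA` is constructed) are my assumptions for illustration, not the current code:

```python
from flask import Flask, request, jsonify
# assuming NIMBUS_NLP, QA, and the db object are importable inside flask_api.py

app = Flask(__name__)

@app.route("/ask", methods=["POST"])
def ask():
    question = request.get_json()["question"]

    # NLP predicts the normalized question, the entity, and the answer format string
    input_question, normalized_sentence, entity, answer_format = \
        NIMBUS_NLP.predict_question(question)  # or Question_Classifier.??

    # QA turns the NLP output plus a database lookup into the final formatted answer
    qa = QA(db)
    final_answer = qa.answer(input_question, normalized_sentence, entity, answer_format)
    return jsonify({"answer": final_answer})
```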
What additional functionality is needed for QA.py?
`QA.py` is still missing a few things:

- the database connection:

  ```python
  from database_wrapper import NimbusMySQLAlchemy
  db = NimbusMySQLAlchemy()
  ```

- a way to extract the `tag` from the `question_format` (see the sketch after this list)
  - The variable extractor currently returns a tuple https://github.com/calpoly-csai/api/blob/423c1c2da15670a55524b457366dff17985ad5a5/nimbus-nlp/NIMBUS_NLP.py#L41-L44
  - such that an example tuple might look like

    ```python
    (
        "What is Dr. Foaad Khosmood's email?",
        "What is [PROF] email?",
        "Dr. Foaad Khosmood",  # or it might be "Foaad Khosmood"
        "[PROF] can be contacted at [PROF..email]"
    )
    ```

  - so the `tag` in this case is `[PROF]`
- a way to extract the `prop` from the `answer_format`
  - In this case the `prop` is `email`, extracted out of `[PROF..email]`
- a way to map the question `tags` to NimbusDatabase entities
  - For example, some tags are `[COURSE]`, `[PROF]`, `[SECRET_HIDEOUT]`, `[MMC]`, `[DEPT]`, `[CLUBS]`
  - @chidiewenike We discussed how MMC & DEPT might get consolidated.
  - The NimbusDatabase entities are held within the object itself; notice lines 269 to 276 https://github.com/calpoly-csai/api/blob/423c1c2da15670a55524b457366dff17985ad5a5/database_wrapper.py#L267-L279
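A rough sketch of how the `tag` and `prop` could be pulled out of a format string, plus a possible tag-to-entity mapping. The `[TAG..prop]` pattern and the entity names other than `Professors` are assumptions based on the examples in this thread:

```python
import re

# Assumed mapping from tags to NimbusDatabase entities; only Professors is
# confirmed above, the others are placeholders.
TAG_TO_ENTITY = {
    "PROF": "Professors",   # i.e. db.Professors
    "COURSE": "Courses",
    "CLUBS": "Clubs",
}

def parse_answer_format(answer_format):
    """Yield (tag, prop) pairs found in an answer format string.

    '[PROF] can be contacted at [PROF..email]' yields ('PROF', None), ('PROF', 'email').
    """
    for tag, prop in re.findall(r"\[(\w+)(?:\.\.(\w+))?\]", answer_format):
        yield tag, (prop or None)

print(list(parse_answer_format("[PROF] can be contacted at [PROF..email]")))
# -> [('PROF', None), ('PROF', 'email')]
```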
With an entity and a `prop` in hand, `QA.py` could then call:

```python
db.get_property_from_entity(
    prop="email",
    entity=db.Professors,
    entity_string="Foaad Khosmood"
)
```

But that might return an empty list, because our database fields for Professor are:

```
>>> db.Professors.
db.Professors.email       db.Professors.lastName    db.Professors.phoneNumber
db.Professors.firstName   db.Professors.metadata    db.Professors.researchInterests
db.Professors.id          db.Professors.mro(
```

(there is a `firstName` and a `lastName` column, but no single full-name column for `"Foaad Khosmood"` to match)
However, `QA.py` can be smarter by tokenizing the `entity_string`. For example: `"Foaad Khosmood" => ["Foaad", "Khosmood"]`

Next we can try to iterate over each token and call `db.get_property_from_entity` on each token:

```python
db.get_property_from_entity(
    prop="email",
    entity=db.Professors,
    entity_string="Foaad"
)
```

This could actually return some data because `Foaad` will match `db.Professors.firstName`.

```python
db.get_property_from_entity(
    prop="email",
    entity=db.Professors,
    entity_string="Khosmood"
)
```

This could actually return some data because `Khosmood` will match `db.Professors.lastName`.

**But we might still run into a problem if the `entity_string` was `foaad khosmood`**: we would need to handle Title Case, lowercase, and ALL CAPS consistently (see the sketch below).
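Here's a quick sketch of that token-by-token fallback combined with naive case normalization. It assumes `db.get_property_from_entity` keeps the signature shown above and that the stored names are Title Case; a more robust fix would be case-insensitive matching inside `get_property_from_entity` itself:

```python
def lookup_by_tokens(db, prop, entity, entity_string):
    """Try the full entity_string first, then fall back to one token at a time."""
    tokens = entity_string.split()            # "Foaad Khosmood" => ["Foaad", "Khosmood"]
    for candidate in [entity_string] + tokens:
        result = db.get_property_from_entity(
            prop=prop,
            entity=entity,
            entity_string=candidate.title(),  # "foaad" / "FOAAD" / "Foaad" => "Foaad"
        )
        if result:                            # first non-empty result wins
            return result
    return []

# e.g. lookup_by_tokens(db, "email", db.Professors, "foaad khosmood")
```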
@cameron-toy we would greatly appreciate your insights on these ideas for QA.py
@2justinmorgan any thoughts?
> extract out the actual `prop` from the `answer_format`
> In this case the `prop` is `[PROF..email]`
@mfekadu, we can have a dictionary that maps tags extracted from `answer_format` to a certain property (could do entity as well, depending on the format of the tags). `[PROF.email]` would be the email property from the Professor table, `[Course.term]` would be the terms-offered property from the Course table, etc. Need confirmation that the tags are comprehensive and are in the format `[entity.prop]`. This dictionary can belong in `QA.py` or `database_wrapper.py`.
> - have a way to map the question `tags` to NimbusDatabase entities

Similarly, this could be resolved by a dictionary that maps the tags to certain entities. We can have one big dictionary that maps all possible tags and returns both the entity and the properties, or two dictionaries (one for entities, another for properties) that perform the same functionality.
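A minimal sketch of the one-big-dictionary version of this idea (the `[entity.prop]` key format still needs confirmation, and every table/property name below other than `Professors.email` and `Professors.phoneNumber` is illustrative):

```python
# Hypothetical map: tag -> (NimbusDatabase entity, property)
TAG_MAP = {
    "PROF.email": ("Professors", "email"),
    "PROF.phoneNumber": ("Professors", "phoneNumber"),
    "COURSE.term": ("Courses", "termsOffered"),
}

def resolve_tag(tag):
    """'[PROF.email]' -> ('Professors', 'email')"""
    return TAG_MAP[tag.strip("[]")]
```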
@mfekadu The biggest issue I see right now is using the extracted tokens to access the database if there's not a direct 1-to-1 match. The best way to mitigate this issue would be to improve `get_property_from_entity` using the ideas from Issue #62.
I also think tokenizing the extracted data to find a match could be a great idea, and deserves some (rapid) testing.
Finally, I think having the answer format contain [Prof..email] for example could be a problem because it breaks the separation between the database access and answer formatting functions, and also makes it harder to implement multi-variable questions in the future. I've been working on a few basic functions and I believe a partial application model is best.
For example, let's say we have a function `_get_trait` that, well, gets a single trait from an entity. The signature for such a function might look like:

```python
_get_trait(trait, entity, table, extracted_vars)
```

However, this is just a helper function. The actual way this would be passed in to a QA object is through the `get_trait` function, which uses partial application to create a function.

```python
def get_trait(trait, entity, table):
    return functools.partial(_get_trait, trait, entity, table)
```

So if you wanted to create a QA object that got a professor's phone number, you'd pass in

```python
get_trait('phoneNumber', 'firstName', Professors)
```

as the `db_query` for the object.
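To make the partial-application idea concrete, here's one possible shape for the `_get_trait` helper and how the partially-applied `db_query` would be used. The body of `_get_trait` is a stand-in for the real database lookup, and the key used in `extracted_vars` is my assumption:

```python
import functools

def _get_trait(trait, entity, table, extracted_vars):
    """Hypothetical helper: look up the `trait` column of the `table` row
    whose `entity` column matches the value extracted by the NLP layer."""
    match_value = extracted_vars[entity]   # e.g. extracted_vars["firstName"]
    # ...the real implementation would query `table` here...
    return f"<{trait} of the {table} row whose {entity} is {match_value!r}>"

def get_trait(trait, entity, table):
    # Pre-fill everything known when the QA object is built; only the
    # extracted variables are supplied when a question actually comes in.
    return functools.partial(_get_trait, trait, entity, table)

db_query = get_trait("phoneNumber", "firstName", "Professors")  # "Professors" stands in for the table
print(db_query({"firstName": "Foaad"}))
# -> "<phoneNumber of the Professors row whose firstName is 'Foaad'>"
```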
I also made a pair of functions, `_string_sub` and `string_sub`, that let us have a general way to format retrieved answer data.

```python
import functools

def _string_sub(a_format, extracted_vars, db_data):
    """
    Substitutes values in a string based off the contents of the extracted_vars
    and db_data dictionaries. Keys from the dictionaries in the a_format string
    will be replaced with their associated value.

    Example input/output:
        a_format: "{professor1_ex}'s office is {office1_db}."
        extracted_vars: {"professor1": "Dr. Khosmood"}
        db_data: {"office1": "14-213"}
        => "Dr. Khosmood's office is 14-213"

    Args:
        a_format (str): String to be formatted. Variables to be substituted should
            be in curly braces and end in "_ex" for keys from extracted_vars and
            "_db" for keys from db_data.
        extracted_vars (Extracted_Vars)
        db_data (Db_Data)

    Returns:
        A formatted answer string
    """
    # Adds "_ex" to the end of keys in extracted_vars
    extracted_vars = {
        k + "_ex": v for k, v in extracted_vars.items()
    }
    # Adds "_db" to the end of keys in db_data
    db_data = {
        k + "_db": v for k, v in db_data.items()
    }
    return a_format.format(**extracted_vars, **db_data)

def string_sub(a_format):
    return functools.partial(_string_sub, a_format)
```
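For instance, usage would look like this (values taken from the docstring example above):

```python
format_answer = string_sub("{professor1_ex}'s office is {office1_db}.")

# extracted_vars comes from the NLP layer, db_data from the database lookup
print(format_answer({"professor1": "Dr. Khosmood"}, {"office1": "14-213"}))
# -> "Dr. Khosmood's office is 14-213."
```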
Adding the tags to the return tuple from NIMBUS_NLP.py, and a title remover to handle cases where "Professor", "Dr.", etc. are in the professor's name.

```python
return_tuple = (input_question, normalized_sentence,
                entity, answer)
```

will become

```python
return_tuple = (input_question, normalized_sentence,
                entity, tag, answer)
```

Ex:

```python
return_tuple = (
    "Where is Professor Khosmood's office?",
    "Where is [PROF]'s office?",
    "Khosmood",
    "PROF",
    "[PROF]'s office is in [PROF..office]."
)
```
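A minimal sketch of the title remover (the list of titles is illustrative, not exhaustive):

```python
TITLES = {"dr", "dr.", "professor", "prof", "prof."}

def remove_titles(name):
    """Strip honorifics so the name can be matched against the database."""
    return " ".join(t for t in name.split() if t.lower() not in TITLES)

print(remove_titles("Professor Khosmood"))   # -> "Khosmood"
print(remove_titles("Dr. Foaad Khosmood"))   # -> "Foaad Khosmood"
```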
@cameron-toy
Yes! Tokenizing would be awesome! Let's definitely do some rapid testing. What kind of tokenization do you suggest? Like what `nltk.tokenize` does?

You made a fine point too that there should be some layer of separation between the manual tagging of the answer formats and our database fields/code/etc! I've learned recently that what you've described is a software design principle known as "Separation of Concerns." Smart thinking!

I'm a bit confused on a few points:

- What would a call to `get_trait` look like?
- How do we distinguish between `firstName` and `email` when they are both fields in the table?

```python
{
    "entity": "Dr. Foaad Khosmood",
    "tag": "[PROF]",
    "normalized entity": "Foaad Khosmood",
    "input question": "What is Dr. Foaad Khosmood's email?",
    "normalized question": "What is [PROF]'s email?"
}
```
**Objective**
Integrate `QA.py` with `NimbusDatabase` and `flask_api.py`

**Key Result**
Somehow, the magic of Nimbus will just work. 😄

I'm opening this issue for us to discuss.
Comments are welcome!

**Details**

**Additional context**
- `NIMBUS_NLP` looks like this: https://github.com/calpoly-csai/api/blob/b5782d81beffd682287b926a750602a71e29679b/nimbus-nlp/NIMBUS_NLP.py#L17-L44
- There may soon be a `Question_Classifier`: https://github.com/calpoly-csai/api/blob/b5782d81beffd682287b926a750602a71e29679b/nimbus-nlp/NIMBUS_NLP.py#L115-L117
- `QA.py` looks like this: https://github.com/calpoly-csai/api/blob/58b026869f328cd31a18e9e32ec09b6cebc906b3/QA.py#L9-L53
- `NimbusDatabase` has this function: https://github.com/calpoly-csai/api/blob/58b026869f328cd31a18e9e32ec09b6cebc906b3/database_wrapper.py#L352-L394
- `flask_api.py` has the `/ask` endpoint: https://github.com/calpoly-csai/api/blob/58b026869f328cd31a18e9e32ec09b6cebc906b3/flask_api.py#L43-L71