Closed mfekadu closed 4 years ago
@cameron-toy I would appreciate your insights on the data to help resolve this issue. Perhaps QA.py could be aware of false positives and do an extra level of filtering?
@mfekadu We are getting false positives because the get_property_from_entity function checks every column/property when matching or evaluating conditionals against the entity_string. It does this because we don't know which property/column to look at based solely on the entity string.
A good (perhaps required) resolution is to add an extra parameter that indicates which column/property we want to match or apply the conditional against. This could be done in the API, QA, or ML class and would avoid future false positives (e.g., returning entries about "Jackson Lee" and "George Jackson" when entity_string is "Jackson").
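As a rough illustration of the proposal above, here is a minimal Python sketch that compares a search over every column with a search restricted to one column. It does not use the real NimbusDatabase code: the function find_rows, the parameter match_column, and the sample rows (including the office values) are all hypothetical.

```python
from typing import Optional

# Hypothetical in-memory rows standing in for a Professors table.
PROFESSORS = [
    {"firstName": "Jackson", "lastName": "Lee", "office": "14-201"},
    {"firstName": "George", "lastName": "Jackson", "office": "14-210"},
    {"firstName": "Maria", "lastName": "Lopez", "office": "Jackson Hall 3"},
]


def find_rows(rows, entity_string, match_column: Optional[str] = None):
    """Return rows whose match_column contains entity_string.

    When match_column is None, every column is searched, which mirrors
    the broad behavior described in this issue.
    """
    needle = entity_string.lower()
    matches = []
    for row in rows:
        # Search one named column, or fall back to all columns.
        columns = [match_column] if match_column else list(row.keys())
        if any(needle in str(row[col]).lower() for col in columns):
            matches.append(row)
    return matches


# Broad search: matches on firstName, lastName, and office.
broad = find_rows(PROFESSORS, "Jackson")
# Restricted search: only the lastName column is considered.
strict = find_rows(PROFESSORS, "Jackson", match_column="lastName")
```

Restricting the column removes the match on the office field, but the caller still has to know which column to pass, which is the open question in this thread.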
@zpdeng can you provide an example of your proposed solution for
flask_api.py, then please mention:
QA.py, then please mention:
nimbus-nlp folder, then please mention:
@zpdeng please correct me if I’ve misunderstood, but I think you mean to add an extra parameter to the get_property_from_entity function, because a stricter search might resolve this bug by enforcing that specific questions map directly to a specific query.
Have I interpreted your comment correctly?
If so, then here is another perspective.
It is possible that this bug is truly a feature 😅 here’s why...
The NimbusDatabase wrapper has no clue what particular question the user has asked, nor does it need to.
Right now, we’ve created a function that can broadly search the database for a somewhat relevant answer, but what we lack is a way to filter for or sort by the most relevant results.
The Google search engine does a good job of bubbling up the most relevant results. Perhaps something similar to the PageRank algorithm might help us? [1]
A tool known as ElasticSearch implements a relevance scoring algorithm. Perhaps we can implement that or even use ElasticSearch directly? [2] [3] [4] [5] [6]
ElasticSearch also implements a ton of NLP techniques, which we could implement ourselves within our custom searching algorithm like [6]
@adamperlin what are your thoughts?
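To make the "sort by the most relevant results" idea concrete, here is a small Python sketch of a hand-rolled relevance score. It is neither PageRank nor ElasticSearch's scoring, just a simple three-tier heuristic, and the function names and sample values are hypothetical.

```python
def relevance(value: str, query: str) -> int:
    """Score one field value against a single-word query (higher is better)."""
    v, q = value.lower(), query.lower()
    if v == q:
        return 3  # the whole field equals the query
    if q in v.split():
        return 2  # the query is one whole word in the field
    if q in v:
        return 1  # the query only appears inside a longer word
    return 0      # no match at all


def rank(values, query):
    """Return the matching values, most relevant first."""
    scored = [(relevance(v, query), v) for v in values]
    # sorted() is stable, so ties keep their original order.
    ordered = sorted(scored, key=lambda pair: -pair[0])
    return [v for score, v in ordered if score > 0]


names = ["Jacksonville Club", "George Jackson", "Jackson", "Maria Lopez"]
ranked = rank(names, "Jackson")
```

With a score like this the broad search can stay as it is, and the caller decides how many of the top results to keep.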
@mfekadu Yep, exactly what you mentioned in the first paragraph. We already have the extra parameter we need, called prop, in get_property_from_entity, which does a strict search. Referring to your comment on this issue [https://github.com/calpoly-csai/api/issues/65], we need QA.py to extract the actual prop from the answer_format (under functionality 4).
This could be considered a feature: a function that selects all entries that contain the entity_string in any property. I don't think we need the PageRank algorithm or other techniques to resolve this; it can all be resolved in the QA module.
@mfekadu We could also create an identifying names/aliases column for each entity. That way, get_property_from_entity only needs to search that column.
Two possible problems I see with that are having to manually enter aliases if we create a new column, or having an existing single column be insufficient for all possible ways to refer to that entity (CS 202 vs. computer science 202 vs. CSC 202 vs. Data Structures etc).
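A minimal Python sketch of the alias idea, assuming the aliases are stored per entity and inverted into a lookup table. The dictionary below stands in for the proposed column; the helper names and the choice of "CSC 202" as the canonical key are hypothetical.

```python
# Hypothetical alias data; in the database this would be an aliases column.
COURSE_ALIASES = {
    "CSC 202": ["CSC 202", "CS 202", "computer science 202", "Data Structures"],
}


def build_alias_index(aliases_by_entity):
    """Invert {entity: [aliases]} into {normalized alias: [entities]}."""
    index = {}
    for entity, aliases in aliases_by_entity.items():
        for alias in aliases:
            index.setdefault(alias.strip().lower(), []).append(entity)
    return index


def lookup(index, entity_string):
    """Return the entities whose alias equals entity_string, ignoring case."""
    return index.get(entity_string.strip().lower(), [])


index = build_alias_index(COURSE_ALIASES)
```

Because the lookup is an exact match on a normalized alias, any phrasing that nobody entered by hand returns nothing, which is the first problem mentioned above.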
@cameron-toy good thinking -- I think your idea is definitely in line with where we need to go.
@mfekadu Here is what I know for sure: we need to concretely determine how we index entities in the database to finish the MVP.
Quick (but important) point of clarification: When I say entity, I am talking about a row in the Courses, Professors, Clubs, etc. tables. The tables themselves are lists of entities.
I agree with @zpdeng and think you spelled out the issue in your comment: given a string which identifies some entity, say entity_string="Jackson", we don't have a good way of determining which fields in the database we should attempt to match with to determine the rows of data (and therefore the entity data) we need to complete a query.
We have to start thinking about indexing. Not all columns in our entity data tables are used in the same way. For instance, each row in the Professors table has firstName and lastName columns. We use those columns entirely for indexing purposes; i.e., we are matching against them to find relevant rows, not looking up the data that's in them. In database terminology, they tag specific rows as having data that belongs to an entity.
So, to move forward we need to identify, for each kind of entity, what the tag columns are. These are the ways we are indexing rows, and this varies based on the kind of entities in the data table. For Professor, that's probably firstName and lastName, since we identify professors by name. For Courses, probably the name and number, since that is how we are generally referring to courses.
Then, we need to make some kind of indexing scheme that allows us to search for a value across all tag columns, to find the relevant associated rows (the associated entities). And yes, it's possible that multiple rows could be matched. For instance, if someone asks the question "What are Smith's office hours", and we have entries for John Smith, Smith Johnson, and Jack Smith in the Professors table, we've only been given enough information to narrow it down to those 3 options.
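The tag-column scheme described above could look roughly like this in Python. It is only a sketch under assumptions: TAG_COLUMNS, find_entities, the courseNumber column name, and the sample rows (including the officeHours values) are hypothetical and not taken from the actual schema.

```python
# Assumed tag columns per kind of entity (column names are illustrative).
TAG_COLUMNS = {
    "Professors": ["firstName", "lastName"],
    "Courses": ["courseName", "courseNumber"],
}

# Hypothetical rows for the Smith example.
PROFESSORS = [
    {"firstName": "John", "lastName": "Smith", "officeHours": "Mon 10-11"},
    {"firstName": "Smith", "lastName": "Johnson", "officeHours": "Tue 1-2"},
    {"firstName": "Jack", "lastName": "Smith", "officeHours": "Wed 3-4"},
    {"firstName": "Ann", "lastName": "Blacksmith", "officeHours": "Smith Hall"},
]


def find_entities(table_name, rows, entity_string):
    """Return rows where any tag column equals entity_string, ignoring case.

    Only tag columns are compared, so data columns such as officeHours
    can never cause a match.
    """
    needle = entity_string.strip().lower()
    tags = TAG_COLUMNS[table_name]
    return [
        row for row in rows
        if any(str(row.get(col, "")).lower() == needle for col in tags)
    ]


candidates = find_entities("Professors", PROFESSORS, "Smith")
```

Here the search narrows the table to the three Smith rows and ignores the row that only mentions "Smith" in a data column; choosing among the three still needs more information from the question.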
Getting advice from Dr. Khosmood about this is probably a must at this point.
Sorry for the essay here; I just wanted to make sure we were all on the same page. Let me know if there's anything I can clarify. We definitely need to pick an approach and start on it in order to finish the MVP in time!
Fantastic comments!
Thank you for the essay @adamperlin! That’s exactly the kind of intelligent discussion we needed to bring clarity to this issue. You make a fantastic point about fields that are used to identify a particular entity and described it intuitively.
Some questions for further discussion
CREATE INDEX in MySQL?
If the answer to either of the above questions is yes, then the code to resolve this issue would be partially in the Entity/ folder.
Describe the bug
The get_property_from_entity function is effective at finding/filtering for data that contains some string, but it gets a lot of false positives.
The issue may be that find_entity_that_contains(entity_string)
To Reproduce
Pre-condition
Create a Courses table in MySQL
Populate the table with at least the following
Steps
Expected behavior
When calling get_property_from_entity('courseName', Courses, 'CPE 101') the result perhaps should be only ['CPE 101. Fundamentals of Computer Science.']
However, the results CPE 333... and CSC 209... would be correct responses for the question "What are the prerequisites for CPE 101"
Additional context
The results CPE 333... and CSC 209... seem to correspond to the following SQL
while CPE 101... is found by