mfekadu commented 4 years ago

Describe the bug

The get_property_from_entity is effective at finding/filtering for data that contains some string, but it works gets a lot of false positives.

The issue may be that

the name of the function is wrong.... perhaps consider find_entity_that_contains(entity_string)
something else?

To Reproduce

Pre-condition

Create a `Courses` table in MySQL

CREATE TABLE `Courses` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `dept` varchar(5) DEFAULT NULL,
  `courseNum` int(11) DEFAULT NULL,
  `termsOffered` set('F','W','SP','SU','TBD') DEFAULT NULL,
  `units` varchar(5) DEFAULT NULL,
  `courseName` varchar(255) DEFAULT NULL,
  `raw_concurrent_text` text,
  `raw_recommended_text` text,
  `raw_prerequisites_text` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=179 DEFAULT CHARSET=latin1;

Populate the table with at least the following

INSERT INTO `Courses` (`id`, `dept`, `courseNum`, `termsOffered`, `units`, `courseName`, `raw_prerequisites_text`, `raw_concurrent_text`, `raw_recommended_text`)
VALUES
    (2, 'CPE', 101, 'F,W,SP', '4', 'CPE 101. Fundamentals of Computer Science.', 'Appropriate Math Placement Level; or MATH 117 with a grade of C- or better; or MATH 118 with a grade of C- or better; or consent of instructor.', 'NA_CONCURRENT_TEXT', 'NA_RECOMMENDED_TEXT');

INSERT INTO `Courses` (`id`, `dept`, `courseNum`, `termsOffered`, `units`, `courseName`, `raw_concurrent_text`, `raw_recommended_text`, `raw_prerequisites_text`)
VALUES
    (18, 'CPE', 333, 'SP', '4', 'CPE 333. Computer Hardware Architecture and Design.', 'NA_CONCURRENT_TEXT', 'NA_RECOMMENDED_TEXT', 'CPE 101, CPE 233.'),
    (75, 'CSC', 209, 'TBD', '1', 'CSC 209. Problem Solving with Computers.', 'NA_CONCURRENT_TEXT', 'NA_RECOMMENDED_TEXT', 'CSC/CPE 101 or CSC/CPE 108 with a grade of C- or better, or consent of instructor.');

Steps

>>> from database_wrapper import NimbusMySQLAlchemy
>>> import Entity
>>> db = NimbusMySQLAlchemy()
initialized database session
initialized NimbusMySQLAlchemy
>>>
>>> # now let's try to answer the question "What is CPE 101?"
>>>
>>> db.get_property_from_entity(
...     prop="courseName",
...     entity=Entity.Courses.Courses,
...     entity_string="CPE 101"
... )
['CPE 101. Fundamentals of Computer Science.', 'CPE 333. Computer Hardware Architecture and Design.', 'CSC 209. Problem Solving with Computers.']
>>>

Expected behavior

When calling get_property_from_entity('courseName', Courses, 'CPE 101') the result perhaps should be only ['CPE 101. Fundamentals of Computer Science.']

However The results CPE 333... and CSC 209... would be correct responses for the question "What are the prerequisites for CPE 101"

Additional context

The results `CPE 333...` and `CSC 209...` seem to correspond to the following SQL

while `CPE 101...` is found by

mfekadu commented 4 years ago

@cameron-toy I would appreciate your insights on the data to help resolve this issue. Perhaps the QA.py can be cognizant of false positives and do an extra level of filtering?

zpdeng commented 4 years ago

@mfekadu The reason we are getting false positives is because the get_property_from_entity function is checking every column/property to match or perform conditionals with the entity_string. This was due to the fact that we don't know which property/column to look for based solely off of the entity string.

A good (perhaps required) resolution is to have an extra parameter that indicate which column/property that we want to perform the match/conditional by. This could be done either in the API, QA, or ML class and will avoid future false positives cases (ie returning entries about "Jackson Lee" and "George Jackson" when entity_string is "Jackson")

mfekadu commented 4 years ago

A good (perhaps required) resolution is to have an extra parameter that indicate which column/property that we want to perform the match/conditional by. This could be done either in the API, QA, or ML class and will avoid future false positives cases (ie returning entries about "Jackson Lee" and "George Jackson" when entity_string is "Jackson")

@zpdeng can you provide an example of your proposed solution for

if modifying the flask_api.py, then please mention:
- which function?
- what will be an example input?
- what will be an example output?
if modifying the QA.py, then please mention:
- which function?
- what will be an example input?
- what will be an example output?
if modifying the code inside of the nimbus-nlp folder, then please mention:
- which function?
- what will be an example input?
- what will be an example output?

mfekadu commented 4 years ago

@zpdeng please correct me if I’ve misunderstood, however I think you mean to add an extra parameter into the get_property_from_entity function because a more strict search might resolve this bug by enforcing that specific questions map directly to a specific query

Have I interpreted you comment correctly?

If so, then here is another perspective.

It is possible that this bug is truly a feature 😅 here’s why...

The NimbusDatabase wrapper has no clue what particular question the user has asked, nor does it need to.

Right now, we’ve created a function that can broadly search the database for a somewhat relevant answer, but what we lack is a way to filter for or sort by the most relevant results.

The Google search engine does a good job of bubbling up the most relevant results. Perhaps something similar to the PageRank algorithm might help us? [1]

A tool known as ElasticSearch implements a relevance scoring algorithm. Perhaps we can implement that or even use ElasticSearch directly? [2] [3] [4] [5] [6]

ElasticSearch also implements a ton of NLP techniques, which we could implement ourselves within our custom searching algorithm like [6]

tokenization
n-grams
lemmatization
TF-IDF
Lexical Similarity
Semantic Similarity

@adamperlin what are your thoughts?

zpdeng commented 4 years ago

@mfekadu Yep exactly what you mentioned in the first paragraph. We already have the extra parameter we need called prop in get_property_from_entity that does strict search. Referencing to your comment this issue [https://github.com/calpoly-csai/api/issues/65], we need to have QA.py to extract out the actual prop from the answer_format (under functionality 4).

This is could be consider a feature or function that selects all entries that contains the entity_string in any property. I don't think we need PageRank algorithm or other techniques to resolve this and can all be resolved in the QA module

cameron-toy commented 4 years ago

@mfekadu We could also create an identifying names/aliases column for each entity. That way, get_property_from_entity only needs to search that column.

Two possible problems I see with that is having to manually enter aliases if we create a new column, or having an existing single column be insufficient for all possible ways to refer to that entity (CS 202 vs. computer science 202 vs. CSC 202 vs. Data Structures etc).

adamperlin commented 4 years ago

@cameron-toy good thinking -- I think your idea is definitely in line with where we need to go.

@mfekadu Here is what I know for sure: we need to concretely determine how we index entities in the database to finish the MVP.

Quick (but important) point of clarification: When I say entity, I am talking about a row in the Courses, Professors, Clubs, etc. tables. The tables themselves are lists of entities.

I agree with @zpdeng and think you spelled out the issue in your comment: given a string which identifies some entity, say entity_string="Jackson", we don't have a good way of determining which fields in the database we should attempt to match with to determine the rows of data (and therefore the entity data) we need to complete a query.

Where we need to go

We have to start thinking about indexing. Not all columns in our entity data tables are used in the same way. For instance, each row in the Professors table has firstName and lastName columns. We use those columns entirely for indexing purposes; i.e., we are matching against them to find relevant rows, not looking up the data that's in them. In database terminology, they tag specific rows as having data that belongs to an entity.

So, to move forward we need to identify, for each kind of entity, what the tag columns are. These are the ways we are indexing rows, and this varies based on the kind of entities in the data table. For Professor, that's probably firstName and lastName, since we identify professors by name. For Courses, probably the name and number, since that is how we are generally referring to courses.

Then, we need to make some kind of indexing scheme that allows us to search for a value across all tag columns, to find the relevant associated rows (the associated entities). And yes, it's possible that multiple rows could be matched. For instance, if someone asks the question: What are Smith's office hours, and we have entries for John Smith, Smith Johnson, and Jack Smith in the Professor's table, we've only been given enough information to narrow it down to those 3 options.

So how do we do this?

Implement some kind of searching scheme on our own. This would be along the lines of what @cameron-toy was suggesting. This could be quite a lot of work to do well, and we could end up reinventing the wheel. Could we make a pretty crude one in the next few days though? Possibly.
Use a tool like elastic search to do this for us. I haven't had time to experiment with elasticsearch, but from what I've seen it is a compelling possibility. However, it could involve migrating our data over to an entirely different platform which brings a lot of complexities with it.

Getting advice from Dr. Khosmood about this is probably a must at this point.

Sorry for the essay here; I just wanted to make sure we were all on the same page. Let me know if there's anything I can clarify. We definitely need to pick an approach and start on it in order to finish the MVP in time!

mfekadu commented 4 years ago

Fantastic comments!

Thank you for the essay @adamperlin ! That’s exactly the l kind of intelligent discussion we needed to bring clarity to this issue. You make a fantastic point about fields that are used to identify a particular entity and described it intuitively.

Some questions for further discussion

Would these identifying fields be a subset of a tables “candidate key”?
Could we simply try to do the SQLAlchemy equivalent of CREATE INDEX in MySQL?

If the answer to either of the above questions is yes, then the code to resolve this issue would be partially in the Entity/ folder.

calpoly-csai / api

False Positive Results from get_property_from_entity #62

Describe the bug

To Reproduce

Pre-condition

Create a `Courses` table in MySQL

Populate the table with at least the following

Steps

Expected behavior

Additional context

The results `CPE 333...` and `CSC 209...` seem to correspond to the following SQL

while `CPE 101...` is found by

Where we need to go

So how do we do this?

calpoly-csai / api

False Positive Results from get_property_from_entity #62

Describe the bug

To Reproduce

Pre-condition

Create a Courses table in MySQL

Populate the table with at least the following

Steps

Expected behavior

Additional context

The results CPE 333... and CSC 209... seem to correspond to the following SQL

while CPE 101... is found by

Where we need to go

So how do we do this?

Create a `Courses` table in MySQL

The results `CPE 333...` and `CSC 209...` seem to correspond to the following SQL

while `CPE 101...` is found by