IanTeo / cs3219-Project


VenueCommand not accurate #22

Closed IanTeo closed 7 years ago

IanTeo commented 7 years ago

When using the venue command to retrieve a venue such as "ICSE", .contains matching returns venues like "ITiCSE" and "ICSEA", which are obviously not the correct venue, while .equals matching does not return venues like "USER@ICSE", which obviously is the correct venue.

(Both already ignore case.)

I propose we split the venue name on whitespace and punctuation, [\s\p{Punct}]+, and look for an exact match among the resulting words. That way, we cover both of the cases I discovered. If anyone discovers any other weird behavior in VenueCommand, please add it to this issue.
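A minimal sketch of the proposed matching, to make the idea concrete. The class name VenueMatcher and the method matches are hypothetical and not taken from the project; only the regex [\s\p{Punct}]+ and the case-insensitive comparison come from this issue.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, not the project's actual VenueCommand code.
class VenueMatcher {
    // Delimiters proposed in this issue: runs of whitespace and ASCII punctuation.
    private static final Pattern DELIMITERS = Pattern.compile("[\\s\\p{Punct}]+");

    /** Returns true if any delimiter-separated word of venue equals query, ignoring case. */
    static boolean matches(String venue, String query) {
        return Arrays.stream(DELIMITERS.split(venue))
                .anyMatch(word -> word.equalsIgnoreCase(query));
    }

    public static void main(String[] args) {
        for (String venue : new String[] {"ICSE", "USER@ICSE", "ITiCSE", "ICSEA"}) {
            System.out.println(venue + " -> " + matches(venue, "ICSE"));
        }
    }
}
```

With this sketch, "ICSE" and "USER@ICSE" match the query "ICSE", while "ITiCSE" and "ICSEA" do not. A query that itself contains whitespace or punctuation would never match a single word, so multi-word venue queries would need extra handling.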

Zhiyuan-Amos commented 7 years ago

How bad is this issue given the current dataset? Got some examples to show?

IanTeo commented 7 years ago

I think we are getting around 20% extra venues, but it depends on the venue name. ICSE is a huge offender, but I don't think arXiv has any problems.

IanTeo commented 7 years ago

We actually do not need to do anything about it, if we prefer.

That is because we can easily add an "ignore words" filter. For example, the user can say they want to filter venue by ICSE. They then get this set of Papers (represented here by their venues): { "ian@ICSE", "ITiCSE", "ICSEA", "ICSE", "ICSE", "ICSE" }

Once the user realizes that there are incorrect venues, they can specify remove: "ITiCSE", "ICSEA". We then end up with this set: { "ian@ICSE", "ICSE", "ICSE", "ICSE" }
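The filter-then-remove flow above could look roughly like the sketch below. The class VenueRemoveFilter and its filter method are hypothetical names for illustration, and the sketch assumes that removed venues are compared by exact, case-sensitive name, which this issue does not specify.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the "remove" idea, not the project's actual implementation.
class VenueRemoveFilter {
    /** Keeps venues that contain query (ignoring case), then drops venues listed in removed. */
    static List<String> filter(List<String> venues, String query, Set<String> removed) {
        String needle = query.toLowerCase(Locale.ROOT);
        return venues.stream()
                .filter(v -> v.toLowerCase(Locale.ROOT).contains(needle)) // current .contains behavior
                .filter(v -> !removed.contains(v))                         // user-specified removals
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> venues = Arrays.asList(
                "ian@ICSE", "ITiCSE", "ICSEA", "ICSE", "ICSE", "ICSE", "arXiv");
        Set<String> removed = new HashSet<>(Arrays.asList("ITiCSE", "ICSEA"));
        System.out.println(filter(venues, "ICSE", removed));
    }
}
```

The trade-off is that the user has to spot and list the wrong venues, but the approach covers cases that no fixed matching rule would anticipate.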

IanTeo commented 7 years ago

We will have to provide such a feature, because we definitely cannot account for all cases. And I think this can be covered under the "remove" feature.