IanTeo closed this 7 years ago
How bad is this issue with the current dataset? Do you have some examples to show?
I think we are getting around 20% extra venues, but it depends on the venue name. ICSE is a huge offender, for example, but I don't think arXiv has any problems.
We do not actually need to do anything about it, if we prefer not to.
That is because we can easily add an "ignore words" filter. For example, the user says they want to filter venue by ICSE and gets back this set of papers (represented here by their venues): { "ian@ICSE", "ITiCSE", "ICSEA", "ICSE", "ICSE", "ICSE" }
Once they realize that some of the venues are incorrect, they can specify remove: "ITiCSE", "ICSEA". We then end up with this set: { "ian@ICSE", "ICSE", "ICSE", "ICSE" }
We will have to provide such a feature, because we definitely cannot account for all cases. And I think this can be covered under the "remove" feature.
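To make the filter-then-remove flow concrete, here is a rough sketch in Java. The class and method names (`VenueFilterSketch`, `filterByVenue`, `remove`) are made up for illustration and are not the project's actual code; it also assumes each paper can be represented by its venue string, as in the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch only: names and structure are assumptions, not the real commands.
public class VenueFilterSketch {

    /** Keeps venues that contain the keyword, ignoring case (the loose .contains behavior). */
    public static List<String> filterByVenue(List<String> venues, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (String venue : venues) {
            if (venue.toLowerCase(Locale.ROOT).contains(needle)) {
                result.add(venue);
            }
        }
        return result;
    }

    /** Drops every venue that exactly equals one of the unwanted names, ignoring case. */
    public static List<String> remove(List<String> venues, List<String> unwanted) {
        List<String> result = new ArrayList<>();
        for (String venue : venues) {
            boolean drop = false;
            for (String name : unwanted) {
                if (venue.equalsIgnoreCase(name)) {
                    drop = true;
                    break;
                }
            }
            if (!drop) {
                result.add(venue);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList(
                "ian@ICSE", "ITiCSE", "ICSEA", "ICSE", "ICSE", "ICSE", "arXiv");
        // Step 1: the user filters by "ICSE" and gets some wrong venues back.
        List<String> filtered = filterByVenue(all, "ICSE");
        // Step 2: the user removes the wrong venues by name.
        List<String> cleaned = remove(filtered, Arrays.asList("ITiCSE", "ICSEA"));
        System.out.println(filtered);
        System.out.println(cleaned);
    }
}
```

The downside of this approach is that the user has to spot and list every wrong venue by hand, so it works best as a fallback for cases the matching itself cannot handle.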
When using the venue command to retrieve a venue such as "ICSE":
.contains matching returns entries like "ITiCSE" and "ICSEA", which are obviously not the correct venue, while .equals matching misses entries like "USER@ICSE", which obviously is the correct venue.
(Both already ignore case.)
I propose we split the venue on whitespace and punctuation, [\s\p{Punct}]+, and look through the resulting words for an exact match. That way, we cover both of the cases I discovered. If anyone discovers any other weird behavior in VenueCommand, please add it to this issue.
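A minimal sketch of the proposed matching, assuming the venue is available as a plain string. `VenueMatcher` and `matchesVenue` are hypothetical names, not the actual VenueCommand code. Note that in Java, `\p{Punct}` covers ASCII punctuation only by default, which includes `@`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch only: not the actual VenueCommand implementation.
public class VenueMatcher {

    // Split on runs of whitespace and ASCII punctuation, as proposed above.
    private static final Pattern SEPARATORS = Pattern.compile("[\\s\\p{Punct}]+");

    /** Returns true if any word of the venue equals the keyword, ignoring case. */
    public static boolean matchesVenue(String venue, String keyword) {
        for (String token : SEPARATORS.split(venue)) {
            if (token.equalsIgnoreCase(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> venues = Arrays.asList("ian@ICSE", "ITiCSE", "ICSEA", "ICSE");
        for (String venue : venues) {
            System.out.println(venue + " -> " + matchesVenue(venue, "ICSE"));
        }
    }
}
```

With this, "ian@ICSE" and "ICSE" match the keyword "ICSE", while "ITiCSE" and "ICSEA" do not, because neither splits into a word that equals "ICSE".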