Closed nrjones8 closed 10 years ago
Problem indeed! I think here is where we probably need to make a change...
Hmmm.... So a little more research suggests that our tool might already be taking care of this issue (probably in the nltk tokenization code). From within FeatureExtractor I did some prints, and
print self.text[self.word_spans[0][0]:self.word_spans[0][1]]
produces Mr. as we would expect on document part1/suspicious-document01078.txt.
I was working on the frontend and the indices of plagiarized spans appeared to be a little bit off (by a few characters). Looks like we're reading in a '\xef\xbb\xbf' at the start of the files in our corpus, causing indexing to be off by 3 characters. Looking at our new favorite plagiarism in intrinsic (i.e.
part1/suspicious-document01078.txt
) and its .xml file, we can see that its first case of plag starts at 1396 and goes for 272 characters. However, when reading that file:
I haven't checked many (only a few) yet, but this looks like a problem across everything in the corpus. Looks like it's an encoding issue:
http://stackoverflow.com/questions/12561063/python-extract-data-from-file/12561163#12561163
http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
but something we need to deal with!
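One possible way to deal with it, as a rough sketch rather than what our code actually does: strip the UTF-8 byte order mark when reading, either by hand or by decoding with Python's 'utf-8-sig' codec. The helper names strip_utf8_bom and read_text_without_bom and the sample text are made up for illustration. Whether the offsets in the .xml files count the BOM or not is an assumption we'd still need to check against the corpus.

```python
import codecs
import io


def strip_utf8_bom(raw_bytes):
    """Return raw_bytes without a leading UTF-8 byte order mark, if present."""
    if raw_bytes.startswith(codecs.BOM_UTF8):
        return raw_bytes[len(codecs.BOM_UTF8):]
    return raw_bytes


def read_text_without_bom(path):
    """Read a UTF-8 file as text; the 'utf-8-sig' codec drops a leading BOM."""
    with io.open(path, encoding="utf-8-sig") as f:
        return f.read()


# Hypothetical file contents: a BOM followed by some text.
sample = codecs.BOM_UTF8 + b"Mr. Smith went home."
cleaned = strip_utf8_bom(sample)

# Number of bytes the BOM occupied, i.e. how far byte indices get shifted.
bom_shift = len(sample) - len(cleaned)
```

If the annotation offsets turn out to be relative to the text without the BOM, reading every corpus file through something like read_text_without_bom should make the span indices line up again without touching the .xml files.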