reading corpus files/indexing problems

nrjones8 commented 10 years ago

I was working on the frontend and the indices of plagiarized spans appeared to be a little bit off (by a few characters). Looks like we're reading in a '\xef\xbb\xbf' (the UTF-8 byte order mark) at the start of the files in our corpus, causing indexing to be off by 3 characters. Looking at our new favorite plagiarized document in intrinsic (i.e. part1/suspicious-document01078.txt) and its .xml file, we can see that its first case of plagiarism starts at 1396 and runs for 272 characters. However, when reading that file:

>>> f = file('suspicious-document01078.txt', 'r')
>>> text = f.read()
>>> f.close()
>>> text[1396: 1396 + 272]
't. Soon after\nthe last livre was spent, De la Salle had occasion to make a journey\nin connection with his work. He went on foot, as needs he must, and\nbegged his way. An old woman gave him a piece of black bread; he\nate it with joy, feeling that now he was indeed a poor m'
>>> text[1396 + 3: 1396 + 272 + 3]
'Soon after\nthe last livre was spent, De la Salle had occasion to make a journey\nin connection with his work. He went on foot, as needs he must, and\nbegged his way. An old woman gave him a piece of black bread; he\nate it with joy, feeling that now he was indeed a poor man.'

I haven't checked many (only a few) yet, but this looks like a problem across everything in the corpus. Looks like it's an encoding issue:

http://stackoverflow.com/questions/12561063/python-extract-data-from-file/12561163#12561163
http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

but something we need to deal with!
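
For what it's worth, here is a minimal sketch of one way to deal with it. It assumes the .xml offsets count from the first byte after the byte order mark, which the example above suggests but doesn't prove for the whole corpus. The names strip_utf8_bom and read_corpus_file are made up for illustration and aren't in our code:

```python
import codecs


def strip_utf8_bom(raw):
    """Return the byte string without a leading UTF-8 byte order mark."""
    # codecs.BOM_UTF8 is the three bytes '\xef\xbb\xbf'.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


def read_corpus_file(path):
    """Hypothetical helper: read a corpus file as bytes, minus any BOM."""
    # Binary mode keeps byte offsets intact, so slicing matches the
    # off-by-3 behaviour seen in the session above.
    with open(path, 'rb') as f:
        return strip_utf8_bom(f.read())
```

With something like this, text[1396: 1396 + 272] on the returned bytes should line up with the span in the .xml file, provided the assumption about the offsets holds.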

NoahCarnahan commented 10 years ago

Problem indeed! I think here is where we probably need to make a change...

NoahCarnahan commented 10 years ago

Hmmm.... So a little more research suggests that our tool might already be taking care of this issue (probably in the nltk tokenization code). From within FeatureExtractor I added some prints, and print self.text[self.word_spans[0][0]:self.word_spans[0][1]] produces Mr., as we would expect, on document part1/suspicious-document01078.txt.
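
One caveat, offered tentatively: printing a string that starts with the byte order mark to a UTF-8 terminal usually shows nothing for the mark, so seeing Mr. in print output doesn't by itself rule out the three extra bytes. A small check along these lines looks at the raw bytes instead (has_utf8_bom is a hypothetical helper, not part of FeatureExtractor):

```python
import codecs


def has_utf8_bom(raw):
    """True if the byte string starts with the UTF-8 byte order mark."""
    return raw.startswith(codecs.BOM_UTF8)


# Stand-in for the first token of a file that was read with its BOM intact.
token = codecs.BOM_UTF8 + b'Mr.'

print(has_utf8_bom(token))  # checks the bytes, not what the terminal renders
print(len(token))           # byte length, including the three BOM bytes
print(repr(token))          # repr escapes the BOM bytes instead of hiding them
```

Running the same check on the first word span of self.text would tell us whether the spans really start after the mark.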

NoahCarnahan / plagcomps

reading corpus files/indexing problems #17