Wordseer / wordseer

The WordSeer text analysis tool, written in Flask.
http://wordseer.berkeley.edu/

More sequence accuracy concerns #156

Closed by keien 10 years ago

keien commented 10 years ago

Sentence: I would love to do it again.

Sequences missing in new db:

I have no idea why this happens.

keien commented 10 years ago

Here's a similar phenomenon where single-word sequences are missing (this one is from the tweets dataset so I can't cross-check with an old SQL dump):

>>> d
<Sentence: Vid: Rep. Debbie Wasserman Schultz [#FL20]: Discussing the GOP's Pledge for America on CNN.wmv http://bit.ly/dc8u6b #tcot #p2>
>>> for se in d.sequences: print se
... 
<Sequence Vid>
<Sequence Vid :>
<Sequence Vid : Rep.>
<Sequence Vid Rep.>
<Sequence Vid : Rep. Debbie>
<Sequence Vid Rep. Debbie>
<Sequence : Rep.>
<Sequence : Rep. Debbie>
<Sequence : Rep. Debbie Wasserman>
<Sequence Wasserman Schultz -LSB- #FL>
<Sequence Wasserman Schultz -lsb- #FL>
<Sequence Schultz -LSB- #FL>
<Sequence Schultz -lsb- #FL>
<Sequence Schultz -LSB- #FL 20>
<Sequence Schultz -lsb- #FL 20>
<Sequence -LSB- #FL>
<Sequence -lsb- #FL>
<Sequence -LSB- #FL 20>
<Sequence -lsb- #FL 20>
<Sequence -LSB- #FL 20 -RSB->
<Sequence -lsb- #FL 20 -rsb->
<Sequence #FL>
<Sequence #FL 20>
<Sequence #FL 20 -RSB->
<Sequence #FL 20 -rsb->
<Sequence #FL 20 -RSB- :>
<Sequence #FL 20 -rsb- :>
<Sequence 20 -RSB->
<Sequence 20 -rsb->
<Sequence 20 -RSB- :>
<Sequence 20 -rsb- :>
<Sequence 20 -RSB- : Discussing>
<Sequence 20 -RSB- Discussing>
<Sequence 20 -rsb- : discuss>
<Sequence 20 -rsb- discuss>
<Sequence -RSB- :>
<Sequence -rsb- :>
<Sequence -RSB- : Discussing>
<Sequence -RSB- Discussing>
<Sequence -rsb- : discuss>
<Sequence -rsb- discuss>
<Sequence -RSB- : Discussing the>
<Sequence -rsb- : discuss the>
<Sequence : Discussing>
<Sequence : discuss>
<Sequence : Discussing the>
<Sequence : discuss the>
<Sequence : Discussing the GOP>
<Sequence : discuss the GOP>
<Sequence Discussing>
<Sequence discuss>
<Sequence Discussing the>
<Sequence discuss the>
<Sequence Discussing the GOP>
<Sequence Discussing GOP>
<Sequence discuss the GOP>
<Sequence discuss GOP>
<Sequence Discussing the GOP 's>
<Sequence discuss the GOP 's>
<Sequence the GOP>
<Sequence the GOP 's>
<Sequence the GOP 's Pledge>
<Sequence GOP 's>
<Sequence GOP 's Pledge>
<Sequence GOP Pledge>
<Sequence GOP 's Pledge for>
<Sequence 's Pledge>
<Sequence 's Pledge for>
<Sequence 's Pledge for America>
<Sequence Pledge>
<Sequence Pledge for>
<Sequence Pledge for America>
<Sequence Pledge America>
<Sequence Pledge for America on>
<Sequence for America>
<Sequence for America on>
<Sequence for America on CNN.wmv>
<Sequence America on>
<Sequence America on CNN.wmv>
<Sequence America CNN.wmv>
<Sequence America on CNN.wmv http:\/\/bit.ly\/dc8u6b>
<Sequence America CNN.wmv http:\/\/bit.ly\/dc8u6b>
<Sequence on CNN.wmv>
<Sequence on CNN.wmv http:\/\/bit.ly\/dc8u6b>
<Sequence on CNN.wmv http:\/\/bit.ly\/dc8u6b #tcot>
<Sequence CNN.wmv>
<Sequence CNN.wmv http:\/\/bit.ly\/dc8u6b>
<Sequence CNN.wmv http:\/\/bit.ly\/dc8u6b #tcot>
<Sequence CNN.wmv http:\/\/bit.ly\/dc8u6b #tcot #p>
<Sequence http:\/\/bit.ly\/dc8u6b>
<Sequence http:\/\/bit.ly\/dc8u6b #tcot>
<Sequence http:\/\/bit.ly\/dc8u6b #tcot #p>
<Sequence http:\/\/bit.ly\/dc8u6b #tcot #p 2>
<Sequence #tcot #p>
<Sequence #tcot #p 2>

Is this supposed to happen?

abendebury commented 10 years ago

Do we ever get single word sequences?

keien commented 10 years ago

I think we're supposed to, no? I thought that for every word in the sentence, there should be a sequence for it and the following three words.
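To make that expectation concrete, here is a minimal sketch of the rule as described in this comment: every word starts a sequence of itself alone, then itself plus the next one, two, and three words. The function name `expected_sequences`, the `max_len` parameter, and the tokenization of the example sentence are assumptions for illustration, not WordSeer's actual code. The dump above also appears to include lemmatized variants and variants that skip some tokens, which this sketch does not model.

```python
def expected_sequences(words, max_len=4):
    """Return every run of 1..max_len consecutive words, for each start position."""
    sequences = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(words):
                # Not enough words left in the sentence for this length.
                break
            sequences.append(" ".join(words[start:end]))
    return sequences


# Assumed tokenization of the sentence from the first comment.
words = ["I", "would", "love", "to", "do", "it", "again", "."]
seqs = expected_sequences(words)
print(len(seqs))
print(seqs[:4])
```

Under this rule every token, including the last one, should appear at least once as a one-word sequence, which is the property the REPL dump above violates.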

abendebury commented 10 years ago

We are. I'm curious whether we never get one-word sequences, or whether they only go missing under certain circumstances.

keien commented 10 years ago

We do get one-word sequences, but they don't seem to generate for all the sentences that should have them.

keien commented 10 years ago

I think I found the culprit - this line is supposed to be outside the if block. I'm rerunning personals to see if it solves the problem.
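For readers without access to the linked line, here is a purely hypothetical sketch of how a statement sitting inside an `if` block can produce this symptom. It is not the actual WordSeer code. It assumes a cache of sequence records shared across sentences, with the sentence-to-sequence link accidentally made only when a sequence is new. All names (`link_sequences`, `cache`, `links`, the `buggy` flag) are invented for illustration.

```python
def link_sequences(sentences, max_len=4, buggy=False):
    """Map each sentence index to the list of sequences attached to it."""
    cache = {}  # sequence text -> shared record (stand-in for a database row)
    links = {i: [] for i in range(len(sentences))}
    for i, words in enumerate(sentences):
        for start in range(len(words)):
            for length in range(1, min(max_len, len(words) - start) + 1):
                text = " ".join(words[start:start + length])
                if text not in cache:
                    cache[text] = {"text": text}
                    if buggy:
                        # Misplaced: the link is made only for brand-new sequences.
                        links[i].append(text)
                if not buggy:
                    # Intended: link every sequence, whether new or already cached.
                    links[i].append(text)
    return links


sentences = [
    ["I", "love", "it"],
    ["I", "would", "love", "to", "do", "it", "again", "."],
]
with_bug = link_sequences(sentences, buggy=True)
without_bug = link_sequences(sentences, buggy=False)

# One-word sequences the second sentence loses in the buggy variant.
missing = [s for s in without_bug[1] if s not in with_bug[1]]
print(missing)
```

In this sketch the first sentence is unaffected, while later sentences lose exactly the sequences that an earlier sentence already introduced. Short, common sequences are hit most often, which would be consistent with one-word sequences going missing for only some sentences.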

abendebury commented 10 years ago

Was that it?

keien commented 10 years ago

yep, looks like sequences are perfect now according to the accuracy checks