colinpollock / seinfeld-scripts

Scripts for parsing Seinfeld scripts
http://colinpollock.net/seinfeld-script-data

Season 8 Episode 8 does not parse and load into the DB #2

Closed luzer closed 8 years ago

luzer commented 8 years ago

[love your work!]

still researching the cause... might be related to the fact that this episode was 'corrected'

in file 142.shtml, after parsing, only 6 lines of data appear for this query:

select * from sentence s, utterance u, episode e where s.utterance_id = u.id and u.episode_id = e.id and season_number = 8 and episode_number = 8;
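A quick way to check whether other episodes are affected is to count sentences per episode and look at the smallest counts. The sketch below is only an illustration: it takes the table and column names from the query above (`sentence.utterance_id`, `utterance.id`, `utterance.episode_id`, `episode.id`, `season_number`, `episode_number`), and everything else, including the `text` column and the tiny in-memory demo data, is an assumption rather than the repo's actual schema.

```python
import sqlite3


def sentence_counts(conn):
    """Return (season, episode, sentence_count) rows, smallest counts first."""
    query = """
        SELECT e.season_number, e.episode_number,
               COUNT(s.utterance_id) AS n_sentences
        FROM episode e
        LEFT JOIN utterance u ON u.episode_id = e.id
        LEFT JOIN sentence s ON s.utterance_id = u.id
        GROUP BY e.id, e.season_number, e.episode_number
        ORDER BY n_sentences, e.season_number, e.episode_number
    """
    return conn.execute(query).fetchall()


def build_demo_db():
    """Build a throwaway in-memory DB (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE episode (id INTEGER PRIMARY KEY,
                              season_number INTEGER, episode_number INTEGER);
        CREATE TABLE utterance (id INTEGER PRIMARY KEY, episode_id INTEGER);
        CREATE TABLE sentence (utterance_id INTEGER, text TEXT);
        INSERT INTO episode VALUES (1, 8, 7), (2, 8, 8), (3, 8, 9);
        INSERT INTO utterance VALUES (1, 1), (2, 1), (3, 2);
        INSERT INTO sentence VALUES (1, 'a'), (1, 'b'), (2, 'c'), (3, 'd');
    """)
    return conn


if __name__ == "__main__":
    # Episodes with suspiciously few sentences show up at the top.
    for season, episode, n in sentence_counts(build_demo_db()):
        print(f"S{season}E{episode}: {n} sentences")
```

The `LEFT JOIN`s keep episodes that have no utterances at all, so an episode that failed to parse entirely still appears, with a count of 0. Against a real database, pass `sqlite3.connect("path/to/your.db")` instead of the demo connection.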

[screenshot, 2016-08-19, 9:26:30 am]

[screenshot, 2016-08-19, 8:25:38 am]

colinpollock commented 8 years ago

Hey @luzer, glad you found it useful! Just curious: what are you using the script data for?

I checked an old sqlite db I had generated, and the problem you identified is there, so this appears to have always been a bug. Thanks for creating the issue! Did you find a fix for this? I can take a look, but probably not in the next week.

luzer commented 8 years ago

@colinpollock love this repo! i didn't find a fix yet. i went back in time to see if i could find a 'clean' HTML file for this episode, but could not. something in the HTML is malformed. https://web.archive.org/web/*/www.seinfeldscripts.com

i migrated it to Postgres and am doing sentiment analysis with it, combined with GOIM (https://github.com/BurntSushi/goim), putting it into Tableau.

see details here (http://www.tableaumeaway.com/seinfeld-sentiment-analysis-tableau-v10/)

i tried to tweet you, but have gotten the wrong colin pollock...

luzer commented 8 years ago

i saw a similar project from @mattniedelman (https://github.com/mattniedelman/seinfeld/blob/master/scraper.ipynb) that uses a different script file, which might work: http://www.seinology.com/scripts/script-142.shtml vs http://www.seinfeldscripts.com/TheChickenRoaster.htm (which does not even render)
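To judge whether a script page is usable despite malformed markup, one option is to pull the text out with a lenient parser and count the speaker lines that come back. The following is a minimal sketch using only the Python standard library, not the parser this repo uses. It assumes dialogue appears as `SPEAKER: text` on its own line with the speaker in capitals, which may not hold for every page; the names `TextExtractor` and `extract_utterances` are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text; unclosed or mismatched tags are tolerated."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("br", "p"):
            self.chunks.append("\n")  # treat these tags as line breaks

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


# Assumed line format: an upper-case speaker name, a colon, then the dialogue.
SPEAKER_LINE = re.compile(r"^\s*([A-Z][A-Z .'-]*):\s*(.+?)\s*$")


def extract_utterances(html):
    """Return a list of (speaker, text) pairs found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    utterances = []
    for line in text.splitlines():
        match = SPEAKER_LINE.match(line)
        if match:
            utterances.append((match.group(1).strip(), match.group(2)))
    return utterances
```

If a page yields only a handful of pairs where a full episode should give hundreds, the markup (or the assumed line format) is the likely culprit for that file.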

luzer commented 8 years ago

thanks! any way i can just load the missing ep?