RMHogervorst / cleancode

simple blog about cleaner code and starting with R
https://rmhogervorst.github.io/cleancode/
MIT License

Scrape every transcript from GRC - Security Now! #55

Open RMHogervorst opened 7 years ago

RMHogervorst commented 7 years ago

Steve Gibson's Security Now! podcast has excellent transcripts.

https://www.grc.com/securitynow.htm

The files are very clearly marked and should be easy to scrape with rvest.

https://www.grc.com/sn/past/2005.htm
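
A minimal sketch of collecting the transcript links from one of the yearly pages with rvest. It assumes the links on that page end in `sn-NNN.txt`, like the example URL in this issue; the page markup has not been checked, and the function names `filter_transcript_links` and `list_transcript_urls` are made up for illustration.

```r
# Keep only URLs that look like transcript files, e.g. ".../sn/sn-020.txt".
# The file-name pattern is an assumption based on one example URL.
filter_transcript_links <- function(urls) {
  urls <- urls[!is.na(urls)]
  unique(grep("/sn-[0-9]+\\.txt$", urls, ignore.case = TRUE, value = TRUE))
}

# Fetch one yearly index page and return the absolute transcript URLs.
# Needs the rvest and xml2 packages and network access.
list_transcript_urls <- function(year_url) {
  page  <- rvest::read_html(year_url)
  hrefs <- rvest::html_attr(rvest::html_elements(page, "a"), "href")
  hrefs <- hrefs[!is.na(hrefs)]
  filter_transcript_links(xml2::url_absolute(hrefs, year_url))
}

# Usage (not run here, because it hits the network):
# urls_2005 <- list_transcript_urls("https://www.grc.com/sn/past/2005.htm")
```

Keeping the filter separate from the download makes it possible to check the pattern on a handful of URLs before scraping every year.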

Like the TNG project, we can use the top part of the file for metadata. We can then use the transcript itself to extract the sentences per person and put everything into one data frame.

Example: https://www.grc.com/sn/sn-020.txt

GIBSON RESEARCH CORPORATION http://www.GRC.com/

SERIES:     Security Now!
EPISODE:        #20
DATE:       December 29, 2005
TITLE:      A SERIOUS new Windows vulnerability - and Listener Q&A #2
SPEAKERS:   Steve Gibson & Leo Laporte
SOURCE FILE:    http://media.GRC.com/sn/SN-020.mp3
FILE ARCHIVE:   http://www.GRC.com/securitynow.htm

DESCRIPTION:  On December 28th a serious new Windows vulnerability appeared and was immediately exploited by a growing number of malicious web sites to install malware.  Many worse viruses and worms are expected soon.  We start off discussing this, and our show notes provide a quick necessary workaround until Microsoft provides a patch.  Then we spend the next 45 minutes answering and discussing interesting listener questions.

LEO LAPORTE:  This is Security Now! with Steve Gibson, Episode 20, for December 29, 2005.

STEVE GIBSON:  Last episode of this year.

LEO:  The last episode of 2005.  And we've done 20 of them.

STEVE:  Yeah.
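
A rough sketch of how the excerpt above could be split into metadata and one row per speaker turn, using base R only. The `KEY: value` header pattern and the `SPEAKER:  text` turn pattern are inferred from this single excerpt, so other episodes may need adjustments; `parse_transcript` is a hypothetical helper name.

```r
# Split a transcript (character vector of lines) into header metadata and
# a data frame with one row per speaker turn.
parse_transcript <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]

  # Header fields look like "EPISODE:        #20" (keys taken from the excerpt)
  header_keys <- c("SERIES", "EPISODE", "DATE", "TITLE", "SPEAKERS",
                   "SOURCE FILE", "FILE ARCHIVE", "DESCRIPTION")
  key_pattern <- paste0("^(", paste(header_keys, collapse = "|"), "):\\s*(.*)$")
  is_header <- grepl(key_pattern, lines, perl = TRUE)
  metadata <- setNames(
    as.list(sub(key_pattern, "\\2", lines[is_header], perl = TRUE)),
    sub(key_pattern, "\\1", lines[is_header], perl = TRUE)
  )

  # Speaker turns look like "LEO:  text" or "STEVE GIBSON:  text".
  # Paragraphs without a speaker prefix are dropped in this sketch.
  turn_pattern <- "^([A-Z][A-Z .'-]*):\\s+(.*)$"
  is_turn <- grepl(turn_pattern, lines, perl = TRUE) & !is_header
  episode <- if (is.null(metadata$EPISODE)) NA_character_ else metadata$EPISODE
  turns <- data.frame(
    episode = rep(episode, sum(is_turn)),
    speaker = sub(turn_pattern, "\\1", lines[is_turn], perl = TRUE),
    text    = sub(turn_pattern, "\\2", lines[is_turn], perl = TRUE),
    stringsAsFactors = FALSE
  )
  # Reduce "LEO LAPORTE" to "LEO" so full and short names match
  # (assumes first names are unique within an episode).
  turns$speaker <- sub(" .*$", "", turns$speaker)

  list(metadata = metadata, turns = turns)
}

# Usage (not run here): parse_transcript(readLines("https://www.grc.com/sn/sn-020.txt"))
```

Calling `rbind` on the `turns` element of every parsed episode would then give the single data frame mentioned above.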

RMHogervorst commented 7 years ago

We could create a bot that talks like Steve (using #54 with the https://github.com/abresler/markovifyR package), a topic model, or word2vec.

Theme detection based on the transcripts.

Easy things like sentiment analysis.

Network analysis of words.
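
To make the bot idea concrete, here is a small first-order, word-level Markov chain in base R. This is a stand-in to show the principle and does not use the markovifyR API; `build_chain` and `generate_sentence` are hypothetical names, and the two training sentences are Steve's lines from the excerpt in this issue.

```r
# For every word, remember which words followed it in the training sentences.
build_chain <- function(sentences) {
  chain <- list()
  for (s in sentences) {
    words <- strsplit(trimws(s), "\\s+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) == 0) next
    tokens <- c("<START>", words, "<END>")
    for (i in seq_len(length(tokens) - 1)) {
      key <- tokens[i]
      chain[[key]] <- c(chain[[key]], tokens[i + 1])
    }
  }
  chain
}

# Walk the chain from <START>, picking a random successor at each step,
# until <END> is drawn or max_words is reached.
generate_sentence <- function(chain, max_words = 30) {
  out <- character(0)
  current <- "<START>"
  while (length(out) < max_words) {
    candidates <- chain[[current]]
    if (is.null(candidates)) break
    current <- candidates[sample.int(length(candidates), 1)]
    if (current == "<END>") break
    out <- c(out, current)
  }
  paste(out, collapse = " ")
}

steve_lines <- c("Last episode of this year.", "Yeah.")
chain <- build_chain(steve_lines)
generate_sentence(chain)
```

With only two short sentences the chain can do nothing but reproduce its input; it only starts to produce new sentences once it is trained on many lines that share words.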