fitnr / twitter_markov

Create markov chain ("_ebooks") accounts on Twitter
GNU General Public License v3.0

Corpus updates and learning [questions] #11

Closed: Reapette closed this issue 6 years ago

Reapette commented 6 years ago

Hi! Two questions:

1) when I want to manually update the corpus, would just plain old adding lines of text work fine "out of the box", or would I need to carry out some re-learn/update process?

2) there is a feature that lets the bot keep learning from another Twitter account.

Documentation states:

```yaml
# If you want your bot to continue to learn, include this
parent: your_screen_name
```

How does the learning "work"? Do tweets from that account get gradually added to the txt file specified in the config as the corpus, or do they live someplace else? How many of the parent's tweets are added per cron run?

fitnr commented 6 years ago
  1. Yes, you can manually change the text file whenever you want.
  2. Learning works by reading the tweets of the parent account and adding them to the corpus text file. Learning happens whenever the command line tool is run, assuming that a parent account has been set in the config file. Learning can be disabled with the `--no-learn` option.
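
For illustration, a config excerpt with learning enabled might look roughly like this; the screen names and file names are placeholders, and the exact layout should be checked against the project's README:

```yaml
users:
  example_ebooks:          # hypothetical bot account
    corpus: corpus.txt     # text file the model is built from
    # If you want your bot to continue to learn, include this
    parent: example_screen_name
```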

Reapette commented 6 years ago

@fitnr Thanks! A couple of questions about learning, then:

1) How deep does it "go" into the parent account each time the command is run? Does it grab the last 10 tweets? The last 20?

2) Also, I reckon specifying the bot's own account as "parent" would cause it to suck its own tweets into the corpus, resulting in a slow degradation of quality (with a possible increase in fun). Would that be correct?

fitnr commented 6 years ago
  1. It looks back since the last time the markov account tweeted. The assumption is that the learning step happens every time the markov tweets.
  2. The part about the quality is conceptually correct, except that the way it's set up, the learning step would never read anything. Another tool that can regularly read the tweets from the markov account and append them to the corpus file would work; maybe check out twurl. You could do a daily cron task like `twurl '/1.1/statuses/user_timeline.json?count=N' | jq -r '.[].text' >> corpus.txt`, where N is the number of tweets the markov account makes a day, and jq is a command-line JSON parser.
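
As a hypothetical crontab entry (the path and schedule are placeholders, and twurl is assumed to already be authorized as the markov account):

```sh
# Run once a day at midnight; replace N with the bot's daily tweet count.
0 0 * * * twurl '/1.1/statuses/user_timeline.json?count=N' | jq -r '.[].text' >> /path/to/corpus.txt
```
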
Reapette commented 6 years ago

@fitnr Thanks!

Two more questions (hopefully not very dumb :) )

When a config has more than one corpus specified, does it (from the point of view of the markov chain formation process) merge them into a single one, as if it were one file?

Does the order of lines in the file practically affect the way the bot treats them? That is, would changing the order of lines in the corpus(es) affect the probability of it coming up with a particular phrase, given the same state size?

fitnr commented 6 years ago
  1. If corpus is a list, the bot will read from all of the files listed.
  2. Markov chains randomly recombine text, so the order of lines in a corpus should be immaterial. For questions about the specifics of the Markov implementation used here, see Markovify.
fitnr commented 6 years ago

@Reapette correction: You can specify multiple texts, which will create multiple models, and you can choose to create texts from any of them (perhaps randomly). To create a combined corpus, just create a file that combines all the texts.
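
To illustrate that last point, here is a minimal sketch using Markovify directly, the library this project builds on; the file names and state size are placeholders:

```python
import markovify

# Merge two corpora into one training text, as if they were one file.
with open("text1.txt") as f1, open("text2.txt") as f2:
    combined = f1.read() + "\n" + f2.read()

# One model over the combined corpus. NewlineText treats each line as
# a separate "sentence", which suits a one-tweet-per-line corpus.
model = markovify.NewlineText(combined, state_size=2)
print(model.make_short_sentence(140))
```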

Reapette commented 6 years ago

@fitnr Thanks for explaining.

So, specifying two texts instead of one will not cause it to create one model for one big corpus that is "text1.txt + text2.txt", but will create two models for two separate corpuses?

If that's so, having a "treat all text files as one giant corpus, create one big model" mode would be a great enhancement.

It would allow creating a bot that has one huge "main" corpus updated incrementally by adding material to a small separate file, which is IMHO more manageable than appending to an already huge file.

fitnr commented 6 years ago

I'm not sure that keeping track of many text files is clearly easier than one text file. And since there are many, many ways to create one file on the fly, it doesn't seem like a pressing need. Why not just have a daily or hourly cron job that cats all the source files into a mega-corpus.txt? You can read from that with `--no-learn`, then run the script to learn (but not tweet) into the smaller file.
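
A hypothetical cron job along those lines (the file names and schedule are placeholders):

```sh
# Hourly: rebuild the combined corpus from the big main file plus the
# small file that gets the incremental additions.
0 * * * * cat main-corpus.txt additions.txt > mega-corpus.txt
```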

One change that I've already added to HEAD is to allow the bot to read from any file-like object, so using the Python API would let you write a script that reads from an arbitrary source.
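
A rough sketch of what that might enable; note that the class and method names here (TwitterMarkov, compose) are assumptions based on this thread, not a documented interface:

```python
import io

# Assumed import path and class name; check the package source on HEAD.
from twitter_markov import TwitterMarkov

# Any file-like object can serve as the corpus, e.g. an in-memory
# buffer filled from an arbitrary source (a database, an API, etc.).
corpus = io.StringIO("first line of text\nsecond line of text\n")

tm = TwitterMarkov('example_screen_name', corpus)  # hypothetical call
print(tm.compose())                                # hypothetical method
```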