effectiveaccelerationism / text-to-banger

A simple API converting a user's proposed tweet into a veritable banger.
GNU General Public License v3.0

Data #3

Open Bradley-Butcher opened 11 months ago

Bradley-Butcher commented 11 months ago

How can we scrape bangers for LM training? What sort of criteria are we going for?

My take would be an initial like/follower ratio filter, then topic filtering, plus some sentiment analysis along the edgy / meme / up-to-date-commentary Pareto front.

Anyone got preemo twitter API access? Let's get Elon on board.

Prob easier to train a CLM to avoid having to obtain (not banger, banger) pairs.

If we're training a CLM, we should probably include negative examples too, with special prefix tokens (one for banger, one for not-banger) to avoid LLM bs. Can use my tweets for the negative bangers.
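A minimal sketch of the prefix idea, assuming plain-text tags prepended to each training example. The tag strings `<banger>` and `<not-banger>` and the helper `to_clm_example` are hypothetical; the actual token names aren't recoverable from the thread.

```python
# Hypothetical prefix tokens; the thread does not show the real ones.
BANGER_TAG = "<banger>"
NOT_BANGER_TAG = "<not-banger>"


def to_clm_example(tweet: str, is_banger: bool) -> str:
    """Prepend the class tag so a causal LM can be conditioned on it."""
    tag = BANGER_TAG if is_banger else NOT_BANGER_TAG
    return f"{tag} {tweet.strip()}"


examples = [
    to_clm_example("placeholder banger text", is_banger=True),
    to_clm_example("  placeholder boring text ", is_banger=False),
]
```

At generation time the prompt would then start with the banger tag, so the model continues in banger mode rather than averaging over both classes.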

Should probably engineer some semi-online LoRA to ensure the LM maintains live bangers

martinshkreli commented 11 months ago

yes. all of this. agreed.

Imagineer99 commented 11 months ago

Let's just scrape @goth600's account

codethazine commented 11 months ago

Hey @Bradley-Butcher this is what I was thinking:

  1. Scrape from one's personal Twitter. Currently using my own for testing, since I got an API subscription -> would be better to crowdsource a list of accounts though? Dunno how to do that. @martinshkreli maybe?
  2. Get the top 42 bangers for each followed account with more than 420 followers
  3. Generate 10 boring versions of the banger tweets through OAI API
  4. Finetune through QLoRA on Llama 2, input: boring -> output: banger
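Steps 3 and 4 could be wired together roughly like this. It's only a sketch: `make_boring` stands in for whatever call produces the boring rewrites (the OAI request itself is not shown), and the `input`/`output` field names and JSONL layout are assumptions, not a format the thread settled on.

```python
import json
from typing import Callable, Iterable, Iterator, List


def build_pairs(
    bangers: Iterable[str],
    make_boring: Callable[[str, int], List[str]],
    n_boring: int = 10,
) -> Iterator[dict]:
    """Yield one (boring -> banger) training pair per boring rewrite."""
    for banger in bangers:
        for boring in make_boring(banger, n_boring):
            yield {"input": boring, "output": banger}


def to_jsonl(pairs: Iterable[dict]) -> str:
    """Serialize pairs as one JSON object per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)


def fake_make_boring(banger: str, n: int) -> List[str]:
    """Placeholder for the real rewriting call; returns dummy rewrites."""
    return [f"boring version {i} of: {banger}" for i in range(n)]


pairs = list(build_pairs(["example banger"], fake_make_boring))
jsonl = to_jsonl(pairs)
```

Keeping the rewriter as a passed-in function means the pairing logic can be tested without API access.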

Whatcha think

Bradley-Butcher commented 11 months ago

Sounds good, wondering if some larger scale pretraining on just bangers would be advantageous before the lower scale synthetic not banger/ banger conversion

Concerned about potential bias the gpt unbanging would inject but might not be an issue

Could sidestep the synthetic not-banger bias issue by having an intermediate banger format, maybe yaml, something like:

topic: LK-99, UFOs
edgy-magnitude: 8
mention: sam altman
banger: true
 ... other stuff

Use existing LM to extract the yamls, constrain generation according to schema.

Then train a schema -> tweet LM.
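A rough sketch of what the intermediate format could look like in code, using only the four fields from the example above. The `BangerSpec` class, the 0 to 10 range for `edgy-magnitude`, and the `to_prompt` rendering are all assumptions; the schema itself is still undecided.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BangerSpec:
    """Hypothetical intermediate format: the fields from the example only."""

    topic: List[str]
    edgy_magnitude: int
    mention: Optional[str] = None
    banger: bool = True

    def __post_init__(self) -> None:
        # Assumed scale; the thread only shows the value 8.
        if not 0 <= self.edgy_magnitude <= 10:
            raise ValueError("edgy_magnitude must be between 0 and 10")

    def to_prompt(self) -> str:
        """Render the spec as YAML-style lines for a schema -> tweet model."""
        lines = [
            f"topic: {', '.join(self.topic)}",
            f"edgy-magnitude: {self.edgy_magnitude}",
        ]
        if self.mention:
            lines.append(f"mention: {self.mention}")
        lines.append(f"banger: {'true' if self.banger else 'false'}")
        return "\n".join(lines)


spec = BangerSpec(topic=["LK-99", "UFOs"], edgy_magnitude=8, mention="sam altman")
prompt = spec.to_prompt()
```

Validating on construction gives a cheap way to reject malformed extractions from the existing LM before they reach training.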

cba to design the schema properly, idk what should be in it

keharv commented 11 months ago

This website appears to contain tweet information: https://nitter.net/

This Java application appears to grab tweet information from Nitter: https://github.com/yamin8000/TwitterScrapper

Does not appear to get following/followers; what kind of data do we need to collect? @codethazine

martinshkreli commented 11 months ago

i think the other obvious thing to think about is defining a 'banger' in the first place.

something like (log likes) / (log followers) = banger index. if a small account gets a huge amount of likes, it has to be a banger. if a huge account gets a small amount of likes, it's trash.
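The index is easy to sketch. One assumption here that isn't in the thread: counts below 2 are clamped to 2, since log 1 = 0 would otherwise zero out the numerator or blow up the denominator.

```python
import math


def banger_index(likes: int, followers: int) -> float:
    """(log likes) / (log followers).

    Assumption: counts below 2 are clamped to 2 so neither log is zero
    or undefined; the thread doesn't say how to handle those cases.
    """
    likes = max(likes, 2)
    followers = max(followers, 2)
    return math.log(likes) / math.log(followers)


# Small account, huge likes vs. huge account, few likes.
small_account_hit = banger_index(likes=40_000, followers=50)
big_account_flop = banger_index(likes=50, followers=40_000)
```

An index above 1 means a tweet got more likes than the account has followers; the ratio is the same whichever log base is used.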

we need a twitter API obviously i dont think scraping or using a 3rd party will help. any thoughts?

keharv commented 11 months ago

Fair point, this is something we've discussed in #programming on the discord. This is the current state of the twitter API: https://twitter.com/codethazine/status/1687966569127682049?s=46 Problem is, the API isn't working currently.

Fair point on the definition of a 'banger', codethazine mentioned it last night. Think it should be determined by ratio of followers/likes

I think Nitter can be useful source of scraping (uses twitter backend, javascript free, no account required) if the API continues not to work, will probably have to write our own scraper to capture number of likes on tweets though, as it appears the scraper I linked previously doesn't record the number of likes.

I think the current point we're at is that there isn't an easy way to get a list of accounts a user follows, so we will probably need to curate a list of accounts to scrape for bangers.

realisticattorney commented 11 months ago

if we go with (log likes) / (log followers) we'll soon realize some likes are better than others.

also we'll only get bangers from one-hit banger accounts (can't beat log 40k / log 50) and probably risk overfitting due to sparse data.

picking the highest banger density accounts fixes the latter. But by only filtering based on that, the same 40k followers giving 20k likes per tweet to "Be what you want to attract" accounts are the only thing our LM will be and attract.

So I'd just pick by hand accounts on-the-rise, already followed by all the cool kids, then apply Martin's formula.

codethazine commented 11 months ago

Ok, so I curated a small group of 42 banger accounts on PR #20 - would love some feedback/integrations on that. I'll then proceed to:

  1. Get the follower count and augment the banger_accounts.csv to banger_accounts_w_followers.csv through get_num_followers.py *
  2. Run get_last_100_tweets.py, cycling through the users on data/banger_accounts.csv and dump them on data/last_100_tweets_from_bangerers.csv
  3. Filter the tweets by a TBD followers/likes ratio and exclude any tweet containing links, replies, and RTs through filter_bangers.py, dumping them on final_bangers.csv. If the number of banger_accounts stays at 42, I'm assuming we'll get around 1000 tweets on final_bangers.csv. More than that could be difficult, considering the 10K Twitter API limit on the basic plan.
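The filtering step might look something like the sketch below. This is not the actual `filter_bangers.py`: the tweet dict keys (`text`, `likes`, `followers`, `is_reply`), the text-based link/RT checks, and the `min_index` threshold are all placeholders, and the log-ratio index from earlier in the thread stands in for the TBD ratio.

```python
import math
from typing import Iterable, List


def is_clean(tweet: dict) -> bool:
    """True if the tweet has no link and is neither a reply nor an RT."""
    text = tweet["text"]
    has_link = "http://" in text or "https://" in text
    is_retweet = text.startswith("RT @")
    is_reply = tweet.get("is_reply", False) or text.startswith("@")
    return not (has_link or is_retweet or is_reply)


def banger_index(likes: int, followers: int) -> float:
    """(log likes) / (log followers), clamping counts below 2 (assumption)."""
    return math.log(max(likes, 2)) / math.log(max(followers, 2))


def filter_bangers(tweets: Iterable[dict], min_index: float = 0.5) -> List[dict]:
    """Keep clean tweets whose index meets the placeholder threshold."""
    return [
        t
        for t in tweets
        if is_clean(t) and banger_index(t["likes"], t["followers"]) >= min_index
    ]


sample = [
    {"text": "original thought", "likes": 900, "followers": 1000},
    {"text": "RT @someone: thing", "likes": 900, "followers": 1000},
    {"text": "look https://example.com", "likes": 900, "followers": 1000},
    {"text": "@someone agreed", "likes": 900, "followers": 1000},
    {"text": "meh", "likes": 3, "followers": 100_000},
]
kept = filter_bangers(sample)
```

Running the cheap text checks before the ratio check keeps the threshold easy to tune later without touching the exclusion rules.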

So banger data coming soon!

codethazine commented 11 months ago

Also, came across this paper detailing LoRA vs full fine-tune. Considering our relatively small dataset, it might be best going for a full fine-tune approach

martinshkreli commented 11 months ago

ive secured an API key for this. can we make a twitter group chat? please DM me, and then maybe we can migrate that to discord.

codethazine commented 11 months ago

Some finetuning results on OAI Curie: https://github.com/effectiveaccelerationism/text-to-banger/pull/24#issuecomment-1676015783

I'm pretty delighted with the results actually, whatcha think?