KunstDerFuge / Q-notebook


Are all anon posts in Q threads in-scope? #3

Closed ctrlcctrlv closed 2 years ago

ctrlcctrlv commented 3 years ago

This relates to #2.

Now that we have that data, I could pretty easily generate something like qallanons.csv in the style of qdrops.csv via simple web scraping. Images would be omitted, as now, of course. I don't think there'd be any ethical concerns that weren't already handled in #1. After all, the anons themselves think only Q's posts and those he replies to have value.

qallanons.csv would of course exclude Q drops.

As mentioned in #2, this would bring about the ability for…

people like QOrigins [to] answer questions like "what is the first time someone in a Q thread said X"?

Thoughts? Could you do cool things with this data? I don't know much about data science, but you seem really good at Pandas. I'd only want to do it if cool graphs come out, otherwise why bother :B

KunstDerFuge commented 3 years ago

I mean, I would use the heck out of this, and I can guarantee that some very cool graphs would come of it. I think I even saw a scrape out there of the non-Q posts, maybe from an aggregator? Will check around, but it's probably old data. If you wanted to set up a scraper and parse the HTML and everything, that would be cool, but it sounds like a non-trivial amount of work.

ctrlcctrlv commented 3 years ago

I don't know when I'll have time to do this. I should some day though, and keeping this open keeps it in my Issues tab :-)

If anyone wants to do it before me, here's all you would need to do:

Honestly, @KunstDerFuge, to me this sounds like a good Fiverr job lol. Should I just pay someone to do it?

KunstDerFuge commented 3 years ago

So, I'm not above grunt work, and I've got like, TOO much time, so I could probably tackle this. One problem strikes me right away: it seems like we'd have to scrape a 4chan archive, an 8chan archive, and an 8kun archive (unless some combined archive exists).

I already happen to have made a 4plebs scraper (really, a script that interacts with the 4plebs API) that can take a list of threads and return full JSON data for each. That would probably be enough for QOrigins' use case, and it could easily be converted to CSV, dropping and renaming columns as needed to match the schema of qdrops.csv:

{
    "144786493": {
        "doc_id": "138163657",
        "num": "144786493",
        "subnum": "0",
        "thread_num": "144785991",
        "op": "0",
        "timestamp": 1610075491,
        "timestamp_expired": "0",
        "capcode": "N",
        "email": null,
        "name": "Anonymous",
        "trip": null,
        "title": null,
        "comment": "I like how he's consistently wrong about Trump every time and then after Trump does what everyone intelligent could see he was gonna do Fuentes just gets pissy before falling for it the next time",
        "poster_hash": null,
        "poster_country": null,
        "sticky": "0",
        "locked": "0",
        "deleted": "0",
        "nreplies": null,
        "nimages": null,
        "fourchan_date": "1/7/21(Thu)22:11",
        "comment_sanitized": "I like how he's consistently wrong about Trump every time and then after Trump does what everyone intelligent could see he was gonna do Fuentes just gets pissy before falling for it the next time",
        "comment_processed": "I like how he's consistently wrong about Trump every time and then after Trump does what everyone intelligent could see he was gonna do Fuentes just gets pissy before falling for it the next time",
        "formatted": false,
        "title_processed": null,
        "name_processed": "Anonymous",
        "email_processed": null,
        "trip_processed": null,
        "poster_hash_processed": null,
        "poster_country_name": false,
        "poster_country_name_processed": "",
        "exif": null,
        "troll_country_code": false,
        "troll_country_name": "",
        "since4pass": null,
        "unique_ips": null,
        "extra_data": {
            "since4pass": null,
            "uniqueIps": null
        },
        "media": null,
        "board": {
            "name": "Television & Film",
            "shortname": "tv"
        }
    }
}
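
For reference, fetching a thread from 4plebs looks roughly like the sketch below. This assumes 4plebs' FoolFuuka-style thread endpoint and is not the actual script; double-check the path and parameters against the 4plebs API docs.

import time
import requests

# FoolFuuka-style archive endpoint (assumed; verify against 4plebs' API docs).
API_URL = 'https://archive.4plebs.org/_/api/chan/thread/'

def fetch_thread(board, thread_num):
    resp = requests.get(
        API_URL,
        params={'board': board, 'num': thread_num},
        headers={'User-Agent': 'q-notebook-research-scraper'},
    )
    resp.raise_for_status()
    return resp.json()

# Hypothetical usage: walk a list of Q thread numbers on /pol/.
for num in ['147547939']:
    thread = fetch_thread('pol', num)
    time.sleep(1)  # stay well under any rate limit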

Does a similar API exist for 8chan / 8kun archives? Otherwise, I can process the HTML with BeautifulSoup, as you said.

Unique thread IDs can easily be produced with some Pandas magic:

[screenshot of the Pandas snippet]
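
Roughly, that snippet amounts to the following (the 'threadId' column name is a guess; adjust it to whatever qdrops.csv actually uses):

import pandas as pd

drops = pd.read_csv('qdrops.csv')
# 'threadId' is an assumed column name; adjust to match qdrops.csv.
unique_threads = drops['threadId'].dropna().unique()
print(len(unique_threads))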

Here is a gist of the results. I notice just one tiny issue: it seems your script has inserted "undefined" as the thread IDs for indices 230 and 233. Those are currently encoded as NaN in qdrops.csv.

KunstDerFuge commented 3 years ago

Oops, of course we also want board information. Updated the gist

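Presumably something along these lines (column names again being guesses):

# Keep board alongside thread ID; both column names are assumptions.
unique_threads = (drops[['board', 'threadId']]
                  .dropna()
                  .drop_duplicates()
                  .reset_index(drop=True))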

ctrlcctrlv commented 3 years ago

The good news is that 8chan's and 8kun's HTML is essentially identical in every way that matters to a scraper, so you would only need to write one more scraper.

And, being a knockoff of 4chan, it's a very similar HTML structure anyway.

I doubt that most of the 8chan API responses would have been archived, but 8chan did/does(?) have a 4chan API-compatible JSON API.
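
If that API is still reachable, fetching a thread should look essentially like it does on 4chan. A sketch, with the URL shape assumed from 4chan's /res/{thread}.json convention:

import requests

# Assumed vichan-style endpoint mirroring 4chan's JSON API; 8kun's DDoS
# protection may block plain scripts like this.
url = 'https://8kun.top/qresearch/res/8744422.json'  # hypothetical thread number
posts = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()['posts']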

KunstDerFuge commented 3 years ago

Just spent a few hours in a rabbit hole of trying to scrape 8kun, to no avail. Finally checked out the Sqraper code for insight, and found a note about LokiNet:

As of the writing of this, you will have to install LOKINET from https://loki.network/ and run this script with "lokiKun" set to true. This is because 8kun DDoS protection is blocking scripts.

I don't know what that is, but it sounds kinda shady. IDK, I'm about ready to hand it over to Fiverr if you want. I can pay!

I should be able to handle all the 4chan posts; we just need 8chan and 8kun. Bonus points, of course, if this results in a script for future reproducibility and all that. Unless there is a scrapeable archive out there (archive.is didn't like a Scrapy spider I made; it kept redirecting to a Captcha), this may be out of my wheelhouse.

Edit:

I was able to get the 4chan anon posts through the 4plebs API. We can at least get a taste of what's possible with this!

KunstDerFuge commented 3 years ago

I'm taking another crack at making a scraper! Tried the other day to scrape from archive.is again, and it was a total failure. I haven't figured out why yet, but even with Scrapy set to one concurrent connection and a 5-second wait between requests, every response comes back with status 429: Too Many Requests. I could speculate on why that is, but really I have no idea.

However, I'm having success scraping directly from 8kun. I've made a simple Scrapy spider that parses out every post in a thread into the fields we're interested in. Still to do:

Here's the spider I'm using, and a sample of the results (sorry for the screenshot of text). It can be run by installing Scrapy and using the command scrapy runspider 8kun-spider.py -o output.csv -t csv. It also depends on this JSON file of the relevant Q threads being in the same directory.

[screenshot of sample results]
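
For anyone following along, the spider is skeletally something like this (the selectors and the thread-list filename here are stand-ins, not the actual 8kun-spider.py):

import json
import scrapy

class EightKunSpider(scrapy.Spider):
    name = '8kun'
    custom_settings = {'CONCURRENT_REQUESTS': 1, 'DOWNLOAD_DELAY': 5}

    def start_requests(self):
        # 'q_threads.json' stands in for the thread-list file mentioned above.
        with open('q_threads.json') as f:
            for url in json.load(f):
                yield scrapy.Request(url)

    def parse(self, response):
        # Selectors are guesses at 8kun's vichan-style markup.
        for post in response.css('div.post'):
            yield {
                'post_id': post.attrib.get('id'),
                'name': post.css('span.name::text').get(),
                'timestamp': post.css('time::attr(datetime)').get(),
                'body': post.css('div.body').get(),
            }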

One end goal we've been talking about within the Q Origins Project is to use this data to make a read-only chan-type website, which would be a lot easier for some researchers to use than handing them raw CSVs or making them seek out the original threads. Could make some cool Pythonic search functions and stuff -- I've started that project in another repo using Django.

KunstDerFuge commented 3 years ago

I have scraped all (493,996) posts from 8kun Q threads! I parsed the HTML into as close a replica of 8kun's chan markup syntax as I could manage, with this code:

from bs4 import BeautifulSoup

def parse_formatting(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Process green text
    for result in soup.find_all(attrs={'class': 'quote'}):
        result.insert(0, '> ')

    # Process red text
    for result in soup.find_all(attrs={'class': 'heading'}):
        result.insert_before('==')
        result.insert_after('==')

    # Process bold text
    for result in soup.find_all('strong'):
        result.insert_before("'''")
        result.insert_after("'''")

    # Process italic text
    for result in soup.find_all('em'):
        if result.get_text() != '//': # For some reason, the // in URLs is wrapped with <em />
            result.insert_before("''")
            result.insert_after("''")

    # Process underlined text
    for result in soup.find_all('u'):
        result.insert_before("__")
        result.insert_after("__")

    # Process strikethrough text
    for result in soup.find_all('s'):
        result.insert_before("~~")
        result.insert_after("~~")

    # Process spoiler text
    for result in soup.find_all(attrs={'class': 'spoiler'}):
        result.insert_before("**")
        result.insert_after("**")

    final_text = '\n'.join([line.get_text() for line in soup.find_all(attrs={'class': 'body-line'})])
    return final_text

This produces results that look like this: [screenshot of sample output]
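
As a quick sanity check, here's the formatter run on a tiny fabricated snippet:

html = '<div class="body-line"><span class="quote">nice digits</span></div>'
print(parse_formatting(html))  # -> '> nice digits'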

I think the greentext may be slightly inaccurate (it uses "> this" style instead of ">this", because I wasn't sure how to handle quoted >>replies), but it should be good enough for now.

All that remains now is scraping the 8chan threads from an archive site. I'll try running this scraper from a cloud server, on the off chance that the 429 responses I'm getting are due to some kind of IP ban.

ctrlcctrlv commented 3 years ago

429 means Too Many Requests.

Meaning, the site is basically identically configured to how it used to be. Literally just slow down. Do it in one thread, wait 1 second per request. If you keep getting 429, increase the wait to 5 seconds.
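
In Scrapy settings, that advice translates to something like this (the AUTOTHROTTLE and RETRY lines are optional extras, not part of the advice above):

# settings.py -- one request at a time, spaced out
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 1           # increase to 5 if the 429s keep coming
AUTOTHROTTLE_ENABLED = True  # optional: back off automatically on slow responses
RETRY_HTTP_CODES = [429]     # optional: retry rate-limited requests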

KunstDerFuge commented 3 years ago

Yeah, something weird is going on with that. I'm getting 429 even on the first request, and even if I increase the wait to 15 seconds, and even when running on a server, so I'm guessing archive.is is detecting the bot traffic? Maybe I need to mess with User-Agent? It's definitely beyond the scope of my current knowledge but I'll keep looking into it.

ctrlcctrlv commented 3 years ago

Oh, I thought it was 8kun's server. Archive.is could work, anyway. What I would do in your position is copy all the headers Firefox sends and try using those.
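
Concretely, something like this, with the header values copied out of Firefox's network inspector (these particular values are made up):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://archive.is/',
}
resp = requests.get('https://archive.is/', headers=headers)  # hypothetical target page
print(resp.status_code)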

KunstDerFuge commented 3 years ago

That is definitely a good lead, and I found that there's an exchange of cookies going on that seems relevant (presumably TMR = Too Many Requests).

[screenshot of the cookie exchange in Firefox's network inspector]

So I have naively copy-pasted the cookie into the request headers as a string, but still no luck. For one thing, 'tmr_reqNum' gets incremented between requests in Firefox; maybe that's got something to do with it. Scrapy should be able to receive cookies the way a browser does, I think, but I'm not really sure why it isn't. (Sorry, I am very nearly useless with this kind of stuff.) [screenshot of the request headers]

ctrlcctrlv commented 3 years ago

This is a problem for Selenium.
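
That is, drive a real browser instead of faking one. A minimal sketch, assuming Selenium 4 with Firefox/geckodriver:

from selenium import webdriver

driver = webdriver.Firefox()  # requires geckodriver on PATH
try:
    driver.get('https://archive.is/')  # hypothetical target page
    html = driver.page_source  # cookies and headers are handled like a real browser
finally:
    driver.quit()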

KunstDerFuge commented 3 years ago

Brilliant! Yes, this is working! Thanks for your patience and insight; there's a chance maybe we'll get this issue closed out by end of (my) day.

ctrlcctrlv commented 3 years ago

Good news—I'm glad my advice is helpful, at least, even if I don't have the time/energy to do it with/for you :+1:

KunstDerFuge commented 3 years ago

Of course, I couldn't have done it without you! Update: the Selenium scraper is much slower, but going slow and steady. The exceptions here are because the newest archived version was after 8chan's shutdown, so I'll need to go back through and manually re-scrape some of these. I'll let this run overnight, then I just gotta write one more parser, and this should be all good to go.

[screenshot of the scraper log]

KunstDerFuge commented 3 years ago

'Nother update. I ran into some delays. Archive.is is very difficult to scrape, but it is possible with Selenium. For unknown reasons, the site will sometimes redirect you to a Captcha challenge every 15-20 requests, which can be bypassed by starting a new session; other times it will require you to complete a Captcha before any additional requests can be made (which happened on one of the servers I tried). This happens even if requests are spaced more than 15 seconds apart.

I've made a script that handles restarting the session when it's given a Captcha redirect, and which inserts random pauses between requests. It can be modified fairly easily to split the job across servers if needed. Hopefully my next update here will be that this is all done, but running it on one machine, I estimate it will take something like 11 hours to complete.
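
The core of that script is roughly this (the Captcha check and the save_html helper are simplified stand-ins for what the real script does):

import random
import time
from selenium import webdriver

def scrape_all(urls):
    driver = webdriver.Firefox()
    for url in urls:
        driver.get(url)
        if 'captcha' in driver.page_source.lower():  # crude Captcha-redirect detection
            driver.quit()  # starting a fresh session clears the Captcha wall
            driver = webdriver.Firefox()
            driver.get(url)
        save_html(url, driver.page_source)  # hypothetical persistence helper
        time.sleep(random.uniform(10, 25))  # randomized pause between requests
    driver.quit()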

Edit: OOOH! Hot tip from @QOrigins: if you request an 8chan /qresearch/ thread on 8kun, it works. This means the bulk of the 8chan scraping can still use the much faster 8kun spider, which runs in 10-15 threads with no problem.

2nd Edit: Added 8chan_qresearch.csv. This represents >90% of all Q threads on 8chan, and I am pretty exhausted by all this scraping, so this will be it for right now. I do have the other boards scraped from archive.is, but I'll need to write another parser to get them into a usable format.

KunstDerFuge commented 2 years ago

Closing, because the issue is solved by the related project this issue spawned, KunstDerFuge/dChan, where all anon posts in Q threads (and many more) are now archived thanks to the work here!