LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Scraping Reddit dumps #53

Closed yk closed 1 year ago

yk commented 1 year ago

Reddit could provide a good source of training data, especially since the tree-like structure allows for multiple continuations of a conversation, which is amenable to ranking. Probably not every subreddit will be ideal; most will just result in "general conversations", but there might be some that are essentially in instruction-reply or question-answer form (like r/whatisthisthing).

From Christoph:

Basically the idea is: we have a graph with one root and many branches and leaves.

  1. parse the graph from the JSONs
  2. get the paths from the root to the leaves that have the most upvotes & make plain text from them (we should not take all of them, because then the parts near the root would have high repetition); see the sketch below
     https://files.pushshift.io/reddit/comments/
     https://files.pushshift.io/reddit/comments/sample_data.json
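
For illustration, a minimal sketch of that path-extraction idea, assuming each comment dict carries `id`, `parent_id`, `score`, and `body` (the real pushshift data prefixes ids with `t1_`/`t3_`, which this sketch ignores):

```python
from collections import defaultdict

def best_paths(submission: dict, comments: list[dict], top_n: int = 3) -> list[list[str]]:
    """Return the top_n root-to-leaf paths, ranked by summed upvotes."""
    children = defaultdict(list)
    for comment in comments:
        children[comment["parent_id"]].append(comment)

    paths = []

    def walk(node_id: str, path: list[str], score: int) -> None:
        kids = children.get(node_id, [])
        if not kids:  # reached a leaf: record the finished path
            paths.append((score, path))
            return
        for kid in kids:
            walk(kid["id"], path + [kid["body"]], score + kid["score"])

    walk(submission["id"], [submission["title"]], submission.get("score", 0))
    # keep only the best few paths so text near the root is not repeated endlessly
    best = sorted(paths, key=lambda pair: pair[0], reverse=True)[:top_n]
    return [path for _, path in best]
```
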
SriPrarabdha commented 1 year ago

I think r/NoStupidQuestions, r/AskReddit, r/answers, r/ExplainLikeImFive and r/AskScience are really good for collecting this kind of data

SriPrarabdha commented 1 year ago

If this issue is not assigned to anyone, I would like to work on it

Proteusiq commented 1 year ago

I am also available to pick this one up, @SriPrarabdha. Could we also work together?

yk commented 1 year ago

Hey, thanks a lot :) I've assigned both of you, feel free to work separately or together.

Remember, we're mainly interested in the scraping and parsing code and some instructions on how to run it all. We have infrastructure to do the data collection and storage, so not really a need on your side to do that part, it's really more about how to obtain and handle the data.

SriPrarabdha commented 1 year ago

@Proteusiq that sounds great! How do you want to get started with this?

Proteusiq commented 1 year ago

> @Proteusiq that sounds great! How do you want to get started with this?

I have time tomorrow. I could start with a prototype and add snippets here, and we can see how to go about it. What say you?

SriPrarabdha commented 1 year ago

Yeah, for sure 👍

> @Proteusiq that sounds great! How do you want to get started with this?

> I have time tomorrow. I could start with a prototype and add snippets here, and we can see how to go about it. What say you?

Proteusiq commented 1 year ago

Path to getting data: I have tested with Postman; we can use requests or httpx sessions.

GET, e.g.:

https://api.pushshift.io/reddit/search/submission?subreddit=whatisthisthing&size=10

Data can be gathered in time buckets with the before and after params. I will upload a code snippet tomorrow.

❗ API params
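
For the record, a rough sketch of that time-bucket approach with httpx; the `before`/`after`/`size` parameters come from the pushshift API, while the bucket width, cutoff timestamp, and subreddit are just illustrative:

```python
import httpx

BASE_URI = "https://api.pushshift.io/reddit"
DAY = 86_400  # seconds

def fetch_bucket(client: httpx.Client, subreddit: str, after: int, before: int) -> list[dict]:
    """Fetch one time bucket of submissions between two epoch timestamps."""
    response = client.get(
        "/search/submission",
        params={"subreddit": subreddit, "after": after, "before": before, "size": 100},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["data"]

with httpx.Client(base_url=BASE_URI) as client:
    end = 1_667_260_800  # illustrative cutoff (2022-11-01 UTC)
    for start in range(end - 7 * DAY, end, DAY):  # one bucket per day, one week back
        for post in fetch_bucket(client, "whatisthisthing", start, start + DAY):
            print(post.get("title"))
```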

yk commented 1 year ago

can both of you DM me somehow? discord, twitter, all good :) makes coordination easier

SriPrarabdha commented 1 year ago

> can both of you DM me somehow? discord, twitter, all good :) makes coordination easier

Alrighty👍

Proteusiq commented 1 year ago

@SriPrarabdha can you collect initial list of subreddits?

SriPrarabdha commented 1 year ago

I've already shared some of the subreddits that we can use and will update if I find new ones

Proteusiq commented 1 year ago

These ones:

r/NoStupidQuestions 
r/AskReddit
r/answers
r/ExplainLikeImFive
r/AskScience

?

SriPrarabdha commented 1 year ago

Yeah, these ones

> These ones:
>
> r/NoStupidQuestions
> r/AskReddit
> r/answers
> r/ExplainLikeImFive
> r/AskScience
>
> ?

SriPrarabdha commented 1 year ago

I have collected initial data in JSON form while preserving the graph structure of the comments. How should I share it with you guys to have a look?

yk commented 1 year ago

> I have collected initial data in JSON form while preserving the graph structure of the comments. How should I share it with you guys to have a look?

upload here or discord.

do you have code for this somewhere in a fork?

SriPrarabdha commented 1 year ago

I have put together the code and JSON file in this repo: https://github.com/SriPrarabdha/Reddit-Scrapper

But the main problem is that parsing one post on a subreddit with 15K comments took around 25 minutes, so even scraping one subreddit completely will take a long time.

Proteusiq commented 1 year ago

@SriPrarabdha I think you are onto something. We can always make the scraper faster. Update on https://api.pushshift.io/reddit/comments/:

import pandas as pd
from httpx import Client

HEADERS = {"User-Agent": "Prayson W. Daniel <praysonpi@gmail.com>"}
BASE_URI = "https://api.pushshift.io/reddit"

timeout = 60  # seconds
params = {
    "subreddit": "whatisthisthing",
    "size": 10,
    "score": 20,
    "num_comments": 10,  # has no effect
}

with Client(base_url=BASE_URI, headers=HEADERS) as request:

    print("Fetching submission")
    s = request.get(url="/search/submission",
                    params=params,
                    timeout=timeout)

    print("Fetching comments")
    _ids = ",".join(item.get("id") for item in s.json().get("data"))
    params.update({"ids": _ids})
    c = request.get(url="/search/comment",
                    params=params,
                    timeout=timeout)

# Return only needed columns with `fields`
# merge the submission to the comments

datac = pd.DataFrame(c.json().get("data"))
datas = pd.DataFrame(s.json().get("data"))

I will try downloading files instead from https://files.pushshift.io.

They are huge: RC 2022-10 => 23.8 GB and RS => 9.5 GB.

Proteusiq commented 1 year ago

@yk and @SriPrarabdha: Updates on files: it is possible to get the data offline. I downloaded RC and RS files for tests. This is where I am:

import json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path

import pandas as pd
from zstandard import ZstdDecompressor, open as zopen

def smart_open(file_path: Path) -> Generator[str, None, None]:
    """
    Use:
    ```python
    import json
    from pathlib import Path

    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob

DATA_DIR = Path("../data")
submission_objects = smart_open(DATA_DIR / "RS_2022-10.zst")
submission_blobs = map(json.loads, submission_objects)

# params
subreddit = "whatisthisthing"
num_comments = 10

# working on finding a faster or better way to do this
datas_gen = (blob for blob in submission_blobs
             if (blob["subreddit"] == subreddit and
                 blob["num_comments"] >= num_comments))

data = pd.DataFrame(datas_gen)

The idea is to get ids and questions from the submissions, and their comments from the comments file. Merge, group by id, and order by reply time on the comments.
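
A rough pandas sketch of that merge-and-sort step (field names like `parent_id`, `created_utc`, `body`, and `title` are my reading of the pushshift schema):

```python
import pandas as pd

# datas: submissions frame, datac: comments frame, as built above.
# A top-level comment's parent_id is the submission's fullname: "t3_" + id.
datas["link"] = "t3_" + datas["id"]
datac["link"] = datac["parent_id"]

merged = datac.merge(datas[["link", "title"]], on="link")

# group each submission's replies together, ordered by reply time
for link, thread in merged.sort_values("created_utc").groupby("link"):
    print(thread["title"].iloc[0])
    print(thread["body"].tolist())
```
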
yk commented 1 year ago

looks pretty neat so far, nice work! is there a chance we could use something like typer or so, to make this into a script that takes flags to define things like data location etc?
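
For what it's worth, a minimal typer sketch of that kind of script; the flag names and defaults here are purely illustrative:

```python
from pathlib import Path

import typer

app = typer.Typer()

@app.command()
def scrape(
    data_dir: Path = typer.Option(Path("./data"), help="Directory holding the pushshift dumps"),
    subreddit: str = typer.Option("whatisthisthing", help="Subreddit to extract"),
    min_comments: int = typer.Option(10, help="Skip submissions with fewer comments"),
) -> None:
    """Extract submission/comment threads from local pushshift dumps."""
    typer.echo(f"Scraping r/{subreddit} from {data_dir} (>= {min_comments} comments)")
    # ... plug in the smart_open / filtering code from above ...

if __name__ == "__main__":
    app()
```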

doroshroman commented 1 year ago

Guys, do you need help to speed up parsing? I can step in and try to help you.

Proteusiq commented 1 year ago

> Guys, do you need help to speed up parsing? I can step in and try to help you.

Parsing is not needed, as the data is already JSON (Python dictionaries); what we need is fast access to the fields we want. Have you worked with hyperjson or orjson?

@yk Yes, we can make a beautiful CLI wrapper. What I have now are just prototypes

doroshroman commented 1 year ago

> Parsing is not needed, as the data is already JSON (Python dictionaries); what we need is fast access to the fields we want. Have you worked with hyperjson or orjson?

Actually, I haven't had a chance to work with these libraries. But it's never too late to learn something new.

doroshroman commented 1 year ago

Also, what kind of trees do you want to build from the JSON representations?

Proteusiq commented 1 year ago

> Also, what kind of trees do you want to build from the JSON representations?

Something like: id "ABC", submission: "What happened to Batman?". In comments, we fetch the comments where id = "ABC" and sort them by time of reply:

 id "ABC", submission: "What happened to Batman?"  Time 10:30
 id "ABC", comment: "Because Catwoman happened" Time 10:45
 id "ABC", comment: "No way" Time 10:46

So we have replies as they come in. The tree goes from submission -> earliest comments.

Sometimes the comments can branch out into comment threads of their own ...
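
For illustration, a small sketch of nesting comments under their parents, using the `name`/`parent_id` linkage that comes up later in this thread (field names per the pushshift schema):

```python
from collections import defaultdict

def build_tree(submission: dict, comments: list[dict]) -> dict:
    """Nest comments under their parents, with replies ordered by creation time."""
    children = defaultdict(list)
    for comment in sorted(comments, key=lambda c: c["created_utc"]):
        children[comment["parent_id"]].append(comment)

    def attach(node_name: str) -> list[dict]:
        # a node's children carry its fullname (e.g. "t3_..."/"t1_...") in parent_id
        return [
            {"body": child["body"], "replies": attach(child["name"])}
            for child in children[node_name]
        ]

    return {"submission": submission["title"], "replies": attach(submission["name"])}
```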

Updates: using a generator allows me to keep calling and stopping in Jupyter. Getting submissions is fast, but matching them to comments takes forever:

# instead of json
import orjson as json
...

break_point = 100
datas_list = []
for blob in submission_blobs:
    if break_point < 0:
        break

    if (blob["subreddit"] == subreddit and
        blob["num_comments"] >= num_comments):
        print(".", end="")
        break_point -= 1
        datas_list.append(blob)

ids = set(b.get("id") for b in datas_list)
print(f"number of ids: {len(ids)}")

com_objects = smart_open(DATA_DIR / "RC_2022-10.zst")
blobc = map(json.loads, com_objects)

## just to see how long it takes to get 10 matches :(
break_point = 10
datac_list = [] 
for blob in blobc:
    if blob["subreddit"] != subreddit:
        continue

    if break_point < 0:
        break
    print(".", end="")
    if blob["id"] in ids:
        print("X", end="")
        break_point -= 1
        datac_list.append(blob)
...

Could be I am matching on the wrong things. Maybe in the comments, I need parent_id. I will keep on searching.

doroshroman commented 1 year ago

I can write a multiprocessing version of this, which can speed up matching; just attach the full file with the code.

Proteusiq commented 1 year ago

> I can write a multiprocessing version of this, which can speed up matching; just attach the full file with the code.

Super! I got it working now. In submissions, I needed "name", and in comments, "parent_id".

Note: the prints are just for debugging… they need to be removed.
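
For anyone following along: the join works because a submission's `name` is its fullname, `t3_<id>`, and top-level comments carry exactly that value in `parent_id`. With illustrative values:

```python
submission = {"id": "abc123", "name": "t3_abc123", "title": "What is this?"}
comment = {"id": "def456", "parent_id": "t3_abc123", "body": "Looks like a widget"}

# a top-level comment points at the submission's fullname
assert comment["parent_id"] == submission["name"]
```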

Full code


import orjson as json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path

import pandas as pd
from zstandard import ZstdDecompressor, open as zopen

def smart_open(file_path: Path) -> Generator[str, None, None]:
    """
    Use:
    ```python
    import orjson as json
    from pathlib import Path

    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob

DATA_DIR = Path("../data")
submission_objects = smart_open(DATA_DIR / "RS_2022-10.zst")
comment_objects = smart_open(DATA_DIR / "RC_2022-10.zst")

submission_blobs = map(json.loads, submission_objects)
comment_blobs = map(json.loads, comment_objects)

# params
subreddit = "whatisthisthing"
num_comments = 10

# get 101 submissions with num_comments >= 10
break_point = 100
datas_list = []
for blob in submission_blobs:
    if break_point < 0:
        break

    if (blob["subreddit"] == subreddit and
        blob["num_comments"] >= num_comments):
        print(".", end="")
        break_point -= 1
        datas_list.append(blob)

# get the ids
ids = set(b.get("name") for b in datas_list)
print(f"we have {len(ids)} unique ids")

# this takes long just to get 10
break_point = 10
datac_list = []
for blob in comment_blobs:
    if blob["subreddit"] != subreddit:
        continue

    if break_point < 0:
        break
    if blob["parent_id"] in ids:
        print(".", end="")
        break_point -= 1
        datac_list.append(blob)

# merging of data ...

danielpwarren commented 1 year ago

From a previous project of mine I have all the reddit comments and submissions on pushshift from 2005-12 to 2021-06 stored on a local server, as well as some code to scrape it. It may be easier for me to scrape the data locally and submit it as a json. The code I have is originally adapted from DialoGPT's reddit extractor, it may be helpful to give it a look. https://github.com/microsoft/DialoGPT

Proteusiq commented 1 year ago

> From a previous project of mine I have all the reddit comments and submissions on pushshift from 2005-12 to 2021-06 stored on a local server, as well as some code to scrape it. It may be easier for me to scrape the data locally and submit it as a json. The code I have is originally adapted from DialoGPT's reddit extractor, it may be helpful to give it a look. https://github.com/microsoft/DialoGPT

That would be perfect 😍: it looks like we are reinventing the wheel: https://github.com/microsoft/DialoGPT/blob/master/reddit_extractor/src/reddit.py

doroshroman commented 1 year ago
import asyncio
from asyncio.events import AbstractEventLoop
from collections.abc import Generator
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from io import TextIOWrapper
from itertools import tee
from pathlib import Path

import orjson as json
import pandas as pd
from zstandard import ZstdDecompressor, open as zopen

def smart_open(file_path: Path) -> Generator[str, None, None]:
    """
    Use:
    ```python
    import orjson as json
    from pathlib import Path

    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob

def filter_submissions(submission_blobs, subreddit, num_comments):
    # get 101 submissions with num_comments >= 10
    break_point = 100
    datas_list = []
    for blob in submission_blobs:
        if break_point < 0:
            break

        if (blob["subreddit"] == subreddit and
            blob["num_comments"] >= num_comments):
            print(".", end="")
            break_point -= 1
            datas_list.append(blob)

    # get the ids
    ids = set(b.get("name") for b in datas_list)
    print(f"we have {len(ids)} unique ids")

    return ids

# this takes long just to get 10
def matching(comments_chunk, ids, subreddit):
    break_point = 10
    datac_list = []
    for blob in comments_chunk:
        if blob["subreddit"] != subreddit:
            continue

        if break_point < 0:
            break
        if blob["parent_id"] in ids:
            print(".", end="")
            break_point -= 1
            datac_list.append(blob)

    return datac_list

def generate_chunk(iterable, chunk_len=100):
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == chunk_len:
            yield chunk
            chunk = []
    if chunk:  # do not drop the final partial chunk
        yield chunk

async def main(ids, subreddit):
    with ProcessPoolExecutor() as process_pool:
        loop: AbstractEventLoop = asyncio.get_running_loop()
        calls = [partial(matching, comment_chunk, ids, subreddit)
                 for comment_chunk in generate_chunk(comment_blobs_copy)]
        call_coros = []

        for call in calls:
            call_coros.append(loop.run_in_executor(process_pool, call))

        results = await asyncio.gather(*call_coros)

        merged_result = []
        for chunk_result in results:
            merged_result += chunk_result

    return merged_result

if __name__ == '__main__':
    # NB: all three iterators are tee'd from the same RC (comments) file in this test
    DATA_DIR = Path("./data")  # Path("../data")
    submission_objects, comment_objects, comment_objects_copy = tee(
        smart_open(DATA_DIR / "RC_2009-04.zst"), 3)

    submission_blobs = map(json.loads, submission_objects)
    comment_blobs = map(json.loads, comment_objects)
    comment_blobs_copy = map(json.loads, comment_objects_copy)

    # params
    subreddit = "whatisthisthing"
    num_comments = 10

    ids = filter_submissions(submission_blobs, subreddit, num_comments)

    matched_comments = asyncio.run(main(ids, subreddit))
    print(matched_comments)

doroshroman commented 1 year ago

Made some refactoring; please update your DATA_DIR and smart_open paths, if it's still relevant.

Also, I think it's better to use a bigger chunk_len (about 50000).

emersonium commented 1 year ago

Hi, I would like to help. I am following along; this is great progress so far. Maybe I could go after some other sources of data while you are focused on Reddit. My question, @yk, @Proteusiq: what is the format we wish to end up with? Is it a JSON schema? Have we determined that, or is that something we are working towards? I am familiar with web scraping etc. but not familiar with NLP and what an ideal format for the data is. I understand the MVP objective though, so if we can have some clarity, I could go look for other potential sources that might work for the "question > answer-thread" conversational objective, and get them scraped and formatted correctly. Thanks

yk commented 1 year ago

> Hi, I would like to help. I am following along; this is great progress so far. Maybe I could go after some other sources of data while you are focused on Reddit. My question, @yk, @Proteusiq: what is the format we wish to end up with? Is it a JSON schema? Have we determined that, or is that something we are working towards? I am familiar with web scraping etc. but not familiar with NLP and what an ideal format for the data is. I understand the MVP objective though, so if we can have some clarity, I could go look for other potential sources that might work for the "question > answer-thread" conversational objective, and get them scraped and formatted correctly. Thanks

Yes, I think a common JSON schema (or parquet, protobuf, or something) totally makes sense. @lewtun what do you think?

Proteusiq commented 1 year ago

To @yk: @danielpwarren has downloaded files from pushshift from 2005-12 to 2021-06 onto a local server. He has a code adaptation of DialoGPT. We could adopt it.

From my end, I have an end-to-end flow now, but unlike DialoGPT, it does not have data preprocessing. So we are good to go if we can use Daniel's DialoGPT adaptation. The only task left will be qualifying good questions and answers.

[image]

From that we could get JSON like [{question:, answer1:, answer2:, answer3:}, {question: .... }]
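
Concretely, one record could look like this (the values are just the Batman example from above):

```python
[
    {
        "question": "What happened to Batman?",
        "answer1": "Because Catwoman happened",
        "answer2": "No way",
        "answer3": "...",
    },
]
```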

SriPrarabdha commented 1 year ago

@yk and @Proteusiq I have made a simple typer CLI application and made it available on PyPI: https://pypi.org/project/reddit-comment-scrapper/
Any suggestions on how to make it better?

> looks pretty neat so far, nice work! is there a chance we could use something like typer or so, to make this into a script that takes flags to define things like data location etc?

yk commented 1 year ago

> From my end, I have an end-to-end flow now, but unlike DialoGPT, it does not have data preprocessing. So we are good to go if we can use Daniel's DialoGPT adaptation. The only task left will be qualifying good questions and answers.

sweet, thank you very much! make sure to retain DialoGPT's MIT header :) Once you're done, could you make a PR with the code? @lewtun any comments on how & where?

danielpwarren commented 1 year ago

I've modified the code and put it up at danielpwarren/reddit-extractor. It's not great and I don't have much time to work on it atm. I'll run it locally with the aforementioned subreddits and post here when it's done. The data is currently output in TSV format, and there's an example in the repo.

yk commented 1 year ago

In #282 @andrewm4894 suggests r/amitheasshole

> Could be a way to convert this into more structured training data that actually might encode a lot of nuance. There are lots of rules and heuristics to that subreddit, such that we could extract or convert it into a sort of soft-label dataset that maybe could be useful. Apologies if this is a dupe, as I am sure Reddit data is already on the roadmap; more so that there could be a subset of subreddits that could be enriched or transformed in some way to make them even more useful.
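
One plausible reading of the soft-label idea: scan each post's top-level comments for the subreddit's judgement acronyms (YTA/NTA/ESH/NAH/INFO) and turn their upvote-weighted counts into a label distribution. A hedged sketch:

```python
import re
from collections import Counter

VERDICTS = ("YTA", "NTA", "ESH", "NAH", "INFO")  # r/AmItheAsshole judgement acronyms
VERDICT_RE = re.compile(r"\b(YTA|NTA|ESH|NAH|INFO)\b")

def soft_label(comments: list[dict]) -> dict[str, float]:
    """Upvote-weighted distribution over verdicts found in top-level comments."""
    counts = Counter()
    for comment in comments:
        match = VERDICT_RE.search(comment.get("body", ""))
        if match:
            counts[match.group(1)] += max(comment.get("score", 1), 1)
    total = sum(counts.values()) or 1  # avoid dividing by zero
    return {verdict: counts[verdict] / total for verdict in VERDICTS}
```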

andrewm4894 commented 1 year ago

For data sources like this - would/could/should we have some sort of example dummy data as a sort of target of what is needed in terms of format or structure before we do any work on it?

I can imagine there will be a lot of issues getting created with source suggestions, and it could maybe be useful, or help cut down on noise, if there were some clear "target templates" or something that people could try to stick to?

Still only getting up to speed, so apologies if this is already done or perhaps might create too much friction right now. Thoughts?

yk commented 1 year ago

> For data sources like this - would/could/should we have some sort of example dummy data as a sort of target of what is needed in terms of format or structure before we do any work on it?

probably @lewtun is the person to talk to for this

huu4ontocord commented 1 year ago

@SriPrarabdha or @Proteusiq - can we get a sample set of data (< 100) to see if we can convert it into instructions?

huu4ontocord commented 1 year ago

@Proteusiq and @SriPrarabdha checking on status. thank you!

Proteusiq commented 1 year ago

Hej @ontocord

I saw the issue closed, so I assumed that @danielpwarren's approach was the path forward. @danielpwarren, do you have the samples? Otherwise, I could extract some from my script tomorrow.

huu4ontocord commented 1 year ago

@Proteusiq the issue is still open.

Anan-Saadi commented 1 year ago

@Proteusiq is this issue still active? If so, I'd like to contribute.

Proteusiq commented 1 year ago

Yes, it is. We are missing the CLI part.

michaelbogdan commented 1 year ago

> Yeah, these ones
>
> These ones:
>
> r/NoStupidQuestions
> r/AskReddit
> r/answers
> r/ExplainLikeImFive
> r/AskScience
>
> ?

You could add

/r/changemyview
/r/tipofmytongue
/r/askculinary
/r/AskAcademia
/r/AskAnthropology
/r/AskAstronomy
/r/AskElectronics
/r/AskEngineers
/r/AskHistorians
/r/AskPhilosophy
/r/AskPhysics
/r/AskScienceFiction
/r/AskSocialScience
/r/AskStatistics
/r/HomeworkHelp
/r/ChemHelp
/r/Estimation
/r/MathHelp
/r/AskRedditAfterDark
/r/TooAfraidToAsk

Should I research some more?

Anan-Saadi commented 1 year ago

@Proteusiq sorry for the late reply, but what exactly is needed for us to produce a usable dataset? From what I can tell, @danielpwarren has a very sophisticated workflow.

Proteusiq commented 1 year ago

> @Proteusiq sorry for the late reply, but what exactly is needed for us to produce a usable dataset? From what I can tell, @danielpwarren has a very sophisticated workflow.

Hi, @Anan-Saadi

We are missing two things:

  1. Data format
  2. CLI

We have starter code already; my work just keeps me too busy to complete it...

Anan-Saadi commented 1 year ago

@Proteusiq OK, I'll see what I can do in the coming few days.