yk closed this issue 1 year ago
I think r/NoStupidQuestions, r/AskReddit, r/answers, r/ExplainLikeImFive and r/AskScience are really good for collecting this kind of data
If this issue is not assigned to anyone, I would like to work on it
I am also available to pick this one @SriPrarabdha. We could also work together?
Hey, thanks a lot :) I've assigned both of you, feel free to work separately or together.
Remember, we're mainly interested in the scraping and parsing code and some instructions on how to run it all. We have infrastructure to do the data collection and storage, so not really a need on your side to do that part, it's really more about how to obtain and handle the data.
@Proteusiq that sounds great! How do you want to get started with this?
I have tomorrow. I could start with a prototype and add snippets here and we can see how to go about. What say you?
Yeah for sure👍
Path to getting data. I have tested with Postman; we can use requests or httpx sessions.
GET e.g.
https://api.pushshift.io/reddit/search/submission?subreddit=whatisthisthing&size=10
Data can be gathered in time buckets with the before and after params. I will upload a code snippet tomorrow.
:exclamation: API params
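As a rough sketch of that time-bucketed fetching (illustrative only; the helper name is made up, the `subreddit`, `size`, `before` and `after` parameters follow the Pushshift API linked above):

```python
# Sketch: page through Pushshift submissions in a time bucket using before/after epochs.
from httpx import Client

BASE_URI = "https://api.pushshift.io/reddit"


def fetch_submissions(subreddit: str, after: int, before: int, size: int = 100) -> list[dict]:
    """Return submissions created between the `after` and `before` unix timestamps."""
    params = {"subreddit": subreddit, "after": after, "before": before, "size": size}
    with Client(base_url=BASE_URI) as client:
        response = client.get("/search/submission", params=params, timeout=60)
        response.raise_for_status()
        return response.json().get("data", [])
```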
can both of you DM me somehow? discord, twitter, all good :) makes coordination easier
Alrighty👍
@SriPrarabdha can you collect an initial list of subreddits?
I've already shared some of the subreddits that we can use and will update if I find some new ones
These ones:
r/NoStupidQuestions
r/AskReddit
r/answers
r/ExplainLikeImFive
r/AskScience
?
Yeah, these ones
I have collected initial data in JSON form while preserving the graph structure of these comments. How should I share it with you guys to have a look?
upload here or discord.
do you have code for this somewhere in a fork?
I have put together the code and JSON file in this repo https://github.com/SriPrarabdha/Reddit-Scrapper But the main problem is that parsing one post on a subreddit with 15K comments took around 25 minutes. So even scraping 1 subreddit completely will take a long time
@SriPrarabdha I think you are onto something. We can always make the scraper faster. Update on https://api.pushshift.io/reddit/comments/
import pandas as pd
from httpx import Client

HEADERS = {"User-Agent": "Prayson W. Daniel <praysonpi@gmail.com>"}
BASE_URI = "https://api.pushshift.io/reddit"

timeout = 60  # seconds
subreddit = "whatisthisthing"
size = 10
score = 20
num_comments = 10  # has no effect

# query parameters shared by both endpoints
params = {"subreddit": subreddit, "size": size, "score": score, "num_comments": num_comments}

with Client(base_url=BASE_URI, headers=HEADERS) as request:
    print("Fetching submission")
    s = request.get(url="/search/submission",
                    params=params,
                    timeout=timeout)

    print("Fetching comments")
    _ids = ",".join(item.get("id") for item in s.json().get("data"))
    params.update({"ids": _ids})
    c = request.get(url="/search/comment",
                    params=params,
                    timeout=timeout)

# Return only needed columns with `fields`
# merge the submission to the comments
datac = pd.DataFrame(c.json().get("data"))
datas = pd.DataFrame(s.json().get("data"))
I will try downloading files instead from https://files.pushshift.io.
They are huge: RC 2022-10 => 23.8 GB and RS => 9.5 GB.
@yk and @SriPrarabdha: Updates on files: It is possible to get the data offline. I downloaded RC and RS files for tests. This is where I am:
import json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path
import pandas as pd
from zstandard import ZstdDecompressor, open as zopen
def smart_open(file_path: Path) -> Generator[str]:
    """
    Use:
    ```python
    import json
    from pathlib import Path

    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob
DATA_DIR = Path("../data")
submission_objects = smart_open(DATA_DIR / "RS_2022-10.zst")
submission_blobs = map(json.loads, submission_objects)

subreddit = "whatisthisthing"
num_comments = 10

datas_gen = (
    blob
    for blob in submission_blobs
    if blob["subreddit"] == subreddit and blob["num_comments"] >= num_comments
)
data = pd.DataFrame(datas_gen)
The idea is to get ids and questions from the submissions file and the matching comments from the comments file, then merge, group by id, and order by reply time on the comments.
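A rough sketch of that merge step (assumptions: submissions and comments have been loaded into DataFrames `datas` and `datac` as in the earlier snippet, and the Pushshift fields `name`, `parent_id`, `title`, `body` and `created_utc` are present):

```python
# Sketch: attach each comment to its submission's question, then order replies by time.
comments = datac[["parent_id", "body", "created_utc"]]
questions = datas[["name", "title"]].rename(columns={"name": "parent_id", "title": "question"})

threads = (
    comments.merge(questions, on="parent_id")  # keeps top-level comments of known submissions
            .sort_values("created_utc")        # replies in the order they came in
            .groupby("parent_id")
)
```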
looks pretty neat so far, nice work! is there a chance we could use something like typer or so, to make this into a script that takes flags to define things like data location etc?
Guys, do you need help speeding up parsing? I can step in and try to help you.
Parsing is not needed as the data is already JSON (Python dictionaries), but accessing what we need is. Have you worked with hyperjson or orjson?
@yk Yes, we can make a beautiful CLI wrapper. What I have now are just prototypes
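A minimal typer skeleton along those lines could look like this (a sketch only; flag names and defaults are hypothetical, not the final CLI):

```python
from pathlib import Path

import typer

app = typer.Typer()


@app.command()
def scrape(
    data_dir: Path = typer.Option(Path("./data"), help="Directory holding RS_*/RC_* .zst dumps"),
    subreddit: str = typer.Option("whatisthisthing", help="Subreddit to extract"),
    num_comments: int = typer.Option(10, help="Minimum comments per submission"),
    output: Path = typer.Option(Path("threads.json"), help="Where to write the result"),
) -> None:
    """Extract question/answer threads from local Pushshift dumps."""
    typer.echo(f"Reading {data_dir} for r/{subreddit} (>= {num_comments} comments)")
    # ... call the filtering/matching prototypes from the snippets above ...
    typer.echo(f"Writing output to {output}")


if __name__ == "__main__":
    app()
```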
Also, what kind of trees do you want to build from the JSON representations?
Something like:
id "ABC", submission: "What happened to Batman?"
In comments, we fetch comments where id = "ABC"
sort the comments by time of reply
id "ABC", submission: "What happened to Batman?" Time 10:30
id "ABC", comment: "Because Catwoman happened" Time 10:45
id "ABC", comment: "No way" Time 10:46
So we have replies as they come in. The tree goes from submission -> earliest comments.
Sometimes the comments can branch out into their own comment threads ...
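To make that branching concrete, a sketch of rebuilding such a tree from flat comment dicts (illustrative only; it assumes the Pushshift fields `name`, `parent_id`, `created_utc`, `body` and `title`):

```python
from collections import defaultdict


def build_tree(submission: dict, comments: list[dict]) -> dict:
    """Nest comments under their parents, ordered by reply time."""
    children = defaultdict(list)
    for comment in sorted(comments, key=lambda c: c["created_utc"]):
        children[comment["parent_id"]].append(comment)

    def attach(parent_name: str) -> list[dict]:
        # top-level comments point at the submission's "name", replies at a comment's "name"
        return [{"body": c["body"], "replies": attach(c["name"])}
                for c in children[parent_name]]

    return {"question": submission["title"], "replies": attach(submission["name"])}
```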
Updates: Using a generator allows me to keep calling and stopping in Jupyter. Getting submissions is fast, but matching them to comments takes forever.
# instead of json
import orjson as json

...

break_point = 100
datas_list = []
for blob in blobs:
    if break_point < 0:
        break
    if (blob["subreddit"] == subreddit and
            blob["num_comments"] >= num_comments):
        print(".", end="")
        break_point -= 1
        datas_list.append(blob)

ids = set(b.get("id") for b in datas_list)
print(f"number of {ids=}")

com_objects = smart_open(DATA_DIR / "RC_2022-10.zst")
blobc = map(json.loads, com_objects)

## just to see how long it takes to get 10 matches :(
break_point = 10
datac_list = []
for blob in blobc:
    if blob["subreddit"] != subreddit:
        continue
    if break_point < 0:
        break
    print(".", end="")
    if blob["id"] in ids:
        print("X", end="")
        break_point -= 1
        datac_list.append(blob)

...
Could be I am matching on the wrong things. Maybe in the comments, I need parent_id. I will keep on searching.
I can write a multiprocessing version of this, which can speed up matching; just attach the full file with code.
Super! I got it working now. In submissions, I needed "name", and in comments, "parent_id".
Note: the prints are just for debugging… they need to be removed.
Full code
import orjson as json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path
import pandas as pd
from zstandard import ZstdDecompressor, open as zopen
def smart_open(file_path: Path) -> Generator[str]:
    """
    Use:
    ```python
    import orjson as json
    from pathlib import Path

    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob
DATA_DIR = Path("../data")
submission_objects = smart_open(DATA_DIR / "RS_2022-10.zst")
comment_objects = smart_open(DATA_DIR / "RC_2022-10.zst")

submission_blobs = map(json.loads, submission_objects)
comment_blobs = map(json.loads, comment_objects)

subreddit = "whatisthisthing"
num_comments = 10

break_point = 100
datas_list = []
for blob in submission_blobs:
    if break_point < 0:
        break
    if (blob["subreddit"] == subreddit and
            blob["num_comments"] >= num_comments):
        print(".", end="")
        break_point -= 1
        datas_list.append(blob)

ids = set(b.get("name") for b in datas_list)
print(f"we have {len(ids)} unique ids")

break_point = 10
datac_list = []
for blob in comment_blobs:
    if blob["subreddit"] != subreddit:
        continue
    if break_point < 0:
        break
    if blob["parent_id"] in ids:
        print(".", end="")
        break_point -= 1
        datac_list.append(blob)
From a previous project of mine I have all the reddit comments and submissions on pushshift from 2005-12 to 2021-06 stored on a local server, as well as some code to scrape it. It may be easier for me to scrape the data locally and submit it as JSON. The code I have is originally adapted from DialoGPT's reddit extractor; it may be helpful to give it a look. https://github.com/microsoft/DialoGPT
That would be perfect 😍: looks like we are reinventing the wheel https://github.com/microsoft/DialoGPT/blob/master/reddit_extractor/src/reddit.py
import orjson as json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path
import pandas as pd
from zstandard import ZstdDecompressor, open as zopen
import asyncio
from asyncio.events import AbstractEventLoop
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from itertools import tee
def smart_open(file_path: Path) -> Generator[str]:
    """
    Use:
    ```python
    import orjson as json
    from pathlib import Path

    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob
def filter_submissions(submission_blobs, subreddit, num_comments):
    break_point = 100
    datas_list = []
    for blob in submission_blobs:
        if break_point < 0:
            break
        if (blob["subreddit"] == subreddit and
                blob["num_comments"] >= num_comments):
            print(".", end="")
            break_point -= 1
            datas_list.append(blob)

    # get the ids
    ids = set(b.get("name") for b in datas_list)
    print(f"we have {len(ids)} unique ids")
    return ids
def matching(comments_chunk, ids, subreddit):
    break_point = 10
    datac_list = []
    for blob in comments_chunk:
        if blob["subreddit"] != subreddit:
            continue
        if break_point < 0:
            break
        if blob["parent_id"] in ids:
            print(".", end="")
            break_point -= 1
            datac_list.append(blob)
    return datac_list
def generate_chunk(iterable, chunk_len=100):
    # collect items into lists of `chunk_len`; also yield the final partial chunk
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == chunk_len:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
async def main(ids, subreddit):
    with ProcessPoolExecutor() as process_pool:
        loop: AbstractEventLoop = asyncio.get_running_loop()
        calls = [partial(matching, comment_chunk, ids, subreddit)
                 for comment_chunk in generate_chunk(comment_blobs_copy)]
        call_coros = []
        for call in calls:
            call_coros.append(loop.run_in_executor(process_pool, call))

        results = await asyncio.gather(*call_coros)

        merged_result = []
        for chunk_result in results:
            merged_result += chunk_result
        return merged_result
if __name__ == "__main__":
    DATA_DIR = Path("./data")  # Path("../data")
    submission_objects, comment_objects, comment_objects_copy = tee(
        smart_open(DATA_DIR / "RC_2009-04.zst"), 3
    )

    submission_blobs = map(json.loads, submission_objects)
    comment_blobs = map(json.loads, comment_objects)
    comment_blobs_copy = map(json.loads, comment_objects_copy)

    # params
    subreddit = "whatisthisthing"
    num_comments = 10

    ids = filter_submissions(submission_blobs, subreddit, num_comments)
    matched_comments = asyncio.run(main(ids, subreddit))
    print(matched_comments)
Made some refactoring; please update your DATA_DIR and smart_open paths, if it's still relevant. Also, I think it's better to use a bigger chunk_len (about 50000).
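In the sketch above that would just mean, for example:

```python
# pass a larger chunk size when chunking the comment stream
calls = [partial(matching, comment_chunk, ids, subreddit)
         for comment_chunk in generate_chunk(comment_blobs_copy, chunk_len=50_000)]
```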
Hi, I would like to help. I am following; this is great progress so far. Maybe I could go after some other sources of data while you are focused on Reddit. My question, @yk and @Proteusiq, is: what is the format we wish to end up with? Is it a JSON schema? Have we determined that, or is that something we are working towards? I am familiar with web scraping etc. but not with NLP and what an ideal format for the data is. I understand the MVP objective though, so if we can have some clarity, I could go look for other potential sources that might work for the "question > answer-thread" conversational objective, and get them scraped and formatted correctly. Thanks.
yes I think a common json schema (or parquet, protobuf, or something) totally makes sense. @lewtun what do you think?
To @yk: @danielpwarren has downloaded files from Pushshift from 2005-12 to 2021-06 on a local server. He has a code adaptation of DialoGPT's reddit extractor. We could adopt it.
From my end, I have an end-to-end flow now, but unlike DialoGPT, it does not have data preprocessing. So we are good to go if we can use Daniel's DialoGPT adaptation. The only remaining task will be qualifying good questions and answers.
From that we could get JSON like [{question: ..., answer1: ..., answer2: ..., answer3: ...}, {question: ...}]
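Purely as an illustration of that shape (field names are hypothetical, not an agreed schema), reusing the Batman example from above:

```python
# Hypothetical target format: one record per submission, answers ordered by reply time.
threads = [
    {
        "question": "What happened to Batman?",
        "answers": ["Because Catwoman happened", "No way"],
    },
    # ...
]
```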
@yk and @Proteusiq I have made a simple typer CLI application and made it available on PyPI: https://pypi.org/project/reddit-comment-scrapper/ Any suggestions on how to make it better?
sweet, thank you very much! make sure to retain DialoGPT's MIT header :) Once you're done, could you make a PR with the code? @lewtun any comments on how & where?
I've put my modified code up on danielpwarren/reddit-extractor. It's not great and I don't have much time to work on it atm. I'll run it locally with the aforementioned subreddits and post here when it's done. The data is currently output in tsv format and there's an example in the repo.
In #282 @andrewm4894 suggests r/amitheasshole
Could be a way to convert this into more structured training data that actually might encode a lot of nuance. There are lots of rules and heuristics to that subreddit, such that we could extract or convert it into a sort of soft-label dataset that maybe could be useful. Apologies if this is a dupe, as I am sure Reddit data is already on the roadmap; it's more that there could be a subset of subreddits that could be enriched or transformed in some way to make them even more useful.
For data sources like this - would/could/should we have some sort of example dummy data as a sort of target of what is needed in terms of format or structure before we do any work on it?
I can imagine there will be a lot of issues getting created with source suggestions, and it could maybe be useful or help cut down on noise if there were some clear "target templates" or something that people could try to stick to?
Still only getting up to speed so apologies if this is already done or perhaps might create too much friction right now - thoughts?
probably @lewtun is the person to talk to for this
@SriPrarabdha or @Proteusiq - can we get a sample set of data (< 100) to see if we can convert into instructions?
@Proteusiq and @SriPrarabdha checking on status. thank you!
Hi @ontocord
I saw the issue closed, so I assumed that @danielpwarren's way was the path forward. @danielpwarren do you have the samples? Otherwise, I could extract them from my script tomorrow.
@Proteusiq issue is still open.
@Proteusiq is this issue still active? if so I'd like to contribute
Yes, it is. We are missing the CLI part.
You could add
/r/changemyview
/r/tipofmytongue
/r/askculinary
/r/AskAcademia
/r/AskAnthropology
/r/AskAstronomy
/r/AskElectronics
/r/AskEngineers
/r/AskHistorians
/r/AskPhilosophy
/r/AskPhysics
/r/AskScienceFiction
/r/AskSocialScience
/r/AskStatistics
/r/HomeworkHelp
/r/ChemHelp
/r/Estimation
/r/MathHelp
/r/AskRedditAfterDark
/r/TooAfraidToAsk
Should I research some more?
@Proteusiq sorry for the late reply, but what exactly is needed for us to produce a usable dataset? From what I can tell, @danielpwarren has a very sophisticated workflow.
Hi, @Anan-Saadi
We are missing two things:
Data Format
CLI
We already have started code; my work just keeps me too busy to complete it...
@Proteusiq Ok I'll see what I can do in the coming few days
From Christoph: Reddit could provide a good source for training data, especially since the tree-like structure allows for multiple continuations of a conversation, which is amenable to ranking. Probably not every subreddit will be ideal; most will just result in "general conversations", but there might be some that are essentially in instruction-reply form, or question-answer form (like r/whatisthisthing).