LorenzoPeve / rag-reddit

RAG application using a Reddit knowledge base
data-engineering llms rag

https://www.reddit.com/prefs/apps

https://github.com/reddit-archive/reddit/wiki/OAuth2

Your username is: reddit_bot
Your password is: snoo
Your app's client ID is: p-jcoLKBynTLew
Your app's client secret is: gko_LXELoV07ZBNUXrvWZfzE3aI

reddit@reddit-VirtualBox:~$ curl -X POST -d 'grant_type=password&username=reddit_bot&password=snoo' --user 'p-jcoLKBynTLew:gko_LXELoV07ZBNUXrvWZfzE3aI' https://www.reddit.com/api/v1/access_token
{
    "access_token": "J1qK1c18UUGJFAzz9xnH56584l4", 
    "expires_in": 3600, 
    "scope": "*", 
    "token_type": "bearer"
}
In [1]: import requests
In [2]: import requests.auth
In [3]: client_auth = requests.auth.HTTPBasicAuth('p-jcoLKBynTLew', 'gko_LXELoV07ZBNUXrvWZfzE3aI')
In [4]: post_data = {"grant_type": "password", "username": "reddit_bot", "password": "snoo"}
In [5]: headers = {"User-Agent": "ChangeMeClient/0.1 by YourUsername"}
In [6]: response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
In [7]: response.json()
Out[7]: 
    {u'access_token': u'fhTdafZI-0ClEzzYORfBSCR7x3M',
    u'expires_in': 3600,
    u'scope': u'*',
    u'token_type': u'bearer'}

Dev Environment

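# Build the images and start the stack in the background
docker compose -p reddit_stack up -d --build
# Stop the stack and remove its volumes
docker compose -p reddit_stack down --volumes
# Same teardown for a stack started without -p (default project name)
docker compose down --volumes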

Reddit API

Reddit Glossary

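Here is a list of the six different types of objects returned from Reddit, each identified by a type prefix:

t1: Comment
t2: Account
t3: Link
t4: Message
t5: Subreddit
t6: Award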
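The type prefix also appears at the start of an object's fullname (for example the "name" value t5_36en4 in a subreddit response). A minimal sketch of splitting a fullname into its type and ID, assuming the prefix-to-type mapping from Reddit's API documentation; `KIND_PREFIXES` and `parse_fullname` are hypothetical names, not part of this repository:

```python
# Type prefixes as listed in Reddit's API documentation (assumed mapping).
KIND_PREFIXES = {
    "t1": "comment",
    "t2": "account",
    "t3": "link",
    "t4": "message",
    "t5": "subreddit",
    "t6": "award",
}


def parse_fullname(fullname: str) -> tuple[str, str]:
    """Split a fullname such as 't5_36en4' into (object type, ID)."""
    prefix, _, item_id = fullname.partition("_")
    if prefix not in KIND_PREFIXES or not item_id:
        raise ValueError(f"not a Reddit fullname: {fullname!r}")
    return KIND_PREFIXES[prefix], item_id


print(parse_fullname("t5_36en4"))  # ('subreddit', '36en4')
```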

Querying Subreddits

About a subreddit

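import os

import requests

# Fetch metadata about r/dataengineering; TOKEN and USER_AGENT come from the environment.
response = requests.get(
    "https://oauth.reddit.com/r/dataengineering/about",
    headers={
        'Authorization': f"bearer {os.getenv('TOKEN')}",
        "User-Agent": os.getenv('USER_AGENT'),
    },
)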

Another informational endpoint:

{
    "kind": "t5",
    "data": {
        "display_name": "dataengineering",
        "header_img": null,
        "title": "Data Engineering",
        "allow_galleries": true,
        "icon_size": null,
        "primary_color": "",
        "active_user_count": 63,
        "icon_img": "",
        "display_name_prefixed": "r/dataengineering",
        "accounts_active": 63,
        "public_traffic": false,
        "subscribers": 218387,
        "user_flair_richtext": [],
        "videostream_links_count": 0,
        "name": "t5_36en4",
        ...
    }
}
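As an illustration of working with a response shaped like the one above, the sketch below pulls out a handful of fields. It uses a trimmed copy of the sample payload rather than a live request; `summarize_subreddit` and the output key names are hypothetical choices made for this example.

```python
import json

# Trimmed copy of the sample response above; only a few fields are kept.
sample = json.loads("""
{
    "kind": "t5",
    "data": {
        "display_name": "dataengineering",
        "title": "Data Engineering",
        "active_user_count": 63,
        "subscribers": 218387,
        "name": "t5_36en4"
    }
}
""")


def summarize_subreddit(payload: dict) -> dict:
    """Pick a few descriptive fields out of a subreddit (t5) object."""
    if payload.get("kind") != "t5":
        raise ValueError("expected a subreddit (t5) object")
    data = payload["data"]
    return {
        "fullname": data["name"],
        "name": data["display_name"],
        "title": data["title"],
        "subscribers": data["subscribers"],
        # Read with .get() so a missing field yields None instead of an error.
        "active_users": data.get("active_user_count"),
    }


print(summarize_subreddit(sample))
```

With a live call, `response.json()` from the request above could be passed in place of `sample`.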

Listing submissions, a.k.a. posts

Use the following endpoints

Take this with a grain of salt, but here is how each endpoint works

The default sort is hot

These two endpoints are equivalent

The t parameter is only accepted by top and controversial
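The rules above (hot as the default sort, t limited to top and controversial) can be sketched as a small request builder. This is an illustration under assumptions, not the repository's code: `build_listing_request` is a hypothetical helper, and the sort names and t values are taken from Reddit's API documentation rather than from this README.

```python
BASE_URL = "https://oauth.reddit.com"

# Assumed from Reddit's API documentation, not from this README.
SORTS = {"hot", "new", "rising", "top", "controversial"}
TIME_SORTS = {"top", "controversial"}
TIME_RANGES = {"hour", "day", "week", "month", "year", "all"}


def build_listing_request(subreddit, sort="hot", t=None, limit=25):
    """Return (url, params) for a subreddit listing, validating sort and t."""
    if sort not in SORTS:
        raise ValueError(f"unknown sort: {sort!r}")
    params = {"limit": limit}
    if t is not None:
        # The time range only makes sense for top and controversial.
        if sort not in TIME_SORTS:
            raise ValueError(f"t is not supported for sort {sort!r}")
        if t not in TIME_RANGES:
            raise ValueError(f"unknown time range: {t!r}")
        params["t"] = t
    return f"{BASE_URL}/r/{subreddit}/{sort}", params


url, params = build_listing_request("dataengineering", sort="top", t="week")
print(url, params)
```

The returned pair can be passed to `requests.get(url, params=params, headers=...)` with the same Authorization and User-Agent headers used for the about request.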

Pagination Limitations

Dynamic nature of the platform

Historical Data Accessibility:
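As a rough sketch of paging through a listing, the generator below follows the `after` cursor that Reddit's API documentation describes for listing responses (`data.children` plus `data.after`). `iter_listing` and `fetch_page` are hypothetical names; `fetch_page` stands in for whatever function performs the HTTP request, so the paging logic can be exercised without network access.

```python
from typing import Callable, Iterator, Optional


def iter_listing(
    fetch_page: Callable[[dict], dict],
    limit: int = 100,
    max_items: Optional[int] = None,
) -> Iterator[dict]:
    """Yield post data dicts page by page until the cursor runs out."""
    after = None
    yielded = 0
    while True:
        params = {"limit": limit}
        if after:
            params["after"] = after
        page = fetch_page(params)["data"]
        for child in page["children"]:
            yield child["data"]
            yielded += 1
            if max_items is not None and yielded >= max_items:
                return
        after = page.get("after")
        # Stop when there is no further cursor or the page came back empty.
        if not after or not page["children"]:
            return


# Usage with canned pages in place of real HTTP calls.
pages = {
    None: {"data": {"children": [{"data": {"id": "a"}}, {"data": {"id": "b"}}], "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"id": "c"}}], "after": None}},
}


def fake_fetch(params: dict) -> dict:
    return pages[params.get("after")]


print([post["id"] for post in iter_listing(fake_fetch)])  # ['a', 'b', 'c']
```

Because posts are created, removed, and re-ranked while paging is in progress, a cursor-based walk like this can skip or repeat items; deduplicating on the post ID is a reasonable precaution.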

Searching a subreddit
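A minimal sketch of building a search request, assuming the `/r/{subreddit}/search` endpoint and the `q`, `restrict_sr`, `sort`, `t`, and `limit` parameters from Reddit's API documentation. `build_search_request` is a hypothetical helper, and the example query is made up for illustration.

```python
# Sort options assumed from Reddit's API documentation for the search endpoint.
SEARCH_SORTS = {"relevance", "hot", "top", "new", "comments"}


def build_search_request(subreddit, query, sort="relevance", t="all", limit=25):
    """Return (url, params) for a search restricted to one subreddit."""
    if sort not in SEARCH_SORTS:
        raise ValueError(f"unknown search sort: {sort!r}")
    params = {
        "q": query,
        # Without restrict_sr the search is not limited to this subreddit.
        "restrict_sr": "true",
        "sort": sort,
        "t": t,
        "limit": limit,
    }
    return f"https://oauth.reddit.com/r/{subreddit}/search", params


url, params = build_search_request("dataengineering", "airflow vs dagster", sort="top", t="year")
print(url, params)
```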

Tags

On Reddit, tags are labels used to categorize and organize posts within a subreddit. They help users quickly identify the type of content or the topic of the post.

Below are the tags for the data engineering subreddit. To increase data quality, posts tagged as Meme are excluded from the dataset.

[Image: tags of the data engineering subreddit]
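The Meme exclusion could be sketched as a simple filter. This assumes the tag is exposed as the `link_flair_text` field of each post's data (Reddit's API calls these tags flair); `EXCLUDED_FLAIRS` and `drop_excluded` are hypothetical names, and the repository's actual filtering may differ.

```python
# Flair values to exclude, compared case-insensitively.
EXCLUDED_FLAIRS = {"meme"}


def drop_excluded(posts: list[dict]) -> list[dict]:
    """Return the posts whose flair is not in EXCLUDED_FLAIRS."""
    kept = []
    for post in posts:
        # link_flair_text can be missing or None for untagged posts.
        flair = (post.get("link_flair_text") or "").strip().lower()
        if flair not in EXCLUDED_FLAIRS:
            kept.append(post)
    return kept


posts = [
    {"id": "a", "link_flair_text": "Meme"},
    {"id": "b", "link_flair_text": "Discussion"},
    {"id": "c", "link_flair_text": None},
]
print([post["id"] for post in drop_excluded(posts)])  # ['b', 'c']
```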

💡 Future Work

💡 What about focusing on different