adrianco / meGPT

Process Twitter archive for ingestion #4

Closed: adrianco closed this issue 1 month ago

adrianco commented 2 months ago

Since we are looking for answers to questions, the starting point is to filter the tweet archive to include only conversations: tweets that are linked to other tweets. In my case this reduced the size of the file to about 20% of the raw feed. The code to do this was generated by asking ChatGPT, and it seemed to work. The full chat sequence follows. The final version of the script has been saved to the repo along with the output conversations file.

Prompt: I'd like a script that processes a twitter archive directory structure to remove private content like DMs and leave the public content intact

ChatGPT: Sure, I can help you with that. Here’s a Python script that processes a Twitter archive directory to remove private content like Direct Messages (DMs) while keeping public content intact. This script assumes that the Twitter archive is in the standard directory structure as provided by Twitter's data export tool.

Requirements:

  1. Python installed on your system.
  2. os and json modules (these are standard Python libraries).

Script:

import os

def remove_private_content(archive_dir):
    # Define the paths for the private content files to be removed
    dm_paths = [
        os.path.join(archive_dir, 'data', 'direct-messages.js'),
        os.path.join(archive_dir, 'data', 'direct-messages-group.js')
    ]

    # Remove the Direct Messages files
    for dm_path in dm_paths:
        if os.path.exists(dm_path):
            os.remove(dm_path)
            print(f"Removed {dm_path}")
        else:
            print(f"{dm_path} does not exist")

    # Handle other potential private data files if necessary
    other_private_files = [
        os.path.join(archive_dir, 'data', 'account.js'),  # Contains account-related information
        os.path.join(archive_dir, 'data', 'contact.js'),  # Contains contacts data
        os.path.join(archive_dir, 'data', 'mobile.js')    # Contains mobile data
    ]

    for file_path in other_private_files:
        if os.path.exists(file_path):
            with open(file_path, 'r', encoding='utf-8') as f:
                data = f.read()

            if 'window.YTD.account' in data:
                data = data.replace('window.YTD.account', 'window.YTD.account_public')
            elif 'window.YTD.contact' in data:
                data = data.replace('window.YTD.contact', 'window.YTD.contact_public')
            elif 'window.YTD.mobile' in data:
                data = data.replace('window.YTD.mobile', 'window.YTD.mobile_public')

            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(data)
                print(f"Processed {file_path} for public data only")

if __name__ == "__main__":
    archive_directory = input("Enter the path to your Twitter archive directory: ")
    if os.path.exists(archive_directory) and os.path.isdir(archive_directory):
        remove_private_content(archive_directory)
    else:
        print("Invalid directory path")

Explanation:

  1. Removing DM Files: The script first attempts to remove the Direct Messages files (direct-messages.js and direct-messages-group.js). These are typically where private conversations are stored.
  2. Processing Other Files: The script looks for other files that may contain private data (account.js, contact.js, and mobile.js) and renames the JavaScript variable inside each one to mark it as public-only. This is a simplistic approach: renaming the variable does not actually remove any private fields, so depending on the contents further processing may be needed (see the sketch after this list).
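
As a sketch of that further processing (not part of the generated script), the snippet below parses account.js as JSON and deletes sensitive fields outright. The window.YTD.account.part0 prefix, the 'account' wrapper key, and the field names email, phoneNumber, and createdViaIp are assumptions about the export format and should be checked against your own archive:

import json

def redact_account_file(path):
    # Field names assumed for illustration; check your own account.js
    sensitive_keys = {'email', 'phoneNumber', 'createdViaIp'}

    with open(path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Keep everything between the first '[' and the last ']' to get JSON
    payload = content[content.index('['): content.rindex(']') + 1]
    records = json.loads(payload)

    # Each record is assumed to wrap its data in an 'account' object
    for record in records:
        account = record.get('account', {})
        for key in sensitive_keys:
            account.pop(key, None)

    # Write back with the assumed JS variable assignment restored
    with open(path, 'w', encoding='utf-8') as f:
        f.write('window.YTD.account.part0 = ' + json.dumps(records, indent=2) + ';')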

How to Run:

  1. Save the script as clean_twitter_archive.py.
  2. Open a terminal or command prompt.
  3. Navigate to the directory where you saved the script.
  4. Run the script using python clean_twitter_archive.py.
  5. Enter the path to your Twitter archive directory when prompted.

This script will help you remove private content from your Twitter archive while keeping your public data intact. Adjust the script as needed based on the actual structure and contents of your Twitter archive.

Prompt: remove deleted tweets as well

ChatGPT: To extend the script to remove deleted tweets as well, we'll need to identify where the deleted tweets are stored and ensure they're removed from the archive. Typically, deleted tweets might still appear in the tweets.js file but be marked in a way that indicates they were deleted.

Enhanced Script:

import os
import json

def remove_private_content_and_deleted_tweets(archive_dir):
    # Define the paths for the private content files to be removed
    dm_paths = [
        os.path.join(archive_dir, 'data', 'direct-messages.js'),
        os.path.join(archive_dir, 'data', 'direct-messages-group.js')
    ]

    # Remove the Direct Messages files
    for dm_path in dm_paths:
        if os.path.exists(dm_path):
            os.remove(dm_path)
            print(f"Removed {dm_path}")
        else:
            print(f"{dm_path} does not exist")

    # Handle other potential private data files if necessary
    other_private_files = [
        os.path.join(archive_dir, 'data', 'account.js'),  # Contains account-related information
        os.path.join(archive_dir, 'data', 'contact.js'),  # Contains contacts data
        os.path.join(archive_dir, 'data', 'mobile.js')    # Contains mobile data
    ]

    for file_path in other_private_files:
        if os.path.exists(file_path):
            with open(file_path, 'r', encoding='utf-8') as f:
                data = f.read()

            if 'window.YTD.account' in data:
                data = data.replace('window.YTD.account', 'window.YTD.account_public')
            elif 'window.YTD.contact' in data:
                data = data.replace('window.YTD.contact', 'window.YTD.contact_public')
            elif 'window.YTD.mobile' in data:
                data = data.replace('window.YTD.mobile', 'window.YTD.mobile_public')

            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(data)
                print(f"Processed {file_path} for public data only")

    # Process tweets.js to remove deleted tweets
    tweets_path = os.path.join(archive_dir, 'data', 'tweets.js')
    if os.path.exists(tweets_path):
        with open(tweets_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # Strip off the JS variable assignment to parse JSON
        json_content = content[content.index('['): content.rindex(']') + 1]
        tweets = json.loads(json_content)

        # Filter out deleted tweets
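        # NOTE: 'is_deleted' is an assumed marker; standard Twitter exports
        # do not document such a flag, so if it is absent this filter is a
        # no-op and every tweet is kept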
        public_tweets = [tweet for tweet in tweets if not tweet.get('tweet', {}).get('is_deleted')]

        # Convert back to JSON and re-add the JS variable assignment
        new_content = 'window.YTD.tweets.part0 = ' + json.dumps(public_tweets, indent=2) + ';'

        with open(tweets_path, 'w', encoding='utf-8') as f:
            f.write(new_content)
            print(f"Processed {tweets_path} to remove deleted tweets")

if __name__ == "__main__":
    archive_directory = input("Enter the path to your Twitter archive directory: ")
    if os.path.exists(archive_directory) and os.path.isdir(archive_directory):
        remove_private_content_and_deleted_tweets(archive_directory)
    else:
        print("Invalid directory path")

Explanation:

  1. Removing DM Files: This part remains unchanged and removes files containing direct messages.
  2. Processing Other Files: This part remains unchanged and handles other private data files by renaming their variables.
  3. Removing Deleted Tweets:
    • The script opens the tweets.js file and reads its content.
    • It extracts the JSON part from the file content, which is enclosed within square brackets.
    • The script then filters out tweets marked as deleted (is_deleted flag).
    • It writes the filtered tweets back to tweets.js while preserving the JavaScript variable assignment at the beginning.

How to Run:

  1. Save the script as clean_twitter_archive.py.
  2. Open a terminal or command prompt.
  3. Navigate to the directory where you saved the script.
  4. Run the script using python clean_twitter_archive.py.
  5. Enter the path to your Twitter archive directory when prompted.

This enhanced script will help you remove both private content and deleted tweets from your Twitter archive, ensuring only your public and active tweets remain. Adjust the script further based on the exact structure of your Twitter archive if necessary.

Prompt: I'd like to process the tweets.js file from a twitter archive to extract all the conversations, and ignore standalone tweets and retweets.

ChatGPT: To process the tweets.js file from a Twitter archive and extract all the conversations while ignoring standalone tweets and retweets, you can follow these steps:

  1. Parse the tweets.js file to extract the tweet data.
  2. Identify tweets that are part of conversations. These tweets typically have an in_reply_to_status_id field.
  3. Ignore standalone tweets and retweets.
  4. Save the conversation tweets to a new file or print them out.

Here’s a Python script to achieve this:

import os
import json

def extract_conversations(tweets_path, output_path):
    if os.path.exists(tweets_path):
        with open(tweets_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # Strip off the JS variable assignment to parse JSON
        json_content = content[content.index('['): content.rindex(']') + 1]
        tweets = json.loads(json_content)

        # Extract tweets that are part of conversations
        conversations = [tweet for tweet in tweets if tweet.get('tweet', {}).get('in_reply_to_status_id')]

        # Save conversations to a new file
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(conversations, f, indent=2)
            print(f"Extracted {len(conversations)} conversation tweets to {output_path}")
    else:
        print(f"{tweets_path} does not exist")

if __name__ == "__main__":
    archive_directory = input("Enter the path to your Twitter archive directory: ")
    tweets_file = os.path.join(archive_directory, 'data', 'tweets.js')
    output_file = os.path.join(archive_directory, 'data', 'conversations.json')

    if os.path.exists(archive_directory) and os.path.isdir(archive_directory):
        extract_conversations(tweets_file, output_file)
    else:
        print("Invalid directory path")

Explanation:

  1. Reading the Tweets File: The script opens tweets.js and reads its content.
  2. Parsing the JSON Data: It extracts the JSON array from the JavaScript file by removing the surrounding JavaScript code.
  3. Filtering Conversations: The script filters tweets to find those that are part of conversations by checking the in_reply_to_status_id field.
  4. Saving the Results: The filtered conversation tweets are saved to a new file named conversations.json.

How to Run:

  1. Save the script as extract_conversations.py.
  2. Open a terminal or command prompt.
  3. Navigate to the directory where you saved the script.
  4. Run the script using python extract_conversations.py.
  5. Enter the path to your Twitter archive directory when prompted.

This script will extract all the conversation tweets from your Twitter archive and save them to a new file, ignoring standalone tweets and retweets. Adjust the script as necessary based on the exact structure of your Twitter archive.
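
One caveat: retweets are excluded here only because they normally lack the reply fields. To drop them explicitly as well, a stricter filter could also check full_text for the 'RT @' prefix that most exports use to store retweets (an assumption worth verifying against your own archive):

conversations = [
    tweet for tweet in tweets
    if tweet.get('tweet', {}).get('in_reply_to_status_id')
    and not tweet.get('tweet', {}).get('full_text', '').startswith('RT @')
]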

Prompt: My Twitter archive also has a tweets-part1.js file. How is this related to the tweets.js file?

ChatGPT: In a Twitter archive, the presence of files like tweets.js, tweets-part1.js, tweets-part2.js, etc., indicates that your tweet data has been split into multiple parts due to the large number of tweets. Twitter splits these files to make it easier to manage and process the data without hitting file size limits.

Here's a brief explanation of these files:

Understanding File Structure

Each file contains a single assignment of this form, one part per file:

window.YTD.tweets.part0 = [ ... ];
window.YTD.tweets.part1 = [ ... ];
window.YTD.tweets.part2 = [ ... ];

The variable window.YTD.tweets.partX indicates which part of the tweet data is being represented. The JSON array following the assignment contains tweet objects.
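
Since the part number differs from file to file, a slightly more robust way to strip the assignment than searching for the first '[' is a regular expression. This helper is only a sketch and is not part of the scripts in this thread:

import json
import re

def load_tweet_part(path):
    with open(path, 'r', encoding='utf-8') as f:
        content = f.read()
    # Drop the leading "window.YTD.tweets.partN = " assignment and any
    # trailing semicolon, leaving a plain JSON array
    payload = re.sub(r'^window\.YTD\.tweets\.part\d+\s*=\s*', '', content.strip())
    return json.loads(payload.rstrip(';'))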

Updated Script to Handle Multiple Files

Here’s an updated Python script to process all tweet parts (tweets.js, tweets-part1.js, tweets-part2.js, etc.) and extract only conversation tweets:

import os
import json
import glob

def extract_conversations(archive_dir, output_file):
    tweet_files = glob.glob(os.path.join(archive_dir, 'data', 'tweets*.js'))
    conversations = []

    for tweet_file in tweet_files:
        if os.path.exists(tweet_file):
            with open(tweet_file, 'r', encoding='utf-8') as f:
                content = f.read()

            # Find the JSON array within the JavaScript file
            json_content = content[content.index('['): content.rindex(']') + 1]
            tweets = json.loads(json_content)

            # Extract tweets that are part of conversations
            conversation_tweets = [
                tweet for tweet in tweets 
                if tweet.get('tweet', {}).get('in_reply_to_status_id') or tweet.get('tweet', {}).get('in_reply_to_user_id')
            ]
            conversations.extend(conversation_tweets)
            print(f"Processed {tweet_file}: Found {len(conversation_tweets)} conversation tweets")

    # Save conversations to a new file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(conversations, f, indent=2)
        print(f"Extracted a total of {len(conversations)} conversation tweets to {output_file}")

if __name__ == "__main__":
    archive_directory = input("Enter the path to your Twitter archive directory: ")
    output_filename = "conversations.json"

    if os.path.exists(archive_directory) and os.path.isdir(archive_directory):
        output_path = os.path.join(archive_directory, 'data', output_filename)
        extract_conversations(archive_directory, output_path)
    else:
        print("Invalid directory path")

Explanation

  1. Using glob: The script uses the glob module to find all files matching the pattern tweets*.js in the archive's data directory. This ensures it processes all tweet parts, including tweets.js, tweets-part1.js, tweets-part2.js, etc.

  2. Parsing JSON Content: The script reads each file's content, strips out the JavaScript variable assignment, and extracts the JSON array.

  3. Identifying Conversations: It filters tweets that are part of conversations. This includes:

    • Tweets with a non-null in_reply_to_status_id (indicating they are replies).
    • Tweets with a non-null in_reply_to_user_id (indicating they are replies to another user's tweet).
  4. Storing Conversations: The filtered conversation tweets are appended to a list, which is then saved to conversations.json. A trimmed-down sample of the tweet object shape these checks rely on is shown below.
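
Each entry in the parsed array wraps its data in a tweet object. The values below are made up, and only the fields the filter touches are shown; real archive entries carry many more fields:

{
  "tweet": {
    "id_str": "1234567890",
    "full_text": "@someone Replying to your point...",
    "in_reply_to_status_id": "1234567889",
    "in_reply_to_user_id": "987654321"
  }
}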

How to Run

  1. Save the script as extract_conversations.py.
  2. Open a terminal or command prompt.
  3. Navigate to the directory where you saved the script.
  4. Run the script using python extract_conversations.py.
  5. Enter the path to your Twitter archive directory when prompted.

This script will effectively extract all conversation tweets from multiple parts of your Twitter archive, ensuring you have a consolidated list of conversations across all your tweet data files.

adrianco commented 2 months ago

Here's the full ChatGPT transcript https://chatgpt.com/share/8e0742c3-a44b-4fe4-979a-3096c26a588e

adrianco commented 1 month ago

Working well enough for now. Conversations saved into virtual_adrianco