QUT-Digital-Observatory / coordination-network-toolkit

A small command line tool and set of functions for studying coordination networks in Twitter and other social media data.
MIT License

ValueError: not enough values to unpack (expected 8, got 6) #40

Closed bkrdmr closed 2 years ago

bkrdmr commented 2 years ago

Hello, I have been experimenting with data from different social media platforms, following the guidelines in this repo. Lately I've tried processing YouTube comments, so the reply_id and urls columns are empty. I am seeing the following ValueError in the preprocessing phase. Do you have any suggestions to overcome this?

ValueError                                Traceback (most recent call last)
<ipython-input-16-e8dc9a9d85db> in <module>()
      1 db = "comments.db"
      2 file = "comments.csv"
----> 3 coord_net_tk.preprocess.preprocess_csv_files(db, [file])

/home/bkrdmr/anaconda3/envs/co/lib/python3.6/site-packages/coordination_network_toolkit/preprocess.py in preprocess_csv_files(db_path, input_filenames)
     20             # Skip header
     21             next(reader)
---> 22             preprocess_data(db_path, reader)
     23 
     24         print(f"Done preprocessing {message_file} into {db_path}")

/home/bkrdmr/anaconda3/envs/co/lib/python3.6/site-packages/coordination_network_toolkit/preprocess.py in preprocess_data(db_path, messages)
     72         )
     73 
---> 74         for row in processed:
     75             db.execute(
     76                 "insert or ignore into edge values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",

/home/bkrdmr/anaconda3/envs/co/lib/python3.6/site-packages/coordination_network_toolkit/preprocess.py in <genexpr>(.0)
     69                 urls.split(" ") if urls else [],
     70             )
---> 71             for message_id, user_id, username, repost_id, reply_id, message, timestamp, urls in messages
     72         )
     73 

ValueError: not enough values to unpack (expected 8, got 6)
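For context, the error comes from Python's tuple unpacking: the preprocessor expects each row to unpack into exactly 8 fields. A minimal sketch of the failure, using a hypothetical 6-field row:

```python
# Hypothetical 6-field row: two fields short of the 8 the preprocessor expects.
row = ["id1", "user1", "name1", "", "hello world", "1600000000"]

try:
    message_id, user_id, username, repost_id, reply_id, message, timestamp, urls = row
except ValueError as e:
    print(e)  # not enough values to unpack (expected 8, got 6)
```

So any CSV line that yields fewer than 8 fields will trigger exactly this traceback.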
SamHames commented 2 years ago

> so reply_id and urls columns are empty

Are the columns present but empty (i.e., they have comma delimiters but empty strings)? The error looks like the columns aren't present in the file, or have been mangled somehow by the import code.
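One quick way to tell the two cases apart (a sketch, assuming a plain comma-delimited export) is to count fields per row with the csv module. Empty columns still count, so a well-formed file shows 8 fields on every line even when reply_id and urls are blank:

```python
import csv
import io

# Illustrative sample: 8 columns, with reply_id and urls left empty.
sample = "id,user,name,repost,reply,message,ts,urls\n1,u1,n1,,,hi,1600000000,\n"

for row in csv.reader(io.StringIO(sample)):
    print(len(row))  # 8 for both the header and the data row
```

Running the same loop over the real file and flagging any row where `len(row) != 8` would pinpoint the offending lines.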

Out of interest - are you converting JSON data collected via the YouTube API into a CSV and using that? If you can share the code doing the JSON -> CSV conversion (just a gist or something) I might be able to add native support for the format, similar to the Twitter format.

bkrdmr commented 2 years ago

Columns are present. I've tried with dummy values but got the same result. The data is stored in my lab's regular databases; I extracted it as CSV files and re-ordered the columns in pandas per your guideline before saving to a new CSV for preprocessing.

df = df[['comment_id', 'commenter_id', 'commenter_name', 'video_id', 'reply_to', 'comment_displayed', 'published_date']]
df['urls'] = ""
df['reply_to'] = ""
df['published_date'] = pd.to_datetime(df['published_date'])
df['published_date'] = (df['published_date'] - pd.Timestamp("1970-01-01 00:00:00+00:00")) // pd.Timedelta('1s')
df.to_csv('comments.csv', index=False, encoding='utf-8')
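The epoch conversion in the last two lines can be cross-checked with the standard library (a sketch; the date below is just an illustrative value, not from the dataset):

```python
from datetime import datetime, timezone

# Stdlib cross-check of the pandas epoch-seconds conversion above.
# 2020-09-13 12:26:40 UTC is an arbitrary illustrative timestamp.
ts = datetime(2020, 9, 13, 12, 26, 40, tzinfo=timezone.utc)
print(int(ts.timestamp()))  # 1600000000
```

If both approaches agree on a few sample values, the timestamp column itself is unlikely to be the problem.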
SamHames commented 2 years ago

Thanks for confirming - I'll try and take a look at what's going on today or tomorrow.

bkrdmr commented 2 years ago

Thank you! Will check again.

SamHames commented 2 years ago

I had a quick look into this - I wonder if the problem is the CSV file is being misinterpreted within the toolkit?

I think two things to try are:

  1. Try working with just a small sample of rows - if it works for that, it's probably a representation problem with specific rows: df.head().to_csv('comments.csv', index=False, encoding='utf-8')
  2. Quote all fields in the CSV output: df.to_csv('comments.csv', index=False, encoding='utf-8', quoting=1)
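For reference, quoting=1 is the value of csv.QUOTE_ALL, so the second suggestion can be written more explicitly as df.to_csv(..., quoting=csv.QUOTE_ALL). A small sketch of what that quoting mode does:

```python
import csv
import io

# QUOTE_ALL wraps every field in quotes, which guards against commas or
# newlines inside comment text being misread as field/row delimiters.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(["id1", "", "a comment, with a comma"])
print(buf.getvalue())  # "id1","","a comment, with a comma"
```

With minimal quoting (the pandas default), only the third field here would be quoted; QUOTE_ALL removes any ambiguity about where fields begin and end.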

If neither of those helps, I might ask you to share an example file with me so I can debug it for you.

SamHames commented 2 years ago

Alternatively, since you're already writing Python, you can cut out the CSV middleman and work directly from the dataframe via the toolkit as a Python library. These functions are safe to use and aren't expected to change; I just haven't had time to write documentation beyond the snippet in the readme.

from coordination_network_toolkit.preprocess import preprocess_data

# Create a generator of pandas rows, since iterrows returns an index and the row content
rows = (row for (i, row) in df.iterrows())

preprocess_data('youtube_comments.db', rows)
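A variant of the same idea (an untested sketch, with hypothetical sample values standing in for the real dataframe): df.itertuples(index=False, name=None) yields plain tuples in column order, which unpack exactly like the iterrows generator above:

```python
import pandas as pd

# A tiny frame standing in for the comments dataframe (hypothetical values).
df = pd.DataFrame(
    [["id1", "u1", "n1", "", "", "hi", 1600000000, ""]],
    columns=["comment_id", "commenter_id", "commenter_name", "video_id",
             "reply_to", "comment_displayed", "published_date", "urls"],
)

# itertuples(index=False, name=None) yields plain 8-tuples in column order.
rows = df.itertuples(index=False, name=None)
print(next(rows))  # ('id1', 'u1', 'n1', '', '', 'hi', 1600000000, '')
```

Those tuples can be passed straight to preprocess_data in the same way as the iterrows generator.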
bkrdmr commented 2 years ago

Yes, it is now working. Using preprocess_data() solved the issue, so I guess something was wrong in the CSV. I wonder why you chose directed graphs instead of undirected graphs for co-retweet behavior, though.

Thank you for the prompt response and quick fix. This is definitely helpful.