euagendas / m3inference

A deep learning system for demographic inference (gender, age, and individual/person) that was trained on a massive Twitter dataset using profile images, screen names, names, and biographies
http://www.euagendas.org
GNU Affero General Public License v3.0

Feature: Support v2 API data as input #31

Open narcisoyu opened 2 years ago

narcisoyu commented 2 years ago

Hi! I'm doing a research project about Twitter analysis.

I fetched user data with the Twitter Academic API (v2), and after calling M3Twitter.transform_jsonl(...) I got the following error:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-23da1cf5d317> in <module>
      5 ,access_token=' ',access_secret=' ')
      6 
----> 7 m3twitter.transform_jsonl(input_file="test.jsonl", output_file="test_result.jsonl")

~/opt/anaconda3/lib/python3.8/site-packages/m3inference/m3twitter.py in transform_jsonl(self, input_file, output_file, img_path_key, lang_key, resize_img, keep_full_size_img)
     48             with open(output_file, "w") as fhOut:
     49                 for line in fhIn:
---> 50                     m3vals = self.transform_jsonl_object(line, img_path_key=img_path_key, lang_key=lang_key,
     51                                                          resize_img=resize_img, keep_full_size_img=keep_full_size_img)
     52                     fhOut.write("{}\n".format(json.dumps(m3vals)))

~/opt/anaconda3/lib/python3.8/site-packages/m3inference/m3twitter.py in transform_jsonl_object(self, input, img_path_key, lang_key, resize_img, keep_full_size_img)
     80             else:
     81                 img_file_resize = img_path
---> 82         elif user["default_profile_image"]:
     83             # Default profile image
     84             img_file_resize = TW_DEFAULT_PROFILE_IMG

KeyError: 'default_profile_image'

I also ran the example data provided in m3inference/test/twitter_cache/, and the function runs without errors on it.

Then I double-checked the JSONL file. It looks like the two versions of the Twitter API (v1.1 and v2) return slightly different user objects, and I suppose the example data were produced with the v1.1 API. For details, see: https://developer.twitter.com/en/docs/twitter-api/migrate/data-formats/standard-v1-1-to-v2

I'm not sure if my comment makes sense, maybe you could have a look? Thanks in advance!

computermacgyver commented 2 years ago

Thanks, @narcisoyu. You are correct that this code was designed for the v1.1 API.

We've not written ingest code for v2, but it should be straightforward. I'm happy to support you in doing this if you're willing to have a go.

computermacgyver commented 2 years ago

For anyone who finds this: the "Academic API" uses the v2 format.

Ultimately, m3 needs data in the format shown in this example file: https://github.com/euagendas/m3inference/blob/master/test/data.jsonl

We have code to go from the v1.1 API to that format, but do not have code to go from the v2 API output to that format. I'd like to add that but do not have capacity at the moment.
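Until that exists, one possible workaround is to rename the v2 user fields to their v1.1 equivalents before calling transform_jsonl. The following is only a rough sketch, not part of m3inference: the helper names v2_user_to_v1 and v2_line_to_v1_line are made up for illustration, and the only v1.1 key confirmed by the traceback above is default_profile_image. The other target keys are the standard v1.1 user-object names and are assumed to be what m3twitter reads. v2 has no default_profile_image flag, so the sketch guesses it from the avatar URL, which is also an assumption.

```python
import json


def v2_user_to_v1(user_v2):
    """Rename Twitter API v2 user fields to v1.1-style names (hypothetical helper)."""
    img_url = user_v2.get("profile_image_url") or ""
    return {
        "id_str": str(user_v2["id"]),
        "name": user_v2.get("name") or "",
        "screen_name": user_v2.get("username") or "",
        "description": user_v2.get("description") or "",
        "profile_image_url_https": img_url,
        # Assumption: default avatars are served from a path containing
        # "default_profile_images"; a missing URL is treated as default too.
        "default_profile_image": (not img_url) or ("default_profile_images" in img_url),
    }


def v2_line_to_v1_line(line):
    """Convert one JSONL line holding a v2 user object (hypothetical helper).

    v2 responses usually wrap the payload in a "data" key; this handles a
    single wrapped user or a bare user object, not a list of users.
    """
    obj = json.loads(line)
    user = obj["data"] if isinstance(obj.get("data"), dict) else obj
    return json.dumps(v2_user_to_v1(user))
```

Each converted line could then be written to a new file and passed to transform_jsonl as before. This has not been tested against the library, so check a few output records against the example file linked above.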