Tweet serialization & deserialization loses (at least) data about media

M4rtinK commented 7 years ago

I've recently implemented tweet caching in my Twitter app, using AsDict() for serialization and NewFromJsonDict() for deserialization.

But I've noticed an inconsistency - if the serialized tweet has media, the media description is correctly serialized to the dict by AsDict(), but when the tweet is deserialized with NewFromJsonDict() the media property of the new Status instance is None.

I've also tried passing the dict to the Status constructor as kwargs. In that case media is not None, but it contains a list of plain dicts describing the media rather than the expected list of Media instances.

This is a short reproducer demonstrating the issue:

#!/usr/bin/python3

import twitter

as_dict_output = {
    'id_str': '874353511328169984',
    'media': [
        {'media_url': 'http://pbs.twimg.com/media/DCJTgZ_UIAAwFnw.jpg', 'url': 'https://t.co/JDv3Iz9L54', 'type': 'photo', 'display_url': 'pic.twitter.com/JDv3Iz9L54',
         'expanded_url': 'https://twitter.com/Space_Station/status/874353511328169984/photo/1', 'media_url_https': 'https://pbs.twimg.com/media/DCJTgZ_UIAAwFnw.jpg',
         'sizes': {'medium': {'resize': 'fit', 'w': 1200, 'h': 1149}, 'large': {'resize': 'fit', 'w': 2048, 'h': 1960}, 'thumb': {'resize': 'crop', 'w': 150, 'h': 150}, 'small': {'resize': 'fit', 'w': 680, 'h': 651}}, 'id': 874353093860663296},
        {'media_url': 'http://pbs.twimg.com/media/DCJThr7UwAE9rCo.jpg', 'url': 'https://t.co/JDv3Iz9L54', 'type': 'photo', 'display_url': 'pic.twitter.com/JDv3Iz9L54',
         'expanded_url': 'https://twitter.com/Space_Station/status/874353511328169984/photo/1', 'media_url_https': 'https://pbs.twimg.com/media/DCJThr7UwAE9rCo.jpg',
         'sizes': {'medium': {'resize': 'fit', 'w': 1200, 'h': 1114}, 'large': {'resize': 'fit', 'w': 2048, 'h': 1901}, 'thumb': {'resize': 'crop', 'w': 150, 'h': 150}, 'small': {'resize': 'fit', 'w': 680, 'h': 631}}, 'id': 874353115855634433},
        {'media_url': 'http://pbs.twimg.com/media/DCJTi68UQAA1ZPE.jpg', 'url': 'https://t.co/JDv3Iz9L54', 'type': 'photo', 'display_url': 'pic.twitter.com/JDv3Iz9L54',
         'expanded_url': 'https://twitter.com/Space_Station/status/874353511328169984/photo/1', 'media_url_https': 'https://pbs.twimg.com/media/DCJTi68UQAA1ZPE.jpg',
         'sizes': {'medium': {'resize': 'fit', 'w': 1134, 'h': 1200}, 'small': {'resize': 'fit', 'w': 643, 'h': 680}, 'large': {'resize': 'fit', 'w': 1935, 'h': 2048}, 'thumb': {'resize': 'crop', 'w': 150, 'h': 150}}, 'id': 874353137066196992},
        {'media_url': 'http://pbs.twimg.com/media/DCJTjY3UwAQbErC.jpg', 'url': 'https://t.co/JDv3Iz9L54', 'type': 'photo', 'display_url': 'pic.twitter.com/JDv3Iz9L54',
         'expanded_url': 'https://twitter.com/Space_Station/status/874353511328169984/photo/1', 'media_url_https': 'https://pbs.twimg.com/media/DCJTjY3UwAQbErC.jpg',
         'sizes': {'medium': {'resize': 'fit', 'w': 1200, 'h': 1149}, 'large': {'resize': 'fit', 'w': 2048, 'h': 1960}, 'thumb': {'resize': 'crop', 'w': 150, 'h': 150}, 'small': {'resize': 'fit', 'w': 680, 'h': 651}}, 'id': 874353145098321924},
    ],
    'urls': [], 'favorited': True,
    'full_text': '19 years ago today on June 12, 1998, Shuttle-Mir program ended when space shuttle Discovery landed with Mir crew member Andy Thomas. https://t.co/JDv3Iz9L54',
    'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    'user': {
        'following': True, 'profile_background_image_url': 'http://pbs.twimg.com/profile_background_images/517439388741931008/iRbQw1ch.jpeg',
        'location': 'Low Earth Orbit', 'utc_offset': -18000, 'profile_background_color': 'C0DEED',
        'description': "NASA's page for updates from the International Space Station, the world-class lab orbiting Earth 250 miles above. For the latest research, follow @ISS_Research.",
        'favourites_count': 5099, 'id': 1451773004, 'profile_image_url': 'http://pbs.twimg.com/profile_images/822552192875892737/zO1pmxzw_normal.jpg',
        'profile_sidebar_fill_color': 'DDEEF6', 'screen_name': 'Space_Station', 'time_zone': 'Central Time (US & Canada)',
        'profile_banner_url': 'https://pbs.twimg.com/profile_banners/1451773004/1497549511', 'friends_count': 234, 'verified': True,
        'profile_text_color': '333333', 'profile_link_color': '0084B4', 'url': 'https://t.co/9Gk2H0gekn', 'created_at': 'Thu May 23 15:25:28 +0000 2013',
        'name': 'Intl. Space Station', 'listed_count': 7602, 'followers_count': 1635947, 'statuses_count': 6976, 'lang': 'en',
    },
    'favorite_count': 1260, 'created_at': 'Mon Jun 12 19:51:36 +0000 2017',
    'hashtags': [], 'user_mentions': [], 'retweet_count': 415, 'retweeted': True, 'lang': 'en',
    'id': 874353511328169984,
}

print("tweet dict from AsDict() has media:")
print(as_dict_output["media"])

new_from_json_tweet = twitter.models.Status.NewFromJsonDict(as_dict_output)
print("tweet deserialized by NewFromJsonDict() doesn't have media:")
print(new_from_json_tweet.media)

kwargs_tweet = twitter.models.Status(**as_dict_output)
print("tweet created by passing the dict as kwargs has media:")
print(kwargs_tweet.media)
print("but it's just a list of dicts, not a list of twitter.models.Media instances")

Looking at the serialization/deserialization code in models.py, the issue seems to be that AsDict() puts the dicts representing the Media instances under a top-level key called "media", while NewFromJsonDict() only looks for media inside the dict under the "entities" or "extended_entities" key, not in the top-level dict namespace:

        if 'entities' in data:
            if 'urls' in data['entities']:
                urls = [Url.NewFromJsonDict(u) for u in data['entities']['urls']]
            if 'user_mentions' in data['entities']:
                user_mentions = [User.NewFromJsonDict(u) for u in data['entities']['user_mentions']]
            if 'hashtags' in data['entities']:
                hashtags = [Hashtag.NewFromJsonDict(h) for h in data['entities']['hashtags']]
            if 'media' in data['entities']:
                media = [Media.NewFromJsonDict(m) for m in data['entities']['media']]

        # the new extended entities
        if 'extended_entities' in data:
            if 'media' in data['extended_entities']:
                media = [Media.NewFromJsonDict(m) for m in data['extended_entities']['media']]

So these possible solutions come to my mind:

AsDict() should place the list of media dicts in ["entities"]["media"] or ["extended_entities"]["media"] instead of the top-level "media" key
NewFromJsonDict() should also look for media in the top-level "media" key
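
Until one of these lands in the library, a caller-side workaround along the lines of the first option might be enough. The following is only a sketch: nest_entities is a hypothetical helper, not part of python-twitter, and it assumes that the entity lists written by AsDict() are acceptable input for the NewFromJsonDict() calls quoted above.

```python
import copy

# Top-level keys that the quoted NewFromJsonDict() code expects under "entities".
ENTITY_KEYS = ("urls", "user_mentions", "hashtags", "media")


def nest_entities(as_dict_output):
    """Return a copy of an AsDict()-style dict with entity lists under "entities".

    Hypothetical helper, not part of python-twitter. The input is left untouched.
    """
    data = copy.deepcopy(as_dict_output)
    entities = data.setdefault("entities", {})
    for key in ENTITY_KEYS:
        # Only fill in what is missing; an existing "entities" entry wins.
        if key in data and key not in entities:
            entities[key] = data[key]
    return data
```

With the reproducer above, twitter.models.Status.NewFromJsonDict(nest_entities(as_dict_output)) would be expected to take the 'entities' branch and build Media instances, though this has not been checked against every field AsDict() can emit.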

jeremylow commented 6 years ago

For storage and caching, I'd use the _json attribute of the twitter.Status instance, since that is straight from Twitter. I'll look into this, but I don't think it will ever be a true one-to-one round trip from dict to dict, since there are things like empty lists (hashtags etc.) that are present (even if they're empty) in the original AsDict() output but don't get set by NewFromJsonDict() (if that makes sense).
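
A cache built on that suggestion could look roughly like the sketch below. save_raw and load_raw are hypothetical helper names, and storing one JSON file per tweet is an assumption; the only library detail relied on is the _json attribute mentioned above.

```python
import json


def save_raw(raw_tweet, path):
    """Write a raw tweet dict (for example status._json) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(raw_tweet, f)


def load_raw(path):
    """Read back a raw tweet dict previously written by save_raw()."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A tweet would then be stored with save_raw(status._json, path) and restored with twitter.models.Status.NewFromJsonDict(load_raw(path)). Because the raw API payload carries media under "entities" / "extended_entities", the lookup quoted earlier in this issue should find it.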

lifenautjoe commented 4 years ago

I'm also seeing no media being populated in the Status. 3 years after this was opened, is there a way forward?

bear / python-twitter, issue #484