DocNow / twarc

A command line tool (and Python library) for archiving Twitter JSON
https://twarc-project.readthedocs.io
MIT License

tweet_mode=extended as default #168

Closed edsu closed 7 years ago

edsu commented 7 years ago

It appears that fetching tweets from the statuses/lookup and search/tweets endpoints without the tweet_mode=extended option leaves important media entities out of the results.
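
As a rough illustration of where the option goes, the sketch below builds a statuses/lookup request URL with tweet_mode=extended. The helper name `lookup_url` is hypothetical and this is not twarc's own code; it only assembles the query string and sends nothing, so authentication is left out.

```python
from urllib.parse import urlencode

# v1.1 REST endpoint used to hydrate tweet ids
LOOKUP_ENDPOINT = "https://api.twitter.com/1.1/statuses/lookup.json"


def lookup_url(tweet_ids, tweet_mode="extended"):
    """Return a statuses/lookup URL for the given ids (hypothetical helper)."""
    params = {
        # the endpoint takes a comma-separated list of ids
        "id": ",".join(str(i) for i in tweet_ids),
        # "extended" asks for full_text and full media entities
        "tweet_mode": tweet_mode,
    }
    return LOOKUP_ENDPOINT + "?" + urlencode(params)


print(lookup_url([896431361308991489]))
```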

Consider this tweet that contains an embedded video. When fetched with the default tweet_mode=compat, the entities stanza looks like this:

    "entities": {
        "hashtags": [],
        "symbols": [],
        "urls": [
            {
                "display_url": "twitter.com/i/web/status/8\u2026",
                "expanded_url": "https://twitter.com/i/web/status/896431361308991489",
                "indices": [
                    117,
                    140
                ],
                "url": "https://t.co/iAFpYVouXz"
            }
        ],
        "user_mentions": []
    },

But when fetched with tweet_mode=extended explicitly set it looks like:

    "entities": {
        "hashtags": [],
        "media": [
            {
                "display_url": "pic.twitter.com/w1V4kEkHVx",
                "expanded_url": "https://twitter.com/cvillenews_desk/status/896431361308991489/video/1",
                "id": 896431022002368512,
                "id_str": "896431022002368512",
                "indices": [
                    133,
                    156
                ],
                "media_url": "http://pbs.twimg.com/ext_tw_video_thumb/896431022002368512/pu/img/mkeClcWY2tdmaZMN.jpg",
                "media_url_https": "https://pbs.twimg.com/ext_tw_video_thumb/896431022002368512/pu/img/mkeClcWY2tdmaZMN.jpg",
                "sizes": {
                    "large": {
                        "h": 1280,
                        "resize": "fit",
                        "w": 720
                    },
                    "medium": {
                        "h": 1067,
                        "resize": "fit",
                        "w": 600
                    },
                    "small": {
                        "h": 604,
                        "resize": "fit",
                        "w": 340
                    },
                    "thumb": {
                        "h": 150,
                        "resize": "crop",
                        "w": 150
                    }
                },
                "type": "photo",
                "url": "https://t.co/w1V4kEkHVx"
            }
        ],
        "symbols": [],
        "urls": [],
        "user_mentions": []
    }

tweet_mode=extended also results in a very useful extended_entities stanza that looks like:

    "extended_entities": {
        "media": [
            {
                "additional_media_info": {
                    "monetizable": false
                },
                "display_url": "pic.twitter.com/w1V4kEkHVx",
                "expanded_url": "https://twitter.com/cvillenews_desk/status/896431361308991489/video/1",
                "id": 896431022002368512,
                "id_str": "896431022002368512",
                "indices": [
                    133,
                    156
                ],
                "media_url": "http://pbs.twimg.com/ext_tw_video_thumb/896431022002368512/pu/img/mkeClcWY2tdmaZMN.jpg",
                "media_url_https": "https://pbs.twimg.com/ext_tw_video_thumb/896431022002368512/pu/img/mkeClcWY2tdmaZMN.jpg",
                "sizes": {
                    "large": {
                        "h": 1280,
                        "resize": "fit",
                        "w": 720
                    },
                    "medium": {
                        "h": 1067,
                        "resize": "fit",
                        "w": 600
                    },
                    "small": {
                        "h": 604,
                        "resize": "fit",
                        "w": 340
                    },
                    "thumb": {
                        "h": 150,
                        "resize": "crop",
                        "w": 150
                    }
                },
                "type": "video",
                "url": "https://t.co/w1V4kEkHVx",
                "video_info": {
                    "aspect_ratio": [
                        9,
                        16
                    ],
                    "duration_millis": 68512,
                    "variants": [
                        {
                            "bitrate": 832000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/896431022002368512/pu/vid/360x640/B21CrJnRz1IBCiHs.mp4"
                        },
                        {
                            "bitrate": 2176000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/896431022002368512/pu/vid/720x1280/NqQgDw4_5Zv2swqw.mp4"
                        },
                        {
                            "bitrate": 320000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/896431022002368512/pu/vid/180x320/iLU0F2Uy3VjOsdw3.mp4"
                        },
                        {
                            "content_type": "application/x-mpegURL",
                            "url": "https://video.twimg.com/ext_tw_video/896431022002368512/pu/pl/UtvqLHt1TuwP98Ea.m3u8"
                        }
                    ]
                }
            }
        ]
    }
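
To show why the extended_entities stanza is useful, here is a minimal sketch that picks the highest-bitrate MP4 from a tweet's video variants. The function name `best_mp4_variant` is hypothetical, and the sample dict is a trimmed copy of the stanza above.

```python
def best_mp4_variant(tweet):
    """Return the URL of the highest-bitrate MP4 variant, or None."""
    best = None
    for media in tweet.get("extended_entities", {}).get("media", []):
        for variant in media.get("video_info", {}).get("variants", []):
            # skip HLS playlists and anything else that is not a plain MP4
            if variant.get("content_type") != "video/mp4":
                continue
            if best is None or variant.get("bitrate", 0) > best.get("bitrate", 0):
                best = variant
    return best["url"] if best else None


# trimmed copy of the extended_entities stanza shown above
sample = {
    "extended_entities": {
        "media": [
            {
                "type": "video",
                "video_info": {
                    "variants": [
                        {
                            "bitrate": 832000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/896431022002368512/pu/vid/360x640/B21CrJnRz1IBCiHs.mp4",
                        },
                        {
                            "bitrate": 2176000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/896431022002368512/pu/vid/720x1280/NqQgDw4_5Zv2swqw.mp4",
                        },
                        {
                            "bitrate": 320000,
                            "content_type": "video/mp4",
                            "url": "https://video.twimg.com/ext_tw_video/896431022002368512/pu/vid/180x320/iLU0F2Uy3VjOsdw3.mp4",
                        },
                        {
                            "content_type": "application/x-mpegURL",
                            "url": "https://video.twimg.com/ext_tw_video/896431022002368512/pu/pl/UtvqLHt1TuwP98Ea.m3u8",
                        },
                    ]
                },
            }
        ]
    }
}

print(best_mp4_variant(sample))
```

For this tweet the function selects the 720x1280 variant, since its bitrate (2176000) is the highest of the three MP4 entries. A tweet fetched in compat mode has no extended_entities stanza here, so the function returns None.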

edsu commented 7 years ago

Unfortunately I didn't catch this tweet in the filter stream I had running, so I don't know how the media metadata would have appeared in it. But the sketchy docs and a bit of experimenting make me believe that this info would have appeared in the extended_entities stanza of a streamed tweet.
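
Since it is unclear where a streamed tweet carries this metadata, a defensive reader could check more than one place. The sketch below assumes that streamed payloads may nest the extended fields under an extended_tweet key; that is an assumption drawn from Twitter's extended-tweet documentation, not something verified in this thread. The name `media_entities` is hypothetical.

```python
def media_entities(tweet):
    """Return the richest media list found in a tweet dict, or []."""
    candidates = [
        # assumed location for streamed tweets (not verified here)
        tweet.get("extended_tweet", {}).get("extended_entities"),
        # REST responses fetched with tweet_mode=extended
        tweet.get("extended_entities"),
        # fallback: basic entities, which may lack video_info
        tweet.get("entities"),
    ]
    for stanza in candidates:
        media = (stanza or {}).get("media")
        if media:
            return media
    return []
```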

dwillis commented 7 years ago

I agree with this, and I would also like to see an option for using json2csv.py with JSON files created in extended mode.

edsu commented 7 years ago

This was fixed in the v1.2.0 release that is now out on PyPI.