Mincka / DMArchiver

A tool to archive the direct messages, images and videos from your private conversations on Twitter
GNU General Public License v3.0

Text added to cards may be incomplete #31

Open Mincka opened 7 years ago

Mincka commented 7 years ago

When a link is shared and the user adds additional text, the added text may not be included in the log.

In the following generated sample, "This is a test." is not included.

  <p class="TweetTextSize  js-tweet-text tweet-text" lang="" data-aria-label-part="0">How I lost my 25-year battle against corporate claptrap <a href="https://t.co/gIrbtXuRSv" rel="nofollow noopener" dir="ltr" data-expanded-url="https://www.ft.com/lucycolumn" class="twitter-timeline-link" target="_blank" title="https://www.ft.com/lucycolumn" >
        <span class="tco-ellipsis"/>
        <span class="invisible">https://www.</span>
        <span class="js-display-url">ft.com/lucycolumn</span>
        <span class="invisible"/>
        <span class="tco-ellipsis">
            <span class="invisible">&nbsp;</span>
        </span>
    </a> This is a test.</p>

This is because cssselect extracts only the text node before the <a> element, so the text after the closing </a> is lost. A workaround could be to use text_content():

def _parse_dm_text(self, element):
    # text_content() joins every text node under the paragraph,
    # including the text that follows the link.
    text_tweet = element.cssselect("p.tweet-text")[0]
    dm_text = text_tweet.text_content()
    return DirectMessageText(dm_text)
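
To make the text-node behaviour concrete, here is a minimal sketch. It assumes a trimmed-down copy of the sample paragraph and uses the standard library's xml.etree.ElementTree instead of lxml, since both expose the same text/tail model; "".join(p.itertext()) stands in for lxml's text_content(). The variable names are illustrative only and are not taken from DMArchiver.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the sample paragraph (most attributes removed).
SAMPLE = (
    '<p class="tweet-text">How I lost my 25-year battle against corporate claptrap '
    '<a href="https://t.co/gIrbtXuRSv">'
    '<span class="invisible">https://www.</span>'
    '<span class="js-display-url">ft.com/lucycolumn</span>'
    '</a> This is a test.</p>'
)

p = ET.fromstring(SAMPLE)
link = p.find("a")

# .text only holds the text node that precedes the first child element.
before_link = p.text

# The text after </a> is stored on the <a> element as its tail.
after_link = link.tail

# Joining every text node in document order recovers the full message.
full_text = "".join(p.itertext())

print(repr(before_link))
print(repr(after_link))
print(full_text)
```

The added text lives in the tail of the link rather than in the paragraph's own text, which is why reading only the first text node misses it.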

With this workaround, the output would be:

  [2017-08-16 13:37:49] <Julien Ehrhart> [Card-summary_large_image] https://www.ft.com/lucycolumn How I lost my 25-year battle against corporate claptrap https://www.ft.com/lucycolumn This is a test.

Two issues here:

  1. The link appears twice (once when the card is parsed, once when the text is parsed) -> Acceptable
  2. The emojis are not part of the text nodes, so text_content() strips them from the output -> Not acceptable
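
One possible direction for the second issue, sketched under assumptions: if emojis appear in the markup as <img> elements carrying the character in their alt attribute (an assumption about the page markup, not something confirmed in this issue), a small recursive walk can collect text, tails and alt values in document order. The helper name text_with_emojis, the "Emoji" class and the image file name are hypothetical. The sketch uses xml.etree.ElementTree; the same traversal should carry over to lxml elements, which share the text/tail model.

```python
import xml.etree.ElementTree as ET


def text_with_emojis(element):
    """Join text nodes in document order, substituting alt text for <img>."""
    parts = []
    if element.tag == "img":
        # Assumption: the emoji character is stored in the alt attribute.
        parts.append(element.get("alt", ""))
    if element.text:
        parts.append(element.text)
    for child in element:
        parts.append(text_with_emojis(child))
        # The tail is the text that follows the child's closing tag.
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)


# Hypothetical markup: a message containing an emoji image, a link, then added text.
SAMPLE = (
    '<p class="tweet-text">Nice article '
    '<img class="Emoji" alt="\U0001F44D" src="thumbs-up.png"/> '
    '<a href="https://t.co/gIrbtXuRSv">ft.com/lucycolumn</a>'
    ' This is a test.</p>'
)

result = text_with_emojis(ET.fromstring(SAMPLE))
print(result)
```

This keeps both the added text and the emoji in the output. Whether it fits the existing parsing code depends on how emojis are actually represented in the scraped HTML.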