TumblThreeApp / TumblThree

A Tumblr and Twitter Blog Backup Application
https://TumblThreeApp.github.io
MIT License

BUG: Texts are cutoff mid sentence. #475

Closed T-prog3 closed 10 months ago

T-prog3 commented 11 months ago
  1. I was downloading some Twitter text posts and realized that a majority of them were cut off mid-sentence, ending with "…". Shouldn't TumblThree capture whole posts?

  2. All objects have "url":null, but shouldn't this point to the post URL?

I expected to download full text posts with all the basic data (ID, DATETIME, TEXT) but only got the ID and part of the TEXT. DATETIME doesn't seem to be extracted at all?


thomas694 commented 11 months ago

Please provide a link to an affected post.

T-prog3 commented 11 months ago

All affected posts seem to be reposts, aka retweets.

Here are some examples, just add the id to the following url https://twitter.com/QuayBooksStore/status/ to get the full post:

{"id":"1681358856314667036","text":"RT @QuayBooksStore: 16 large  pages with  magnificent pop-ups including the Ornithopter - da Vinci's design for a flying machine,  Virgin a…","url":null}
{"id":"1674092339269083137","text":"RT @QuayBooksStore: In Limerick to see @kraftwerk ? Don't miss Ireland's most amazing small bookstore...a place to discover a rarity, a for…","url":null}
{"id":"1664648624402300932","text":"RT @QuayBooksStore: Every day on Sarsfield Street, a place to discover a rarity, a foreign work, a novel, an essay, a classic, a curiosity.…","url":null}
{"id":"1658737014324297728","text":"RT @QuayBooksStore: Every day on Sarsfield Street, a place to discover a rarity, a foreign work, a novel, an essay, a classic, a curiosity.…","url":null}
{"id":"1656542804930117632","text":"RT @QuayBooksStore: Warm welcome to everyone visiting Limerick for the  #housingpractitionersconference.  Inviting you to take a little tim…","url":null}
{"id":"1636663432542932992","text":"RT @QuayBooksStore: Visiting Limerick? Pick up a good read at our charming small bookstore, then round out your visit by enjoying all that…","url":null}
{"id":"1613806408994095105","text":"RT @QuayBooksStore: For rarities, classics, curiosities, and various unexpected titles, we invite you to try rummaging in the organic and l…","url":null}
{"id":"1603164706927255553","text":"RT @QuayBooksStore: Responding to the RTÉ news tweet that Amazon yesterday opened a processing centre of 630,000 sq ft in Dublin. IndieBoun…","url":null}

But there seems to be a deeper issue as well, because when i tried to download only text posts and metadata from random news channels i only got a fraction of all texts available. And those that i did get made no sense because of the missing DATETIME extraction.

thomas694 commented 11 months ago

... from random news channels i only got a fraction of all texts available.

Do you mean some text posts haven't been downloaded at all? Then please provide a link to an affected post.

T-prog3 commented 11 months ago

Exactly! For example, if i have all settings enabled except for thumbnails, reblogs and videos, then add https://twitter.com/Reuters, start downloading, and stop at around 100 downloads to evaluate the result, i get the following:

Number of posts: 1069474
[x] Download images: 104 of 118, duplicates found: 0
[ ] Download videos: 1 of 6, duplicates found 5
[x] Download text posts: 3 of 4, duplicates found 0

If i then look inside the folder i can clearly see pictures from posts in their twitter feed. All posts have both image and text but only 3 texts got downloaded.

{"id":"1710679457969885647","text":"LIVE: Gaza skyline after barrage of rockets launched into Israel https://t.co/gOFnmcRAx9","url":null}
{"id":"1710651297743974596","text":"World reacts to surprise attack by Hamas on Israel https://t.co/OmItu0wjBi","url":null}
{"id":"1710642360173150341","text":"Gaza skyline after Hamas launches rockets into Israel https://t.co/kuqfikxsYn","url":null}

Of these 3 posts, 2 had a live broadcast attached and 1 had an image that was not downloaded. Since Reuters has 1 text + 1 image per post, this should have been reflected in the results as well.

thomas694 commented 11 months ago

All affected posts seem to be reposts, aka retweets.

The "full text" field in a retweet sometimes only contains a shortened version of the text. For these posts we'll use the text of the original tweet.
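The fallback described above can be sketched roughly as follows. This is a minimal illustration, not TumblThree's actual code (which is C#): the function name `resolve_full_text` is hypothetical, and the keys `retweeted_status` and `full_text` are assumed to follow the classic Twitter tweet object layout.

```python
from typing import Any, Dict


def resolve_full_text(tweet: Dict[str, Any]) -> str:
    """Return the text to store for a tweet.

    Assumption: a retweet embeds the original tweet under
    "retweeted_status", and both objects carry a "full_text" field.
    """
    original = tweet.get("retweeted_status")
    if original and original.get("full_text"):
        # The retweet's own text may be shortened ("RT @user: ... …"),
        # so prefer the text of the original tweet.
        return original["full_text"]
    # Not a retweet (or no embedded original): use the tweet's own text.
    return tweet.get("full_text", "")
```

With this approach a retweet's stored text no longer starts with the "RT @user:" prefix, since it is taken from the original tweet.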

... but only 3 texts got downloaded.

Because these were real text posts. This has historical reasons and comes from old Tumblr code. So far, no one has extensively used the text download or reported a problem. Originally, a post was of one type (e.g. image, video or text) and its download could be enabled/disabled in the settings. At least for Twitter, a post can contain text and one media type; in the meantime it's also possible to mix all the different media types in one post. As a consequence, the counters will increase more quickly from now on, as one post can be counted for multiple types.

You can test it here.

T-prog3 commented 11 months ago

Ah, i thought it could be something like that. I've tested the new version and it works great! There's only one minor issue that i have noticed. In all cases, no matter whether the downloaded post has text or not, it still captures the URL of the image/video as part of the text in "text":

The result of this is that:

  1. Each object with a media type will have 2 links to the post, 1 in "text": and 1 in "url":
  2. Posts that only have a media type and 0 text get captured as text with just links to the post.
  3. It becomes hard to identify whether the link in "text": is to the media or whether it's an outgoing link to another site. This is due to Twitter's use of short links https://t.co/*****

On this topic of links in the text, i think that the downloader should capture the true link that you see when you hover over links in the original posts. As in this example: https://twitter.com/Reuters/status/1710642360173150341. The original Twitter post shows cut-off names like twitter.com/i/broadcasts/1... in the text. Hovering over it reveals the true https://twitter.com/i/broadcasts/1rmGPMaqvpjJN, but the downloader as of now captures the short URL https://t.co/kuqfikxsYn

thomas694 commented 11 months ago

We don't change the "full text" field's content and use it as delivered by Twitter to fill the text field. They append these short URLs. For the url field in texts.txt we may change it to search in the following order:

  1. post's URL field (seems to be mostly empty)
  2. post content's (expanded) URL, if any
  3. we build a fallback URL to the post

I'll make the changes in the next few days.
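The search order listed above could look something like this. It is only a sketch under assumptions: `resolve_post_url` and the `expanded_urls` key are hypothetical names, and the fallback pattern `https://twitter.com/<blog>/status/<id>` is taken from the example links earlier in this thread.

```python
from typing import Any, Dict


def resolve_post_url(post: Dict[str, Any], screen_name: str) -> str:
    """Pick the value for the "url" field using the proposed order."""
    # 1. The post's own URL field (reported to be mostly empty).
    if post.get("url"):
        return post["url"]
    # 2. The first (expanded) URL found in the post content, if any.
    #    "expanded_urls" is a hypothetical, pre-extracted list.
    for link in post.get("expanded_urls", []):
        if link:
            return link
    # 3. Fallback: build a URL to the post from blog name and post id.
    return f"https://twitter.com/{screen_name}/status/{post['id']}"
```

For a post with an empty URL field and no links in its content, this yields the same kind of link as the Reuters example above.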

thomas694 commented 10 months ago

new test version here

T-prog3 commented 10 months ago

That version is way worse in my opinion. The biggest difference now is that the "url":[] field will be empty 99.9% of the time unless the text field has links to an external domain.

1. Lets say that this is the original text:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed Google.com do eiusmod tempor incididunt ut labore et dolore magna aliqua. Youtube.com/myvideo"

2. As of now the software's text output becomes this:

{"id":"1234567890","date":"2043-12-10 13:59:46Z","text":"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed https://t.co/googleURL do eiusmod tempor incididunt ut labore et dolore magna aliqua. https://t.co/youtubeURL ","url":["https://google.com", "https://youtube.com/myVideo"]}

3. Expected result:

{"id":" uniqueID ","date":" datePosted ","text":" Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed https://google.com do eiusmod tempor incididunt ut labore et dolore magna aliqua. https://youtube.com/myVideo ","url":" linkToPostSource "}

In other words, what you expect is to get {"KEY","DATE", "ORIGINAL-TEXT-YOU-CAN-UNDERSTAND-WITH-EASE", "SOURCE-LINK" }.

4. Additional Problems with posts containing media

If the downloaded post ONLY contains images/video and no text then you get outputs like these:

{"id":"uniqueID ","date":" datePosted ","text":"https://t.co/linkToMediaPost1 ","url":[]}
{"id":"uniqueID ","date":" datePosted ","text":"https://t.co/linkToMediaPost2 ","url":[]}
{"id":"uniqueID ","date":" datePosted ","text":"https://t.co/linkToMediaPost3 ","url":[]}

I do not regard these as text to be downloaded because they don't add any value. I can already get to this post link with ease if i save images with %i and add that id after twitter.com/. So in my opinion these https://t.co/linkToMediaPost links that Twitter appends to all posts with media are totally useless.

Edit: One more thought is that texts.txt objects should be formatted in JSON structure

{
  "id":"uniqueID ",
  "date":" datePosted ",
  "text":" text",
  "url":[]
}

It becomes easier to read the text when everything is structured and not just rows of csv

thomas694 commented 10 months ago

1-3 So actually, URL shall always be a link to the post. I'm not sure whether the links in the text should be kept the way they originally are on Twitter or be replaced by "easily readable" links. Does someone else have an opinion?

4 These entries will be ignored.

We'll format the JSON output.
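Both points (skipping link-only entries and writing structured JSON) might be sketched like this. The helper names are hypothetical and the regular expression is an assumption about what a t.co short link looks like; it is not taken from TumblThree.

```python
import json
import re
from typing import Dict, List

# Assumed shape of Twitter's short links: https://t.co/<alphanumeric id>
SHORT_LINK = re.compile(r"https://t\.co/\w+")


def has_real_text(text: str) -> bool:
    """True if anything is left after removing t.co short links."""
    return bool(SHORT_LINK.sub("", text).strip())


def format_entries(entries: List[Dict[str, object]]) -> str:
    """Drop entries whose text is only short links, then pretty-print."""
    kept = [e for e in entries if has_real_text(str(e.get("text", "")))]
    # indent=2 gives one key per line instead of one object per row;
    # ensure_ascii=False keeps characters such as "…" readable.
    return json.dumps(kept, indent=2, ensure_ascii=False)
```

An entry such as `{"text": "https://t.co/linkToMediaPost1 "}` would be skipped, while a post with a sentence followed by a short link would be kept.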

T-prog3 commented 10 months ago

In my mind URL is a reference to the original source and not a list of outgoing links encountered.

"Easily readable" links are the original way in which Twitter presents them to the user on their platform, so that their users know what they are clicking on. All of their https://t.co/** links are only a thing for the outside world, when you copy/share a link, or in this case, download.

And i do share this view that users want to know where links will take them. If someone links their Facebook, Instagram, and other social media pages then i would want to know what link takes me where. If all links are https://t.co/RandomCHARS then i have to open all links to find the one i am looking for.

thomas694 commented 10 months ago

Ok, after reading your answer I looked at a few blogs in the browser. In the browser the shortened real URL is displayed and the full URL is shown in a tooltip; the link first leads to their short URL service. If the real URLs are "hidden" behind these short URLs in the returned data structure, we should use the real URL, similar to what can be seen on the web.

How should URL replacement be handled as URLs aren't only at the end? Just using the full URL (brackets or not?):

{
  "id": "123",
  "date": "...",
  "text": "Lorem ipsum dolor sit amet, consetetur https://domain.tld/some/very/long/url/path1/path2/some_long_page_name?param1=value&param2=value2 elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore [https://domain.tld/some/other/even/longer/url/with/long/path1/path2/some_extra_long_page_name?param3=value3&param4=value4] aliquyam erat, sed diam voluptua. At vero eos et accusam et.",
  "url": []
},

Is it more readable to use their short URLs in the text together with a mapping structure "links"? We could also use their shortened real URLs (as shown on the web), but that could sometimes be ambiguous in the text version, or a placeholder like "Link1", "[Link2]" (brackets or not?).

{
  "id": "123",
  "date": "...",
  "text": "Lorem ipsum dolor sit amet, consetetur https://t.co/DsZzKaYYYa elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore https://t.co/eafmVazBWZ aliquyam erat, sed diam voluptua. At vero eos et accusam et.",
  "links": {
    "https://t.co/DsZzKaYYYa": "https://domain.tld/some/very/long/url/path1/path2/some_long_page_name?param1=value&param2=value2",
    "https://t.co/eafmVazBWZ": "https://domain.tld/some/other/even/longer/url/with/long/path1/path2/some_extra_long_page_name?param3=value3&param4=value4"
  },
  "url": []
},

Or any other suggestion?
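The two variants above (full URLs inline, with or without brackets, versus short URLs plus a "links" mapping) could be produced along these lines. A rough sketch only: the function names are hypothetical, and each URL entity is assumed to carry `url` (the t.co link) and `expanded_url` (the real target) keys.

```python
from typing import Dict, List


def expand_links(text: str, url_entities: List[Dict[str, str]],
                 brackets: bool = False) -> str:
    """Variant 1: replace every short URL in the text by its full URL."""
    for entity in url_entities:
        full = entity["expanded_url"]
        if brackets:
            # Brackets mark where a long link begins and ends.
            full = f"[{full}]"
        text = text.replace(entity["url"], full)
    return text


def build_link_map(url_entities: List[Dict[str, str]]) -> Dict[str, str]:
    """Variant 2: keep short URLs in the text, map them in "links"."""
    return {e["url"]: e["expanded_url"] for e in url_entities}
```

Variant 1 keeps everything in one field at the cost of long lines; variant 2 keeps the text compact but requires the reader to look up each short URL in the mapping.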

T-prog3 commented 10 months ago

My thoughts are these:

1.

Lorem ipsum dolor sit amet, consetetur https://domain.tld/some/very/long/url/path1/path2/some_long_page_name?param1=value&param2=value2 elitr, sed diam nonumy eirmod tempor

This is the truest form, showing how everything actually looks under the hood. It can be an issue in rare cases if a link in the middle of a text is extremely long.

2.

invidunt ut labore et dolore [https://domain.tld/some/other/even/longer/url/with/long/path1/path2/some_extra_long_page_name?param3=value3&param4=value4] aliquyam erat, sed diam voluptua. At vero eos et accusam et.",

Brackets only tell where the link begins and where it ends, so this could be preferable to guard against confusion.

3. Your other example is better at solving potential issues with long links in the middle but is less intuitive. Users really have to be observant to notice that the short URL in "text": is the same as in "links":, and that it is just a reference to something else that points you in the direction of the full URL.

If you want to go with this approach then a placeholder like [LinkRef*] is easier to understand than using the short URL. The only downside to this solution that i can think of is that it may take away from the natural understanding of the text.

If we have a text that says

Check out my http://instagram.com/username page to see my latest posts

then it would become

Check out my [LinkRef*] page to see my latest posts

This could make texts difficult to read if you have to check the reference.

4. Another solution that i can think of, and would prefer, is to save our text files as .md Markdown instead of .txt and make all links Markdown links: [myShortText](https://domain.tld/some/other/even/longer/url/with/long/path1/path2/some_extra_long_page_name?param3=value3&param4=value4). Then we could have the option to see the text presented in two ways:

  1. Markdown rendered, presented exactly as Twitter does without showing long URLs
  2. Raw source, showing everything under the hood as it really is

thomas694 commented 10 months ago

As far as I know it's not possible to have link texts on Twitter, so option 4 doesn't make much sense, does it? That would end up as a shortened URL of 23 characters followed by the full URL. It doesn't have to be option 3 (with "LinkRef") either, I only thought it would help readability by not having to skip long links in the text. So we could take option 2, but... if you prefer to open texts.txt in your Markdown viewer, then we should go with option 1, because only then are the links rendered clickable. Which one do you prefer after all our considerations?

T-prog3 commented 10 months ago

My thinking is something like this:

By observing Twitter on the web i can see that they have three types of links: #HashtagLinks, @UserLinks and website links such as Google.com. The first two are auto-generated by Twitter through the use of # and @, which correspond to:

<a dir="ltr" href="/hashtag/Halloween?src=hashtag_click" role="link">#Halloween</a>
<a dir="ltr" href="/username" role="link">@username</a>

Then we have website links, which have several parts:

<a dir="ltr" href="https://t.co/N1Ri3oAGAd" rel="noopener noreferrer nofollow" target="_blank" role="link"> Part 1: Twitter's short URL, which redirects us to the true destination when clicked.

<span aria-hidden="true">https://</span> Part 2: First part of the true URL.

teraboxapp.com/s/1FdOh3L5Vudm Part 3: Second part of the true URL; this is also the link text that users who look at this tweet will see.

<span aria-hidden="true">dy4POghpQ9A</span> Part 4: Last part of the true URL. This part will only exist IF the URL is considered too long.

<span aria-hidden="true">…</span> Part 5: Extra part that is appended to Part 3 IF the URL is considered too long.

</a>

So Twitter shows (Part3 || Part3 + Part5) in the text to its users. When clicked, it takes them to Part1, which resolves to (Part2 + Part3 || Part2 + Part3 + Part4); this is also shown in a tooltip.
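To make the part breakdown concrete, here is a small sketch that pulls the three values out of such an anchor using Python's standard `html.parser`. It assumes the markup looks exactly like the snippet quoted above (hidden spans flagged with `aria-hidden="true"`, "…" as the truncation marker); the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class TweetLinkParser(HTMLParser):
    """Collects short URL (Part 1), display text and full URL of one link."""

    def __init__(self) -> None:
        super().__init__()
        self.short_url: Optional[str] = None
        self.display_parts: List[str] = []
        self.full_parts: List[str] = []
        self._in_anchor = False
        self._hidden = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self._in_anchor = True
            self.short_url = attributes.get("href")  # Part 1
        elif tag == "span" and self._in_anchor:
            self._hidden = attributes.get("aria-hidden") == "true"

    def handle_endtag(self, tag):
        if tag == "span":
            self._hidden = False
        elif tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        piece = data.strip()
        if not self._in_anchor or not piece:
            return
        if piece == "…":
            self.display_parts.append(piece)   # Part 5: shown, not in URL
        elif self._hidden:
            self.full_parts.append(piece)      # Part 2 / Part 4
        else:
            self.display_parts.append(piece)   # Part 3: shown and in URL
            self.full_parts.append(piece)


def parse_tweet_link(html: str) -> Tuple[Optional[str], str, str]:
    """Return (short URL, displayed text, full URL) for one anchor."""
    parser = TweetLinkParser()
    parser.feed(html)
    parser.close()
    return (parser.short_url, "".join(parser.display_parts),
            "".join(parser.full_parts))
```

Applied to the anchor above, the displayed text would be Part3 + Part5 and the full URL Part2 + Part3 + Part4, which matches the description.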

Relation to this software:

One of the problems we have is the loss of certain information:

  1. If we only use the short URL in Part1 then we lose the information in Part3, but this is the most important information for understanding the text. Any substitution of Part3 will also create a loss of comprehension.
  2. If we only use Part3 then we end up with broken links because of missing Part4 and it will not be clickable because of missing Part2. So this is also a loss of important information.

So we do need to combine the parts to get all the information, but if we use Part2 + Part3 + Part4 and have the full URL in the text then we have the issue of very long links that make reading difficult. So as far as i can see, the only solution that solves all issues in a .txt format is to use something that resembles Part 3 in the text and go with your "links": { Key : Value } solution.

Markdown:

Because of the issues above i was thinking that markdown would have been the dream. If we could have a Tweet that looks like this:

Lorem ipsum @MyCoolName, Check out my site: (Part3)

#Cool #Cat

And turn it into this:

@MyCoolName -> [@MyCoolName](https://twitter.com/MyCoolName)
#Cool -> [#Cool](https://twitter.com/hashtag/Cool)
#Cat -> [#Cat](https://twitter.com/hashtag/Cat)
Part3 -> [Part3](Part2 + Part3 + Part4) == [myCoolSite.com/very](https://myCoolSite.com/veryLongURL)

Our downloaded tweet text would then look like this with working links in md:

Lorem ipsum @MyCoolName, Check out my site: myCoolSite.com/very #Cool #Cat

This would have been a true representation of all the information available, without any major loss and with clean texts. To me the most important thing is keeping the information as close to the original view as possible, because it's the essence of it all. Everything else, like full links and being clickable, is only attached information that you may want to know/have but is less important.
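The Markdown idea could be sketched as below. This is only an illustration of the mapping described above, not something TumblThree does: `to_markdown` is a hypothetical helper, the `links` argument (short URL mapped to display text and full URL) is an assumed input, and the short link in the usage example is made up.

```python
import re
from typing import Dict, Tuple

# "@name" and "#tag" not preceded by a word character (skips e-mail addresses).
MENTION = re.compile(r"(?<!\w)@(\w+)")
HASHTAG = re.compile(r"(?<!\w)#(\w+)")


def to_markdown(text: str, links: Dict[str, Tuple[str, str]]) -> str:
    """Turn mentions, hashtags and short links into Markdown links.

    links maps a short URL to (display text, full URL).
    """
    # Mentions and hashtags first, so "#" or "@" inside full URLs
    # inserted afterwards are never rewritten.
    text = MENTION.sub(r"[@\1](https://twitter.com/\1)", text)
    text = HASHTAG.sub(r"[#\1](https://twitter.com/hashtag/\1)", text)
    for short_url, (display, full_url) in links.items():
        text = text.replace(short_url, f"[{display}]({full_url})")
    return text


print(to_markdown(
    "Lorem ipsum @MyCoolName, Check out my site: https://t.co/abc123 #Cool #Cat",
    {"https://t.co/abc123": ("myCoolSite.com/very",
                             "https://myCoolSite.com/veryLongURL")},
))
```

Rendered, the output reads like the tweet on the web, while the raw source keeps every full URL.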

At this moment i have no other ideas how to solve all of the problems so do what you think is best. I personally use Notepad3 to view all .txt files so [ ] or not doesn't really matter to me.

thomas694 commented 10 months ago

The issue has been fixed and closed. You can still comment. Feel free to ask for reopening the issue if needed.