Mincka / DMArchiver

A tool to archive the direct messages, images and videos from your private conversations on Twitter
GNU General Public License v3.0
222 stars 25 forks

XMLSyntaxError: switching encoding: encoder error #1

Closed denisjacquemin closed 7 years ago

denisjacquemin commented 7 years ago

Edited by Mincka on August 10th 2017: For anybody Googling for this error message XMLSyntaxError: switching encoding: encoder error:

Possible workarounds:

1) Strip the emojis on macOS before parsing; see the implementation in 073a3589280ee513b404051a4b1c68f80ccbb590.
2) Downgrade to Python 3.4 if you can. I attempted to upgrade to Python 3.6 but hit other compatibility issues, this time with pyinstaller, so I was unable to move forward. Downgrading to Python 3.4 allows my tool to work perfectly on all platforms.
3) Remove the lxml package and reinstall it using STATIC_DEPS=true (https://github.com/lorien/grab/issues/199#issuecomment-297721800). However, I cannot guarantee this will work.

Using multiple Python versions on macOS is such a huge pain. 😞

Original message: My setup:

$ dmarchiver
Enter your username or email: myusername
Enter your password (characters will not be displayed): 
Authentication succeedeed.
Conversation ID not specified. Retrieving all the threads.
Starting crawl of '################'
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/bin/dmarchiver", line 9, in <module>
    load_entry_point('dmarchiver==0.0.5', 'console_scripts', 'dmarchiver')()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 67, in main
    crawler.crawl(thread_id, args.download_images, args.download_gifs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 443, in crawl
    tweets, download_images, download_gif)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 357, in _process_tweets
    document = lxml.html.fragment_fromstring(value)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
    base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
  File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
  File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
  File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
  File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
  File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1

LaurentLC commented 7 years ago

Hi there, I basically have the same error, trying to download a huge DM thread:

Conversation ID specified (xxxxx). Retrieving only one thread.
Starting crawl of 'xxxxx'
Traceback (most recent call last):
  File "/usr/local/bin/dmarchiver", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 62, in main
    args.download_gifs)
  File "/usr/local/lib/python3.5/site-packages/dmarchiver/core.py", line 463, in crawl
    tweets, download_images, download_gif)
  File "/usr/local/lib/python3.5/site-packages/dmarchiver/core.py", line 377, in _process_tweets
    document = lxml.html.fragment_fromstring(value)
  File "/usr/local/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
    base_url=base_url, **kw)
  File "/usr/local/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
  File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
  File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
  File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
  File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
  File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1

Note that the script was able to download a short thread perfectly (just a few DMs, no images, nothing else).

Mincka commented 7 years ago

Hello Laurent,

Are you also using macOS? It seems there is an error with the lxml library when it reaches a message with accented characters. Could you confirm there are no accented characters in the short thread that works for you?

It's quite difficult for me to identify the exact cause because I do not own a Mac to debug it. It works properly on Windows and Linux. I'll keep looking for a possible fix for macOS.

There's a command for UTF-8 support in the Terminal which should be executed before the script, but I'm not sure it will make a difference here: export PYTHONIOENCODING=utf-8

LaurentLC commented 7 years ago

Hi, As I said via email (I thought it would be also posted here, whatever), I do have more or less the same conf: Mac OS 10.11.6, Python 3.5, lxml 3.6.4. Unfortunately, the short thread that worked also contains accented characters (damn french people), so that's probably not about that…

I tried to execute the command you gave, but the problem is still there.

Thanx for the help, it would be really cool to have this script work.

Mincka commented 7 years ago

I'm going to add a raw mode to fetch JSON responses without using the parser. I will also add a verbose mode and add proper error handling. I hope it will help us to find the root cause. Thanks for the tests.

LaurentLC commented 7 years ago

Zupa. Keep up the good work, looking forward to testing it :)

LaurentLC commented 7 years ago

(BTW, just tested the Windows exe on a basic Windows 10 Family install, and it worked perfectly fine with every kind of DM thread… good job) (but it seems that the GMT offset is not correct, as if the French +2 hours were missing)

Mincka commented 7 years ago

Yep. I've already updated the script to use local time instead of UTC. It has not been pushed to GitHub yet. As for the error, this confirms the issue is related to the macOS setup.
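The UTC-to-local conversion mentioned here can be sketched as follows. This is only an illustration: it assumes message timestamps are available as Unix epoch seconds (the thread does not say how DMArchiver stores them), and `to_local` is a hypothetical helper, not the project's code.

```python
from datetime import datetime, timedelta, timezone


def to_local(epoch_seconds, tz=None):
    """Convert a Unix timestamp to an aware datetime.

    With tz=None the system's local time zone is used; passing an explicit
    tzinfo makes the result reproducible.
    """
    utc_time = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return utc_time.astimezone(tz)


# Example: noon UTC shown in a fixed UTC+2 zone. The offset is hard-coded
# for illustration only; real code would rely on the system time zone.
utc_plus_two = timezone(timedelta(hours=2))
stamp = datetime(2016, 10, 20, 12, 0, tzinfo=timezone.utc).timestamp()
print(to_local(stamp, utc_plus_two).isoformat())
```

Calling `to_local(stamp)` without a second argument would use whatever time zone the machine is configured with, which is presumably what a user expects to see in an archive.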

Mincka commented 7 years ago

Thanks to a friend of mine with a Mac, I've been able to track down what seems to be the root cause of this bug.

The parsing fails when a tweet contains an emoji. The generated code will look like this for the image.

<img title="Visage avec des larmes de joie" class="Emoji Emoji--forText" draggable="false" aria-label="Emoji: Visage avec des larmes de joie" alt="😂" src="https://abs.twimg.com/emoji/v2/72x72/1f602.png">

It contains the alt attribute with the unicode character of the smiley (😂).

With this new information, I've found this bug ticket with a similar issue: https://bugs.launchpad.net/lxml/+bug/1538213

Additional tests have been done on macOS and no issue has been identified with multiple kinds of accented characters or URLs. The issue only seems to occur with emoji Unicode characters.

Consequently, I'm going to do the following:

1) Implement a platform-specific workaround for Mac OS with platform detection.

import re
from sys import platform

# Mac OS lxml bug workaround
if platform == "darwin":
    # Inject emojis' titles into alt attributes, replacing the tweet's unicode emojis
    # to prevent the encoding error with lxml while keeping a coherent alt attribute.
    # Raw strings are required so that the group references are not read as escapes.
    value = re.sub(r'(title="(.*?)".*?class="Emoji.*?alt=")(.*?)"', r'\g<1>\g<2>"', value)

or simpler alternative

if platform == "darwin":
    # Clear alt attributes of emojis
    value = re.sub(r'(class="Emoji.*?)alt=".*?"', r'\g<1> alt=""', value)

2) Add a proper try / catch for the parsing.
3) Complete the bug ticket.

LaurentLC commented 7 years ago

\o/

Mincka commented 7 years ago

Could you just confirm there was no emoji for the thread you've been able to parse on macOS, Laurent?

LaurentLC commented 7 years ago

Yes, it was an old and short thread with no emojis at the time…

muesliq commented 7 years ago

Having the exact same problem. Happy to hear you're working on a fix!

\o/ (not using emoji in order not to break anything ;-)

Mincka commented 7 years ago

I think I have a fix in b7c316a for the Mac OS users but I need confirmation guys. You can now upgrade the package and test again. 😄

$ pip3 install dmarchiver --upgrade
$ dmarchiver

muesliq commented 7 years ago

I did. Got a little further this time: 3 images (instead of 0), 0 text files. Error:

Authentication succeedeed.
Conversation ID specified (123). Retrieving only one thread.
Starting crawl of '123'
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/bin/dmarchiver", line 9, in <module>
    load_entry_point('dmarchiver==0.0.7', 'console_scripts', 'dmarchiver')()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 62, in main
    args.download_gifs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 463, in crawl
    tweets, download_images, download_gif)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 377, in _process_tweets
    document = lxml.html.fragment_fromstring(value)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
    base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
  File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
  File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
  File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
  File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
  File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1

Maybe something went wrong with the update? I got this:

Collecting dmarchiver
  Downloading dmarchiver-0.0.8.zip
Collecting requests>=2.11.1 (from dmarchiver)
  Using cached requests-2.11.1-py2.py3-none-any.whl
Collecting lxml>=3.6.4 (from dmarchiver)
  Using cached lxml-3.6.4.tar.gz
Collecting cssselect>=0.9.2 (from dmarchiver)
  Using cached cssselect-1.0.0-py2.py3-none-any.whl
Installing collected packages: requests, lxml, cssselect, dmarchiver
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/commands/install.py", line 342, in run
    prefix=options.prefix_path,
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/req/req_set.py", line 784, in install
    **kwargs
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/req/req_install.py", line 849, in install
    self.move_wheel_files(self.source_dir, root=root, prefix=prefix)
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/req/req_install.py", line 1062, in move_wheel_files
    isolated=self.isolated,
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/wheel.py", line 345, in move_wheel_files
    clobber(source, lib_dir, True)
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/wheel.py", line 316, in clobber
    ensure_dir(destdir)
  File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/utils/__init__.py", line 83, in ensure_dir
    os.makedirs(path)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/requests'

Mincka commented 7 years ago

@muesliq: It seems you're using the wrong version of Python (2.7 instead of 3.5). Could you try with pip3 install dmarchiver --upgrade?

That's my fault. It's mandatory to specify pip3 on Mac OS X because both versions are installed. I've updated my previous post.

And I guess you've been able to download more images only because those images were uploaded recently, with no emojis in the tweets at or after them.

muesliq commented 7 years ago

Updated, thanks! Better now but not fixed yet. Thousands of tweets processed, 129 images, yet still 0 text files.

Authentication succeedeed.
Conversation ID specified (123). Retrieving only one thread.
Starting crawl of '123'
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/bin/dmarchiver", line 9, in <module>
    load_entry_point('dmarchiver==0.0.8', 'console_scripts', 'dmarchiver')()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 62, in main
    args.download_gifs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 470, in crawl
    tweets, download_images, download_gif)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 384, in _process_tweets
    document = lxml.html.fragment_fromstring(value)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
    base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
  File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
  File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
  File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
  File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
  File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1

Mincka commented 7 years ago

Ok, thanks. I've added exception handling to print the ID of the tweet that raises the exception. The script should now continue, even when a tweet causes issues.

You can upgrade with pip3 install dmarchiver --upgrade.

This is a poor, temporary solution, but the raw HTML of the offending tweets will also be output in the log file as a [DMConversationEntry] with a [ParseError] tag. It will help me understand what's causing the issue.
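The skip-and-log behaviour described here could look roughly like the sketch below. It is not the code from the repository: `process_tweets` and `fake_parse` are hypothetical names, the parser is injected so the example runs without lxml, and the message wording only imitates the log excerpt quoted later in this thread.

```python
def process_tweets(tweets, parse):
    """Parse each (tweet_id, raw_html) pair, skipping the ones that fail.

    `parse` stands in for the real parser call (lxml.html.fragment_fromstring
    in the tracebacks above).
    """
    parsed, log_lines = {}, []
    for tweet_id, raw_html in tweets:
        try:
            parsed[tweet_id] = parse(raw_html)
        except Exception as error:  # broad on purpose: keep the crawl going
            print("Unexpected error for tweet '{0}': {1}".format(tweet_id, error))
            log_lines.append(
                "[DMConversationEntry] [ParseError] Parsing of tweet "
                "'{0}' failed. Raw HTML: {1}".format(tweet_id, raw_html))
    return parsed, log_lines


def fake_parse(raw_html):
    # Stand-in failure condition: reject the emoji seen in the failing tweets.
    if "\U0001F602" in raw_html:
        raise ValueError("switching encoding: encoder error")
    return raw_html.strip()


parsed, log_lines = process_tweets(
    [("1", "<p>bonjour</p>"), ("2", "<p>\U0001F602</p>")], fake_parse)
```

Catching `Exception` this broadly hides real bugs as well, which matches the "poor, temporary solution" caveat above; the logged raw HTML is what makes the trade-off acceptable for debugging.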

The only weird situation I saw is a varying position of the img attributes that makes the regex fail. I've seen title before alt on one computer and after alt on another... Maybe it's the same here with class, or the emoji may be used in cards or other content types.
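One way to make the workaround independent of attribute order is to match each img tag as a whole first and only then edit its alt attribute. The following is a sketch of that idea, not the fix that was actually committed; `clear_emoji_alt` is a hypothetical helper, and it assumes attribute values never contain a `>` character.

```python
import re

# Match a whole <img ...> tag first, then edit the alt attribute inside it,
# so the relative order of class, title and alt no longer matters.
IMG_TAG = re.compile(r'<img\b[^>]*>')
EMOJI_CLASS = re.compile(r'class="[^"]*\bEmoji\b[^"]*"')
# The lookbehind avoids touching attributes such as data-alt.
ALT_ATTR = re.compile(r'(?<![\w-])alt="[^"]*"')


def clear_emoji_alt(html):
    """Blank the alt attribute of every <img> tag carrying the Emoji class."""
    def fix_tag(match):
        tag = match.group(0)
        if EMOJI_CLASS.search(tag):
            tag = ALT_ATTR.sub('alt=""', tag)
        return tag
    return IMG_TAG.sub(fix_tag, html)


class_first = '<img class="Emoji Emoji--forText" title="joy" alt="\U0001F602" src="x.png">'
alt_first = '<img alt="\U0001F602" title="joy" class="Emoji Emoji--forText" src="x.png">'
print(clear_emoji_alt(class_first))
print(clear_emoji_alt(alt_first))
```

Images without the Emoji class are left untouched, so this approach alone would not cover emojis that appear in other attributes or content types.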

muesliq commented 7 years ago

Now the upgrade doesn't seem to work:

pip3 install dmarchiver --upgrade
Requirement already up-to-date: dmarchiver in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages
Requirement already up-to-date: requests>=2.11.1 in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (from dmarchiver)
Requirement already up-to-date: lxml>=3.6.4 in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (from dmarchiver)
Requirement already up-to-date: cssselect>=0.9.2 in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (from dmarchiver)

Mincka commented 7 years ago

I had the same issue. It's quite strange. Maybe a temporary issue with PyPI?

I've been able to uninstall it and reinstall the latest version (0.0.10).

To exclude caching issues for the package download, I've also deleted the following folder on Windows: C:\Users\[User]\AppData\Local\pip\cache

For Unix, it seems to be ~/.pip/cache/ but I'm not sure.

LaurentLC commented 7 years ago

Hi! No problem with the upgrade here, and I've been able to archive a few DM threads, including big ones with emoji, pictures… Nice!

One error though, with one thread. I had a lot of Unexpected error for tweet 'xxxx' messages, but it still continues.

The Twitter user has an emoji in her username (see below the beginning of the file that has been written)

[DMConversationEntry] [ParseError] Parsing of tweet 'xxxx' failed. Raw HTML: <div class="DirectMessage
            DirectMessage--received

            clearfix dm js-dm-item"
            data-quick-reply-json="null"
            data-message-id="xxxx"
            data-item-id="xxxx"

            data-card-component="dm_existing_conversation_dialog"

            data-component-context="dm_existing_conversation_dialog">

  <div class="DirectMessage-container">
    <div class="DirectMessage-avatar">
      <a href="/xxxx" class="js-action-profile js-user-profile-link" data-user-id="xxxx">
  <div class="DMAvatar DMAvatar--1 u-chromeOverflowFix">
    <span class="DMAvatar-container">
      <img class="DMAvatar-image" src="xxxx" alt="SabineLC 🎃">
    </span>
</div>

I guess it might be the problem..?

We're getting there!

muesliq commented 7 years ago

pip3 install dmarchiver --upgrade --ignore-installed seems to have done the trick. And it works just fabulous! You managed to fix the bugs, kudos!

Two tweets (out of 12620) had an "unexpected error". The first one contained the letter 𝜋. The second had the following tweet embedded (which contained lots of emoji): https://twitter.com/magnifier661/status/787044538145574912

Mincka commented 7 years ago

Thanks a lot @LaurentLC and @muesliq! 👍

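You've been able to identify 3 cases that are not properly handled yet:

1) An emoji in a username, which ends up in the alt attribute of the avatar image.
2) The letter 𝜋 in a tweet.
3) An embedded tweet containing lots of emojis.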

I'm not sure yet how I will be able to find proper workarounds. The bug is in the lxml lib for Mac OS. Identifying emojis with regex does not seem possible. The error with 𝜋 (U+1D70B 𝜋 MATHEMATICAL ITALIC SMALL PI) also means that the issue will not be limited to emojis. It's only a simple character so it could mean the script cannot handle non-ASCII characters at all on Mac OS... :-/

Update: My guess is the error is related to code points encoded on four bytes. https://en.wikipedia.org/wiki/Unicode

Code points in Planes 1 through 16 (supplementary planes) are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8.

Emojis are also encoded in Plane 1 (1F000–1FFFF) so I may drop all content in the range 10000-2FFFF (Planes 1 & 2). It contains mainly ancient Egyptian characters, mathematical symbols and emojis.

For reference: http://stackoverflow.com/a/13752628/3049282
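The four-byte hypothesis is easy to check from Python itself. The sketch below compares an accented letter (which parsed fine in the tests above) with the two characters that failed; the helper names are made up for this illustration.

```python
def utf8_length(char):
    """Number of bytes the character occupies in UTF-8."""
    return len(char.encode("utf-8"))


def is_supplementary(char):
    """True for code points in Planes 1-16 (U+10000 and above)."""
    return ord(char) > 0xFFFF


samples = {
    "e-acute": "\u00E9",        # accented letter, no parsing problem reported
    "math pi": "\U0001D70B",    # MATHEMATICAL ITALIC SMALL PI
    "joy emoji": "\U0001F602",  # emoji from the failing <img> alt attribute
}
for name, char in samples.items():
    utf16_units = len(char.encode("utf-16-le")) // 2
    print(name, hex(ord(char)), utf8_length(char), utf16_units,
          is_supplementary(char))
```

The accented letter needs two bytes in UTF-8 and a single UTF-16 unit, while both failing characters need four bytes and a surrogate pair, which is consistent with the guess that only supplementary-plane code points trigger the error.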

muesliq commented 7 years ago

By the way: Fantastic little piece of software. Thank you!

Mincka commented 7 years ago

Happy to help. 😄

I have implemented in 073a3589280ee513b404051a4b1c68f80ccbb590 a more general solution as a "fix" for this issue. On Mac OS X, all the Unicode characters encoded on 4 bytes are now replaced by "░" before the lxml parsing.
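A replacement of this kind can be sketched in a few lines. This is an illustration under assumptions, not the code from the commit: `strip_four_byte_chars` is a hypothetical name, the `platform` parameter exists only so the behaviour can be tried on any machine, and the placeholder is assumed to be U+2591.

```python
import re
import sys

# Any code point from U+10000 upward, i.e. every character that needs
# four bytes in UTF-8.
FOUR_BYTE_CHARS = re.compile("[\U00010000-\U0010FFFF]")
PLACEHOLDER = "\u2591"  # assumed placeholder character (U+2591)


def strip_four_byte_chars(value, platform=sys.platform):
    """Replace 4-byte characters on macOS; leave other platforms untouched."""
    if platform != "darwin":
        return value
    return FOUR_BYTE_CHARS.sub(PLACEHOLDER, value)


print(strip_four_byte_chars('alt="SabineLC \U0001F383"', platform="darwin"))
```

Because the character class covers every supplementary plane, this handles emojis in usernames, the mathematical 𝜋 and emojis in embedded tweets alike, at the cost of losing those characters in the archive on macOS.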

Consequently, it should fix all the encountered issues and allow a flawless parsing. 😄

To celebrate this, I've bumped the version to 0.1.0. 😉

Mincka commented 7 years ago

Rejoice Mac users, I've been able to make a precompiled executable for macOS. It should be a lot easier for non-technical users to use. 😄 https://github.com/Mincka/DMArchiver/releases/tag/0.1.0

Mincka commented 7 years ago

Fixed in 073a3589280ee513b404051a4b1c68f80ccbb590

sussron commented 7 years ago

OMGoodness, I was so excited: it was backing up messages with this new download and it all looked to be going well, and then I got an error screen. Do you know what this means?

sussron commented 7 years ago

this is what it looked like as it was running before it got the error

sussron commented 7 years ago

now i got this screen

sussron commented 7 years ago

oh it didn't let me attach the 5MB file of the one particular message thread.

But here are all the various threads that were in the command screen. The most important one is the Starting crawl of '629006352329760768'

Last login: Mon Nov 7 20:43:14 on ttys000
Ronnies-MacBook-Pro:~ ronniesussman$ /Users/ronniesussman/Downloads/dmarchiver ; exit;
Enter your username or email: beckybulldognj
Enter your password (characters will not be displayed):
Authentication succeedeed.
Conversation ID not specified. Retrieving all the threads.
Starting crawl of '629006352329760768'
Begin of thread reached
Total processed tweets: 49899
Writing conversation to 629006352329760768.txt
[Truncated for confidentiality reasons]
logout
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.

[Process completed]

sussron commented 7 years ago

Wow, so I tried it a second time and WOW!! It ran through the process. I'm so very, very excited!!! Here is the number of message threads it found and backed up (pasted into a Word document). I noticed the message threads don't go back to inception, just to a certain date. For example, the one I'm attaching starts in May 2016 and the conversation was started in August 2015. Does this have a time limit?

Trust me, I'm so excited to have any of these, even as a text version without the images, videos or photos in any capacity (although with photos and videos would be INCREDIBLE). I was just curious.

Julien, thanks so much. Ronnie from New Jersey

sussron commented 7 years ago

I'm not sure all the messages were backed up. I'm looking for 2 particular ones that I can't find, but I'm going to go through all the txt files and check that I didn't miss them.

Thanks! Ronnie

sussron commented 7 years ago

It does seem it didn't capture all the conversations or go back to the first line. I will note the message ID if I can locate it on the source element page.

Thanks Ronnie

On Nov 7, 2016 9:37 PM, "Ronnie Sussman" sussron@gmail.com wrote:

I'm not sure all the messages were backed up. i'm looking for 2 particular ones that i can't find, but i'm going to go through all the txt files and see that i didn't miss it.

Thanks! Ronnie

Mincka commented 7 years ago

Hello Ronnie,

Glad to see you're getting better results. However, I'm still not sure which error message you're talking about. There is no known limit on thread size. If there is an error, it should appear in the generated file. Messages deleted by the users cannot be recovered.

If you want to download images and GIFs from your specific conversation (629006352329760768), you should try to run the command with the following parameters:

dmarchiver -id "629006352329760768" -di -dg

You should also be careful about the information you post on this site. The conversation ID for a conversation between two people is "userid1-userid2", so it may be possible to work out who you're talking to on Twitter.
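To illustrate the privacy point above, here is a minimal Python sketch, assuming a one-to-one conversation ID really has the "userid1-userid2" form described. The function name `split_conversation_id` and the sample IDs are hypothetical and are not part of DMArchiver.

```python
def split_conversation_id(conversation_id):
    """Return the two user IDs embedded in a one-to-one conversation ID.

    Returns None when the ID does not have the "userid1-userid2" shape,
    for example a purely numeric ID like the one used in this thread.
    """
    parts = conversation_id.split("-")
    if len(parts) != 2 or not all(part.isdigit() for part in parts):
        return None
    return parts[0], parts[1]


# Made-up IDs, for illustration only.
print(split_conversation_id("1111-2222"))
print(split_conversation_id("629006352329760768"))
```

Anyone who sees an ID of the first kind learns both account IDs, which is why it is worth masking such IDs before pasting logs publicly.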

sussron commented 7 years ago

Thanks for the message Julien. The error happened the first time but then it ran. I can see the dm messages in my twitter account so they aren't deleted. I can do a screen shot to show you. For the one long one It just takes a long time to scroll back.

That great script you wrote was awesome I could put in my name and password and it just went and did its thing. So cool! How would I now run it just for one conversation with images. Just go to the command screen and type that line instead of using the zip link I downloaded?

Thanks Ronnie

Mincka commented 7 years ago

On some rare occasions, the script may have an error due to a connection issue.

Just open a Terminal (command screen) and copy paste the following: /Users/ronniesussman/Downloads/dmarchiver -id "629006352329760768" -di -dg

The script will download the 50,000 messages of your thread again, but this time a folder will be created with the images and GIFs. It could take a bit longer to download. 😄

For the missing message, I'd like to know whether anything special about it could explain why you can't find it in the generated file (special characters, emojis, a very long message...).
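As an aside on the connection issue mentioned above, a transient failure of this kind is often handled by simply trying again. The sketch below is a generic illustration and not DMArchiver's code: the helper `retry`, its parameters, and the `flaky` function are all hypothetical. It retries on Python's built-in `ConnectionError` by default; an HTTP library may raise its own exception types, which would have to be passed in through `retry_on`.

```python
import time


def retry(operation, attempts=3, delay_seconds=1.0, retry_on=(ConnectionError,)):
    """Call operation() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay_seconds)


# Hypothetical operation that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network problem")
    return "done"


print(retry(flaky, delay_seconds=0))
```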

sussron commented 7 years ago

For the missing threads It's actually not a very large long message. That's what's weird. Maybe I'll see if I can find the message id identifier and try it individually instead of as part of the group.

Thanks Julien Ronnie

Mincka commented 7 years ago

You cannot specify a message ID; the tool only accepts a conversation (or "thread") ID.

Try running the command from my previous message and check whether you were able to download the complete conversation, with images this time.

sussron commented 7 years ago

oh i meant conversation not message, but let me try doing that inspect elements thing to see if i can find the missing messages. Thanks so much for your patience and helping me learn. Ronnie

sussron commented 7 years ago

Ok so it's running now on a single thread and looks to be processing more tweets (this one is up to 75,000 now and counting) that may have done the trick. I'm so stinkin excited!! Thank you thank you thank you! You rock! Ronnie

Mincka commented 7 years ago

I wouldn't have guessed people have such crazy conversations going on in Twitter DMs. 😝 You're pushing the limits of the tool.

Let me know how many tweets have been archived for this thread once it finishes. 😄

You can already check the downloaded images in your "Downloads" folder: a new folder "629006352329760768" should have been created with the pictures and GIFs (as MP4 files).

sussron commented 7 years ago

127,555 messages in one conversation thread

sussron commented 7 years ago

you did it. you did it!!!! Woo hoo!!!!! That conversation means the world to me, you can't even begin to know. thank you soo much

sussron commented 7 years ago

i tried another one, but got this error, do you know what it means?

Ronnies-MacBook-Pro:~ ronniesussman$ /Users/ronniesussman/Downloads/dmarchiver -id "629006352329760768" -di -dg
Enter your username or email: beckybulldognj
Enter your password (characters will not be displayed):
Authentication succeedeed.
Conversation ID specified (629006352329760768). Retrieving only one thread.
Starting crawl of '629006352329760768'
Failed to execute script cmdline
Traceback (most recent call last):
  File "dmarchiver/cmdline.py", line 70, in
  File "dmarchiver/cmdline.py", line 62, in main
  File "dmarchiver/core.py", line 468, in crawl
  File "requests/models.py", line 826, in json
  File "json/__init__.py", line 319, in loads
  File "json/decoder.py", line 339, in decode
  File "json/decoder.py", line 357, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Ronnies-MacBook-Pro:~ ronniesussman$
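For readers landing here with the same traceback: `Expecting value: line 1 column 1 (char 0)` is what Python's `json` module reports when the text it is given does not start with a JSON value, for example an empty body or an HTML page. The sketch below reproduces the message with the standard library only. The helper `parse_json_or_none` is hypothetical and is not how DMArchiver handles the response; what the server actually returned in this case is not shown in the thread.

```python
import json


def parse_json_or_none(body):
    """Parse a response body as JSON, returning None if it is not JSON."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as error:
        # The exception text includes the line, column and character offset.
        print("Not JSON:", error)
        return None


parse_json_or_none("")               # empty body
parse_json_or_none("<html></html>")  # HTML page instead of JSON
print(parse_json_or_none('{"ok": true}'))
```

Printing the first few characters of the raw body before parsing is usually the quickest way to see what was received instead of JSON.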

Mincka commented 7 years ago

Ronnie,

I've created another specific issue for this error because I consider this one solved. Could you go there and check for the questions I have regarding this new error message? Thank you.

https://github.com/Mincka/DMArchiver/issues/7