Closed denisjacquemin closed 7 years ago
Hi there, I basically have the same error, trying to download a huge DM thread:
Conversation ID specified (xxxxx). Retrieving only one thread.
Starting crawl of 'xxxxx'
Traceback (most recent call last):
File "/usr/local/bin/dmarchiver", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 62, in main
args.download_gifs)
File "/usr/local/lib/python3.5/site-packages/dmarchiver/core.py", line 463, in crawl
tweets, download_images, download_gif)
File "/usr/local/lib/python3.5/site-packages/dmarchiver/core.py", line 377, in _process_tweets
document = lxml.html.fragment_fromstring(value)
File "/usr/local/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
base_url=base_url, **kw)
File "/usr/local/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/local/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1
Note that the script has been able to download perfectly a short thread (just a few DM, no images no nothing).
Hello Laurent,
Are you also using macOS? It seems there is an error with the lxml library when it reaches a message with accented characters. Could you confirm there are no accented characters in the short thread that works for you?
It's quite difficult for me to identify the exact cause because I do not own a Mac to debug it. It works properly on Windows and Linux. I'll keep looking for a possible fix for macOS.
There's a command for UTF-8 support in the Terminal that should be run before the script, but I'm not sure it would make a difference here:
export PYTHONIOENCODING=utf-8
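To see whether that variable changes anything on a given machine, one option is to print the encodings Python reports before and after exporting it. This is only a minimal sketch: the helper name `io_encodings` is made up for illustration and is not part of dmarchiver.

```python
import locale
import sys


def io_encodings():
    """Return the text encodings the running interpreter reports.

    Hypothetical helper for illustration only; not part of dmarchiver.
    """
    return {
        # PYTHONIOENCODING overrides the encoding of the standard streams.
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
    }


for name, encoding in io_encodings().items():
    print(name, "->", encoding)
```

Running it once in a fresh Terminal and once after the export shows whether the standard streams actually switch to UTF-8.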
Hi, As I said via email (I thought it would be also posted here, whatever), I do have more or less the same conf: Mac OS 10.11.6, Python 3.5, lxml 3.6.4. Unfortunately, the short thread that worked also contains accented characters (damn french people), so that's probably not about that…
I tried to execute the command you gave, but the problem is still there.
Thanx for the help, it would be really cool to have this script work.
I'm going to add a raw mode to fetch JSON responses without using the parser. I will also add a verbose mode and add proper error handling. I hope it will help us to find the root cause. Thanks for the tests.
Zupa. Keep up the good work, looking forward to testing it :)
(BTW, just tested the Windows exe on a basic Windows 10 Family, worked perfectly fine with every kind of DM thread… good job) (but it seems that the GMT offset is not correct, like the French +2 is missing)
Yep. I've already updated the script to use the time of the locale instead of the UTC one. It has not been pushed yet to GitHub. And for the error, it confirms the issue is related to the macOS setup.
Thanks to a friend of mine with a Mac, I've been able to track down what seems to be the root cause of this bug.
The parsing fails when a tweet contains an emoji. The generated code for the image looks like this:
<img title="Visage avec des larmes de joie" class="Emoji Emoji--forText" draggable="false" aria-label="Emoji: Visage avec des larmes de joie" alt="😂" src="https://abs.twimg.com/emoji/v2/72x72/1f602.png">
It contains the alt attribute with the Unicode character of the smiley (😂).
With this new information, I've found this bug ticket with a similar issue: https://bugs.launchpad.net/lxml/+bug/1538213
Additional tests have been done on macOS and no issue has been identified with multiple kinds of accented characters or URLs. This issue only seems to occur with Unicode emoji.
Consequently, I'm going to do the following: 1) Implement a platform specific workaround for Mac OS with platform detection.
import re
from sys import platform

# Mac OS lxml bug workaround
if platform == "darwin":
    # Inject emojis' titles into alt attributes, replacing the tweet's unicode emojis,
    # to prevent the encoding error with lxml while keeping a coherent alt attribute
    value = re.sub(r'(title="(.*?)".*?class="Emoji.*?alt=")[^"]*"', r'\g<1>\g<2>"', value)
or simpler alternative
if platform == "darwin":
    # Clear alt attributes of emojis
    value = re.sub(r'(class="Emoji.*?)alt=".*?"', r'\g<1>alt=""', value)
2) Add a proper try / catch for the parsing
3) Complete the bug ticket
\o/
Could you just confirm there was no emoji for the thread you've been able to parse on macOS, Laurent?
Yes, it was an old and short thread with no emojis at the time…
Having the exact same problem. Happy to hear you're working on a fix!
\o/ (not using emoji in order not to break anything ;-)
I think I have a fix in b7c316a for the Mac OS users, but I need confirmation, guys. You can now upgrade the package and test again.
$ pip3 install dmarchiver --upgrade
$ dmarchiver
I did. Got a little further this time: 3 images (instead of 0), 0 text files. Error:
Authentication succeedeed.
Conversation ID specified (123). Retrieving only one thread.
Starting crawl of '123'
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.5/bin/dmarchiver", line 9, in <module>
load_entry_point('dmarchiver==0.0.7', 'console_scripts', 'dmarchiver')()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 62, in main
args.download_gifs)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 463, in crawl
tweets, download_images, download_gif)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 377, in _process_tweets
document = lxml.html.fragment_fromstring(value)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
base_url=base_url, **kw)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1
Maybe something went wrong with the update? I got this:
Collecting dmarchiver
Downloading dmarchiver-0.0.8.zip
Collecting requests>=2.11.1 (from dmarchiver)
Using cached requests-2.11.1-py2.py3-none-any.whl
Collecting lxml>=3.6.4 (from dmarchiver)
Using cached lxml-3.6.4.tar.gz
Collecting cssselect>=0.9.2 (from dmarchiver)
Using cached cssselect-1.0.0-py2.py3-none-any.whl
Installing collected packages: requests, lxml, cssselect, dmarchiver
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/basecommand.py", line 215, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/commands/install.py", line 342, in run
prefix=options.prefix_path,
File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/req/req_set.py", line 784, in install
**kwargs
File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/req/req_install.py", line 849, in install
self.move_wheel_files(self.source_dir, root=root, prefix=prefix)
File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/req/req_install.py", line 1062, in move_wheel_files
isolated=self.isolated,
File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/wheel.py", line 345, in move_wheel_files
clobber(source, lib_dir, True)
File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/wheel.py", line 316, in clobber
ensure_dir(destdir)
File "/Library/Python/2.7/site-packages/pip-9.0.0-py2.7.egg/pip/utils/__init__.py", line 83, in ensure_dir
os.makedirs(path)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 157, in makedirs
mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/requests'
@muesliq: It seems you're using the wrong version of Python (2.7 instead of 3.5). Could you try with pip3 install dmarchiver --upgrade?
That's my fault. It's mandatory to specify pip3 on Mac OS X because both versions are installed. I've updated my previous post.
And I guess you were able to download more images only because those images were uploaded recently, with no emojis in the tweets sent with or after them.
Updated, thanks! Better now but not fixed yet. Thousands of tweets processed, 129 images, yet still 0 text files.
Authentication succeedeed.
Conversation ID specified (123). Retrieving only one thread.
Starting crawl of '123'
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.5/bin/dmarchiver", line 9, in <module>
load_entry_point('dmarchiver==0.0.8', 'console_scripts', 'dmarchiver')()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 62, in main
args.download_gifs)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 470, in crawl
tweets, download_images, download_gif)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 384, in _process_tweets
document = lxml.html.fragment_fromstring(value)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
base_url=base_url, **kw)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1
Ok, thanks. I've added exception handling to print the ID of the tweet that raises the exception. The script should now continue, even when a tweet is causing issues.
You can upgrade with pip3 install dmarchiver --upgrade.
This is a poor, temporary solution, but the raw HTML of the offending tweets will also be output in the log file as a [DMConversationEntry] with a [ParseError] tag. It will help me understand what's causing the issue.
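The continue-on-error behaviour described here could look roughly like the sketch below. It is not the actual dmarchiver code: `process_tweets`, `fragile_parse` and the `log` list are hypothetical stand-ins, and the parser is passed in as a plain callable so the example runs without lxml.

```python
def process_tweets(tweets, parse, log):
    """Parse each (tweet_id, raw_html) pair, logging failures instead of aborting.

    `parse` is any callable that turns raw HTML into a document and `log` is a
    list collecting output lines; both are hypothetical, for illustration only.
    """
    parsed = {}
    for tweet_id, raw_html in tweets:
        try:
            parsed[tweet_id] = parse(raw_html)
        except Exception:  # deliberately broad: archive as much as possible
            print("Unexpected error for tweet '{0}', but still I continue.".format(tweet_id))
            log.append(
                "[DMConversationEntry] [ParseError] Parsing of tweet '{0}' failed. "
                "Raw HTML: {1}".format(tweet_id, raw_html)
            )
    return parsed


def fragile_parse(raw_html):
    """Toy parser that rejects characters above U+FFFF, mimicking the macOS bug."""
    if any(ord(ch) > 0xFFFF for ch in raw_html):
        raise ValueError("switching encoding: encoder error")
    return raw_html.strip()


log = []
tweets = [("1", "<p>bonjour</p>"), ("2", "<p>\U0001F602</p>"), ("3", "<p>\u00e9t\u00e9</p>")]
result = process_tweets(tweets, fragile_parse, log)
```

Tweets 1 and 3 end up in `result`, while tweet 2 is kept only as raw HTML in `log`, so one bad message no longer stops the whole crawl.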
The only weird situation I saw is a random position of the img attributes that makes the regex fail. I've seen title before alt on one computer and after alt on another... Maybe that's the same here with class, or it could be emoji used in cards or other content types.
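One way to make the replacement independent of attribute order is to match each emoji img tag as a whole first, then rewrite the alt attribute inside it. The following is a sketch under assumptions rather than the fix that was shipped: the function names are invented, and it assumes no attribute value contains a ">" character.

```python
import re

# One complete <img ...> tag whose class starts with "Emoji", in any attribute order.
EMOJI_IMG = re.compile(r'<img\b[^>]*\bclass="Emoji[^"]*"[^>]*>')
TITLE = re.compile(r'(?<=\s)title="([^"]*)"')
ALT = re.compile(r'(?<=\s)alt="[^"]*"')


def alt_from_title(match):
    """Rewrite one emoji tag so that alt carries the title text (or is emptied)."""
    tag = match.group(0)
    title = TITLE.search(tag)
    text = title.group(1) if title else ""
    # A function replacement avoids backslash interpretation in the title text.
    return ALT.sub(lambda _m: 'alt="{0}"'.format(text), tag)


def replace_emoji_alts(html):
    """Apply alt_from_title to every emoji image in an HTML fragment."""
    return EMOJI_IMG.sub(alt_from_title, html)


title_first = ('<img title="Visage avec des larmes de joie" class="Emoji Emoji--forText" '
               'alt="\U0001F602" src="https://abs.twimg.com/emoji/v2/72x72/1f602.png">')
alt_first = ('<img alt="\U0001F602" class="Emoji Emoji--forText" '
             'title="Visage avec des larmes de joie">')
cleaned = [replace_emoji_alts(title_first), replace_emoji_alts(alt_first)]
```

Both orderings end up with the title text in alt, and images without the Emoji class are left untouched.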
Now the upgrade doesn't seem to work:
pip3 install dmarchiver --upgrade
Requirement already up-to-date: dmarchiver in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages
Requirement already up-to-date: requests>=2.11.1 in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (from dmarchiver)
Requirement already up-to-date: lxml>=3.6.4 in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (from dmarchiver)
Requirement already up-to-date: cssselect>=0.9.2 in /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (from dmarchiver)
I had the same issue. It's quite strange. Maybe a temporary issue with PyPI?
I've been able to uninstall it and reinstall it with the latest version (0.0.10).
To exclude caching issues for package download, I've also deleted the following folder on Windows:
C:\Users\[User]\AppData\Local\pip\cache
For Unix, it seems to be ~/.pip/cache/ but I'm not sure.
Hi! No problem with the upgrade here, and I have been able to archive a few DM threads, including big ones with emoji, pictures… Nice!
One error though, with one thread. Had a lot of
Unexpected error for tweet 'xxxx', but still I continue.
The Twitter user has an emoji in her username (see below the beginning of the file that has been written)
[DMConversationEntry] [ParseError] Parsing of tweet 'xxxx' failed. Raw HTML: <div class="DirectMessage
DirectMessage--received
clearfix dm js-dm-item"
data-quick-reply-json="null"
data-message-id="xxxx"
data-item-id="xxxx"
data-card-component="dm_existing_conversation_dialog"
data-component-context="dm_existing_conversation_dialog">
<div class="DirectMessage-container">
<div class="DirectMessage-avatar">
<a href="/xxxx" class="js-action-profile js-user-profile-link" data-user-id="xxxx">
<div class="DMAvatar DMAvatar--1 u-chromeOverflowFix">
<span class="DMAvatar-container">
<img class="DMAvatar-image" src="xxxx alt="SabineLC π">
</span>
</div>
I guess it might be the problem..?
We're getting there!
pip3 install dmarchiver --upgrade --ignore-installed
seems to have done the trick. And it works just fabulous! You managed to fix the bugs, kudos!
Two tweets (out of 12620) had an "unexpected error". The first one contained the letter 𝜋. The second had the following tweet embedded (which contained lots of emoji): https://twitter.com/magnifier661/status/787044538145574912
Thanks a lot @LaurentLC and @muesliq!
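You've been able to identify 3 cases that are currently not handled properly: an emoji in a user's display name, the character 𝜋 in a message, and an embedded tweet containing emoji.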
I'm not sure yet how I will be able to find proper workarounds. The bug is in the lxml lib for Mac OS. Identifying emojis with regex does not seem possible. The error with 𝜋 (U+1D70B MATHEMATICAL ITALIC SMALL PI) also means that the issue will not be limited to emojis. It's only a simple character so it could mean the script cannot handle non-ASCII characters at all on Mac OS... :-/
Update: My guess is the error is related to code points encoded on four bytes. https://en.wikipedia.org/wiki/Unicode
Code points in Planes 1 through 16 (supplementary planes) are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8.
Emojis are also encoded in Plane 1 (1F000-1FFFF) so I may drop all content in the range 10000-2FFFF (Planes 1 & 2). It contains mainly ancient Egyptian characters, mathematical symbols and emojis.
For reference: http://stackoverflow.com/a/13752628/3049282
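The four-byte guess is easy to check from Python itself. The short sketch below compares an accented letter, the euro sign, the 𝜋 from the failing tweet and the emoji from the earlier example; the helper name `encoding_sizes` is made up for illustration.

```python
def encoding_sizes(ch):
    """Return (code point, UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-be")) // 2  # 2 units means a surrogate pair
    return "U+{0:04X}".format(ord(ch)), utf8_bytes, utf16_units


# e-acute, euro sign, mathematical italic small pi, face with tears of joy
for ch in ("\u00E9", "\u20AC", "\U0001D70B", "\U0001F602"):
    print(encoding_sizes(ch))
# ('U+00E9', 2, 1)
# ('U+20AC', 3, 1)
# ('U+1D70B', 4, 2)
# ('U+1F602', 4, 2)
```

Only the two characters above U+FFFF need four bytes in UTF-8 and a surrogate pair in UTF-16, which matches the pattern seen so far: accented characters work, while 𝜋 and emoji fail.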
By the way: Fantastic little piece of software. Thank you!
Happy to help.
I have implemented in 073a3589280ee513b404051a4b1c68f80ccbb590 a more general solution as a "fix" for this issue. On Mac OS X, all the Unicode characters encoded on 4 bytes are now replaced by "□" before the lxml parsing.
Consequently, it should fix all the encountered issues and allow a flawless parsing.
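For readers curious what such a replacement can look like, here is a minimal sketch. It is not the code from the commit: the names, the `force` parameter (added only so the behaviour can be tried on other platforms) and the choice of U+25A1 as the placeholder are all assumptions.

```python
import re
from sys import platform

# Every code point above U+FFFF, i.e. everything that needs 4 bytes in UTF-8.
SUPPLEMENTARY = re.compile("[\U00010000-\U0010FFFF]")
PLACEHOLDER = "\u25A1"  # white square; assumed placeholder character


def strip_supplementary(value, force=False):
    """Replace supplementary-plane characters before handing text to the parser.

    Only active on macOS ("darwin") unless `force` is set; `force` is a
    hypothetical switch for testing elsewhere.
    """
    if platform == "darwin" or force:
        return SUPPLEMENTARY.sub(PLACEHOLDER, value)
    return value


sample = "a\U0001F602b\U0001D70B \u00e9"
cleaned_sample = strip_supplementary(sample, force=True)
```

Accented characters such as "é" are left alone; only the emoji and the mathematical 𝜋 are swapped for the placeholder, so message text stays readable at the cost of losing those symbols.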
To celebrate this, I've bumped the version to 0.1.0.
Rejoice Mac users, I've been able to make a precompiled executable for macOS. It should be a lot easier for non-technical users to use. https://github.com/Mincka/DMArchiver/releases/tag/0.1.0
Fixed in 073a3589280ee513b404051a4b1c68f80ccbb590
OMGoodness I was so excited it was backing up messages with this new download and it all looked to be going and then i got an error screen, do you know what this means?
On Fri, Nov 4, 2016 at 10:53 AM, Julien Ehrhart notifications@github.com wrote:
Closed #1 https://github.com/Mincka/DMArchiver/issues/1.
this is what it looked like as it was running before it got the error
now i got this screen
oh it didn't let me attach the 5MB file of the one particular message thread.
But here are all the various threads that were in the command screen. The most important one is the Starting crawl of '629006352329760768'
Last login: Mon Nov 7 20:43:14 on ttys000
Ronnies-MacBook-Pro:~ ronniesussman$ /Users/ronniesussman/Downloads/dmarchiver ; exit;
Enter your username or email: beckybulldognj
Enter your password (characters will not be displayed):
Authentication succeedeed.
Conversation ID not specified. Retrieving all the threads.
Starting crawl of '629006352329760768'
Begin of thread reached
Total processed tweets: 49899
Writing conversation to 629006352329760768.txt
[Truncated for confidentiality reasons]
logout
Saving session...
...copying shared history...
...saving history...truncating history files...
...completed.
[Process completed]
Wow so i tried it a second time and WOW!! it ran through the process. I'm so very very excited!!! here is the number of message threads it found and backed up ( pasted it to a word document). I noticed the message threads don't go back to inception, just a certain date. For example the one i'm attaching starts May 2016 and the conversation was started August 2015, does this have a time limit?
Trust me so i'm excited to have any of these, even in text version without the images videos or photos in any capacity(although with photos and videos would be INCREDIBLE), I was just curious.
Julien, thanks so much. Ronnie from New Jersey
I'm not sure all the messages were backed up. i'm looking for 2 particular ones that i can't find, but i'm going to go through all the txt files and see that i didn't miss it.
Thanks! Ronnie
Does seem it didn't capture all the conversations or go to the first line. Will note which message id if I can locate it on the source element page.
Thanks Ronnie
Hello Ronnie,
Glad to see you're getting better results. However, I am still not sure I understand which error message you're talking about. There is no known limit on the thread size. If there is an error, it should appear in the generated file. Messages deleted by the users cannot be recovered.
If you want to download images and GIFs from your specific conversation (629006352329760768), you should try to run the command with the following parameters:
dmarchiver -id "629006352329760768" -di -dg
You should also be careful with the information you post on this site. The conversation ID for a conversation between two people is "userid1-userid2", so it may be possible to work out who you're talking to on Twitter.
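To illustrate why a posted ID can leak information, here is a tiny sketch that splits an ID of the "userid1-userid2" form. The function name and the sample IDs are made up, and the format is taken only from the description above.

```python
def participants(conversation_id):
    """Split a one-to-one conversation ID of the form 'userid1-userid2'.

    Returns a tuple of the two user IDs, or None when the ID is not in that
    two-part numeric form. Hypothetical helper, for illustration only.
    """
    parts = conversation_id.split("-")
    if len(parts) == 2 and all(part.isdigit() for part in parts):
        return tuple(parts)
    return None


pair = participants("1111-2222")            # made-up IDs
single = participants("629006352329760768")  # a single numeric ID has no pair to split
```

Anyone who sees a two-part ID can recover both numeric user IDs with a one-line split, so it is safer to mask such IDs before posting logs.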
Thanks for the message Julien. The error happened the first time but then it ran. I can see the dm messages in my twitter account so they aren't deleted. I can do a screen shot to show you. For the one long one It just takes a long time to scroll back.
That great script you wrote was awesome I could put in my name and password and it just went and did its thing. So cool! How would I now run it just for one conversation with images. Just go to the command screen and type that line instead of using the zip link I downloaded?
Thanks Ronnie
On some rare occasions, the script may have an error due to a connection issue.
Just open a Terminal (command screen) and copy paste the following: /Users/ronniesussman/Downloads/dmarchiver -id "629006352329760768" -di -dg
The script will download again the 50,000 messages of your thread but this time, a folder will be created with images and GIFs. It could take a bit longer to download.
For the missing message, I'm interested to know if it has something special that could explain why you do not find it in the generated file (special characters, emojis, large message...).
As for the missing threads, it's actually not a very long message. That's what's weird. Maybe I'll see if I can find the message ID and try it individually instead of as part of the group.
Thanks Julien Ronnie
You cannot specify a message ID; the tool only accepts a conversation (or "thread") ID.
Try to run the command I sent you in my previous message and check whether you've been able to download a complete conversation, with images this time.
Oh, I meant conversation, not message, but let me try that inspect-element thing to see if I can find the missing messages. Thanks so much for your patience and for helping me learn. Ronnie
OK, so it's running now on a single thread and looks to be processing more tweets (this one is up to 75,000 now and counting). That may have done the trick. I'm so stinkin' excited!! Thank you thank you thank you! You rock! Ronnie
I wouldn't have guessed people had such crazy conversations going on in Twitter DMs. :stuck_out_tongue_closed_eyes: You're pushing the limits of the tool.
Tell me how many tweets have been archived in this thread once it finishes.
You can already check the downloaded images in your "Downloads" folder, a new folder "629006352329760768" should have been created with the pictures and GIFs (as MP4 files).
127,555 messages in one conversation thread
you did it. you did it!!!! Woo hoo!!!!! That conversation means the world to me, you can't even begin to know. thank you soo much
I tried another one but got this error. Do you know what it means?
Ronnies-MacBook-Pro:~ ronniesussman$ /Users/ronniesussman/Downloads/dmarchiver -id "629006352329760768" -di -dg
Enter your username or email: beckybulldognj
Enter your password (characters will not be displayed):
Authentication succeedeed.
Conversation ID specified (629006352329760768). Retrieving only one thread.
Starting crawl of '629006352329760768'
Failed to execute script cmdline
Traceback (most recent call last):
File "dmarchiver/cmdline.py", line 70, in <module>
File "dmarchiver/cmdline.py", line 62, in main
File "dmarchiver/core.py", line 468, in crawl
File "requests/models.py", line 826, in json
File "json/__init__.py", line 319, in loads
File "json/decoder.py", line 339, in decode
File "json/decoder.py", line 357, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Ronnies-MacBook-Pro:~ ronniesussman$
Ronnie,
I've created another specific issue for this error because I consider this one solved. Could you go there and check for the questions I have regarding this new error message? Thank you.
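A note for readers who land on the same traceback: `JSONDecodeError: Expecting value: line 1 column 1 (char 0)` is what Python's json module raises when the text it is given is empty or does not start with a JSON value, for example a blank response body or an HTML error page. What the server actually returned in Ronnie's case is not shown in this thread. The sketch below is not DMArchiver's code; `parse_json_or_none` is a hypothetical helper that only illustrates catching the error and showing the start of the offending body instead of crashing.

```python
import json


def parse_json_or_none(body):
    """Return the decoded JSON value, or None if the body is not valid JSON.

    Hypothetical helper for illustration only; not part of DMArchiver.
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError as error:
        # Report why decoding failed and preview the body, so the real cause
        # (empty reply, HTML error page, ...) is visible to the user.
        preview = body[:80]
        print("Response was not JSON ({}). Body starts with: {!r}".format(
            error.msg, preview))
        return None


if __name__ == "__main__":
    print(parse_json_or_none('{"items": []}'))  # a well-formed reply
    print(parse_json_or_none(""))               # an empty reply
```

With a check like this, a script can decide to retry or stop with a readable message rather than a raw traceback.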
Edited by Mincka on August 10th 2017: For anybody Googling for this error message:
XMLSyntaxError: switching encoding: encoder error
Possible workarounds:
1) Strip the emojis on macOS before the parsing; see this implementation in 073a3589280ee513b404051a4b1c68f80ccbb590.
2) Downgrade to Python 3.4 if you can. I attempted to upgrade to Python 3.6 but had other compatibility issues, this time with pyinstaller, so I was unable to move forward. Downgrading to Python 3.4 allows my tool to work perfectly on all platforms.
3) Remove the lxml package and reinstall it using STATIC_DEPS=true (https://github.com/lorien/grab/issues/199#issuecomment-297721800). However, I cannot guarantee this will work.
Using multiple Python versions on macOS is such a huge pain.
Original message: My setup: