Shoot, I just realized the extension is derived from fName, I'm going to have to split that.
I also tried to make eval() a little safer by enclosing it in str(). I know it's still not ideal, but the alternative is writing an interpreter for the filePattern key that converts it into a proper syntax.
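Short of a full interpreter, one middle-ground is restricting what the expression can see. A minimal sketch of the idea (the variable names here are placeholders, not the real config keys), which reduces but does not eliminate what eval() can reach:

```python
# Sketch only: evaluate a user-supplied filename pattern against a small,
# explicit namespace. Variable names here are illustrative placeholders.
ALLOWED = {"tw_user": "someone", "tw_id": "12345", "seq": 0}

def render_pattern(pattern):
    # An empty __builtins__ dict blocks most (not all) escape routes from eval().
    return str(eval(pattern, {"__builtins__": {}}, dict(ALLOWED)))

print(render_pattern('"%s-%s_img%i" % (tw_user, tw_id, seq)'))  # someone-12345_img0
```

With __builtins__ emptied, a pattern that tries to call __import__ or open raises NameError instead of executing.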
@fake-name Just a heads-up, apparently Twitter is pulling a Tumblr and might be pulling a lot of adult content soon.
Fuuuuck, really? Where'd you see that?
@fake-name https://outline.com/NrSV3w
I don't consume the sorts of content they're talking about banning here, but, in addition to what that article goes over:
Ok, I spent some time fiddling with the twitter interface, and it turns out I think you can view all of a user's tweets without logging in. 87d1cec uses the search facilities to try to get all tweets posted by a user.
I'm not totally sure it's correct yet. It chunks posts up by week intervals, so if someone has LOTS of tweets, it still might miss things. Please let me know if you find it missing things.
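Roughly what the chunking does, as I understand it: generate fixed-width since/until windows and feed each one into the search query. A sketch (helper name is mine, not the actual code):

```python
import datetime

def search_windows(start, end, days=7):
    """Yield (since, until) date pairs covering [start, end] in fixed-size chunks."""
    cur = start
    step = datetime.timedelta(days=days)
    while cur < end:
        # Clamp the last window so it never extends past `end`.
        yield cur, min(cur + step, end)
        cur += step

# Each window becomes a query like: from:someone since:2019-03-30 until:2019-04-06
windows = list(search_windows(datetime.date(2019, 3, 30), datetime.date(2019, 4, 20)))
```

The caveat above still applies: if a single window contains more tweets than the search endpoint will page through, that window can silently drop items.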
I am getting a ton of parser errors from urllib3, I'm not sure how relevant they are. Tons of stuff that looks like this:
Should fetch for 2019-03-30 05:03:26.143637 2019-04-08 05:03:26.143637
urllib3.connectionpool - DEBUG - https://twitter.com:443 "GET /i/search/timeline?vertical=default&q=from%3Asomeone%20since%3A2019-03-30%20until%3A2019-04-08&src=typd&include_available_features=1&include_entities=1&reset_error_state=false HTTP/1.1" 200 117
Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\pyquery\pyquery.py", line 96, in fromstring
    result = getattr(etree, meth)(context)
  File "src\lxml\etree.pyx", line 3234, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1764, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 17
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 17, column 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "D:\xA-Scraper\xascraper\modules\twit\vendored_twitter_scrape.py", line 124, in gen_tweets
    html = HTML(html=response_json['items_html'], url='bunk', default_encoding='utf-8')
  File "C:\Python37\lib\site-packages\requests_html.py", line 421, in __init__
    element=PyQuery(html)('html') or PyQuery(f'<html>{html}</html>')('html'),
  File "C:\Python37\lib\site-packages\pyquery\pyquery.py", line 256, in __init__
    elements = fromstring(context, self.parser)
  File "C:\Python37\lib\site-packages\pyquery\pyquery.py", line 100, in fromstring
    result = getattr(lxml.html, meth)(context)
  File "C:\Python37\lib\site-packages\lxml\html\__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "C:\Python37\lib\site-packages\lxml\html\__init__.py", line 764, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty
Page: <Response [200]> {'has_more_items': False, 'items_html': '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n', 'new_latent_count': 0, 'focused_refresh_interval': 30000}
Yeah, that's normal, if annoying. Basically when you hit the end of a search range inside the requests thing, it prints a bunch of debug crap, and then throws an exception (which I handle). Everything works, it's just annoying.
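For what it's worth, the noisy part could probably be skipped by checking the payload before handing it to the HTML parser at all. A sketch, assuming the response shape shown above (function name is mine):

```python
def extract_items(response_json):
    """Return the items_html fragment, or None when the search range is exhausted."""
    html = response_json.get('items_html', '')
    if not html.strip():
        # Whitespace-only payload: end of the search range, nothing to parse.
        return None
    return html
```

A None return would then end the loop cleanly instead of letting lxml blow up on an empty document.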
Actually, hold on.
Also, for some reason it's running three separate python instances for a single site instead of one now?
> Also, for some reason it's running three separate python instances instead of one now?
Uh, what?
I mean, maybe. It looks like something's doing some multiprocessing. I haven't looked too closely at how the dependencies work. ¯\_(ツ)_/¯
It certainly seems to operate correctly.
Well it doesn't look like it's causing any higher CPU load or anything, so it's not a big deal. Just something I noticed while playing with my new shell settings.
Yeah, I think it's either requests_html or mechanicalsoup doing ~~things~~.
Really, I should drop their requirements entirely. Mostly just laaaaazy.
Getting a lot of these, too. Also harmless?
Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\sqlalchemy\engine\base.py", line 1249, in _execute_context
    cursor, statement, parameters, context
  File "C:\Python37\lib\site-packages\sqlalchemy\engine\default.py", line 552, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "art_tag_item_id_tag_key"
DETAIL: Key (item_id, tag)=(130436, #Sometag) already exists.
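If that error comes from re-inserting tags that are already in the table, Postgres can be told to skip the duplicates with an ON CONFLICT clause. A sketch using sqlite3 so it runs standalone (PostgreSQL accepts the same clause; the art_tag table and column names are guessed from the constraint in the error message):

```python
import sqlite3

# In-memory stand-in for the real table behind "art_tag_item_id_tag_key".
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE art_tag (item_id INTEGER, tag TEXT, UNIQUE (item_id, tag))')

# DO NOTHING turns a duplicate-key insert into a silent no-op instead of an error.
sql = 'INSERT INTO art_tag (item_id, tag) VALUES (?, ?) ON CONFLICT (item_id, tag) DO NOTHING'
conn.execute(sql, (130436, '#Sometag'))
conn.execute(sql, (130436, '#Sometag'))  # duplicate: ignored, no exception

count = conn.execute('SELECT COUNT(*) FROM art_tag').fetchone()[0]
print(count)  # 1
```

SQLAlchemy's PostgreSQL dialect exposes the same thing as insert(...).on_conflict_do_nothing(), which would keep it out of the logs entirely rather than catching the exception after the fact.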
Oh, before I forget, I came across this the other day, apparently a lot of manga scrapers currently use it for getting around cloudflare. If you're ever looking into improving your WebRequest library it might be useful to look at.
> Oh, before I forget, I came across this the other day, apparently a lot of manga scrapers currently use it for getting around cloudflare. If you're ever looking into improving your WebRequest library it might be useful to look at.
You mean the one that's already integrated into it?
https://github.com/fake-name/WebRequest/blob/master/WebRequest/CloudscraperMixin.py
Heh, thanks, though!
Oh, huh. I didn't realize. I assume you had to mess with the code too much to use it as a submodule?
Also, is that other error message I posted harmless?
> Oh, huh. I didn't realize. I assume you had to mess with the code too much to use it as a submodule?
I just straight up use it as a dependency. All that file does is handle integrating it into the normal anti-CF flow (there are similar modules for handling it via headless Chromium, and the now-deprecated PhantomJS).
> Also, is that other error message I posted harmless?
I believe so. I'll see about shutting it up too.
@fake-name Uhm, is this downloading retweets? I seem to have 6,692 folders in my Twitter directory...
Uh, yes?
I have a ton of tweets I've scraped previously that use a different file pattern, so I figured it would be good to have a way of defining it in my settings instead of editing the source code after every update.
I renamed a lot of variables to shorter versions to help with configuring the pattern. For reference, what I use is
"filePattern" : "\"%s-%s-20%s%02i%02i_img%i\" % (tw_user, tw_id, tw_y, tw_m, tw_d, seq)"