fake-name / xA-Scraper


TwitGet: file pattern from settings #72

Closed God-damnit-all closed 4 years ago

God-damnit-all commented 4 years ago

I have a ton of tweets I've scraped previously that use a different file pattern, so I figured it would be good to have a way of defining it in my settings instead of editing the source code after every update.

I renamed a lot of variables to shorter versions to help with configuring the pattern. For reference, what I use is "filePattern" : "\"%s-%s-20%s%02i%02i_img%i\" % (tw_user, tw_id, tw_y, tw_m, tw_d, seq)"
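For illustration, a minimal sketch of how that setting would be applied (the variable names come from the pattern above; the surrounding plumbing is hypothetical, not the actual PR code):

```python
# Hypothetical sketch: the "filePattern" setting is a Python expression
# string which gets eval()'d with the tweet's fields in scope.
filePattern = '"%s-%s-20%s%02i%02i_img%i" % (tw_user, tw_id, tw_y, tw_m, tw_d, seq)'

# Example field values (names match the pattern above):
tw_user = "someone"             # account name
tw_id = "1109876543"            # tweet id
tw_y, tw_m, tw_d = "19", 4, 8   # two-digit year string, month, day
seq = 1                         # image index within the tweet

fName = str(eval(filePattern))
print(fName)  # someone-1109876543-20190408_img1
```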

God-damnit-all commented 4 years ago

Shoot, I just realized the extension is derived from fName, I'm going to have to split that.

God-damnit-all commented 4 years ago

I also tried to make eval() a little safer by enclosing it in str(). I know it's still not ideal, but the alternative is coding an interpreter for the filePattern key that would convert it into proper syntax.
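As a hedged sketch of the trade-off being described: str() only coerces the result, but the expression itself can still run arbitrary code. One partial mitigation (not what this PR does) is evaluating with stripped builtins and only the whitelisted tweet fields in scope:

```python
# Evaluate the pattern with no builtins and only whitelisted names.
# This is a sketch, not a real sandbox: eval() of untrusted input
# remains risky even with these restrictions.
pattern = '"%s-%s" % (tw_user, tw_id)'
fields = {"tw_user": "someone", "tw_id": "1109876543"}

fname = str(eval(pattern, {"__builtins__": {}}, fields))
print(fname)  # someone-1109876543
```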

God-damnit-all commented 4 years ago

@fake-name Just a heads-up, apparently Twitter is pulling a Tumblr and might be pulling a lot of adult content soon.

fake-name commented 4 years ago

Fuuuuck, really? Where'd you see that?

God-damnit-all commented 4 years ago

@fake-name https://outline.com/NrSV3w

I don't consume the sorts of content they're talking about banning here, but, in addition to what that article goes over:

  1. Twitter has been consistently making access to adult content much tighter throughout the year.
  2. More than a few artists I follow, who again don't produce those sorts of content, have been having to file appeals to Twitter over suspensions with increasing frequency.
  3. COPPA holds companies liable for shit they shouldn't be held liable for, and so said companies take a 'better safe than sorry' approach to dealing with adult content.
fake-name commented 4 years ago

Ok, I spent some time fiddling with the twitter interface, and it turns out (I think) you can view all of a user's tweets without logging in. 87d1cec uses the search facilities to try to get all tweets posted by a user.

I'm not totally sure it's correct yet. It chunks posts up by week intervals, so if someone has LOTS of tweets it still might miss things. Please let me know if you notice anything missing.
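The week-interval chunking described above can be sketched roughly like this (the function name and query format are illustrative, not the actual code in 87d1cec):

```python
from datetime import date, timedelta

def date_chunks(since, until, step_days=7):
    """Yield (start, end) pairs covering [since, until) in fixed-size steps.
    Each pair would become one search query, e.g.
    'from:USER since:START until:END'."""
    cur = since
    while cur < until:
        nxt = min(cur + timedelta(days=step_days), until)
        yield cur, nxt
        cur = nxt

chunks = list(date_chunks(date(2019, 3, 30), date(2019, 4, 20)))
# 3 chunks: 03-30 -> 04-06, 04-06 -> 04-13, 04-13 -> 04-20
```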

God-damnit-all commented 4 years ago

I am getting a ton of parser errors from urllib3, I'm not sure how relevant they are. Tons of stuff that looks like this:

Should fetch for  2019-03-30 05:03:26.143637 2019-04-08 05:03:26.143637
urllib3.connectionpool - DEBUG - https://twitter.com:443 "GET /i/search/timeline?vertical=default&q=from%3Asomeone%20since%3A2019-03-30%20until%3A2019-04-08&src=typd&include_available_features=1&include_entities=1&reset_error_state=false HTTP/1.1" 200 117
Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\pyquery\pyquery.py", line 96, in fromstring
    result = getattr(etree, meth)(context)
  File "src\lxml\etree.pyx", line 3234, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1764, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 17
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 17, column 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\xA-Scraper\xascraper\modules\twit\vendored_twitter_scrape.py", line 124, in gen_tweets
    html = HTML(html=response_json['items_html'], url='bunk', default_encoding='utf-8')
  File "C:\Python37\lib\site-packages\requests_html.py", line 421, in __init__
    element=PyQuery(html)('html') or PyQuery(f'<html>{html}</html>')('html'),
  File "C:\Python37\lib\site-packages\pyquery\pyquery.py", line 256, in __init__
    elements = fromstring(context, self.parser)
  File "C:\Python37\lib\site-packages\pyquery\pyquery.py", line 100, in fromstring
    result = getattr(lxml.html, meth)(context)
  File "C:\Python37\lib\site-packages\lxml\html\__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "C:\Python37\lib\site-packages\lxml\html\__init__.py", line 764, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty
Page:  <Response [200]> {'has_more_items': False, 'items_html': '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n', 'new_latent_count': 0, 'focused_refresh_interval': 30000}
fake-name commented 4 years ago

Yeah, that's normal, if annoying. Basically when you hit the end of a search range inside the requests thing, it prints a bunch of debug crap, and then throws an exception (which I handle). Everything works, it's just annoying.
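For what it's worth, a hedged sketch of that end-of-range case (the dict mirrors the 'Page:' line in the log above; parse() is a hypothetical stand-in for the real requests_html/pyquery hand-off):

```python
# When a date range has no tweets, the endpoint still answers 200 but
# 'items_html' is effectively empty whitespace, which is what makes
# lxml raise "Document is empty". Checking before parsing would avoid
# the noisy (already-handled) traceback.
response_json = {
    "has_more_items": False,
    "items_html": "\n\n\n\n \n",
}

items_html = response_json["items_html"]
if not items_html.strip():
    tweets = []                  # end of this search range, nothing to parse
else:
    tweets = parse(items_html)   # hypothetical hand-off to the HTML parser

print(len(tweets))  # 0
```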

Actually, hold on.

God-damnit-all commented 4 years ago

Also, for some reason it's running three separate python instances for a single site instead of one now?

fake-name commented 4 years ago

> Also, for some reason it's running three separate python instances instead of one now?

Uh, what?

I mean, maybe. It looks like something's doing some multiprocessing. I haven't looked too closely at how the dependencies work. ¯\_(ツ)_/¯

It certainly seems to operate correctly.

God-damnit-all commented 4 years ago

Well it doesn't look like it's causing any higher CPU load or anything, so it's not a big deal. Just something I noticed while playing with my new shell settings.

fake-name commented 4 years ago

Yeah, I think it's either requests_html or mechanicalsoup doing ~~things~~.

Really, I should drop their requirements entirely. Mostly just laaaaazy.

God-damnit-all commented 4 years ago

Getting a lot of these, too. Also harmless?

Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\sqlalchemy\engine\base.py", line 1249, in _execute_context
    cursor, statement, parameters, context
  File "C:\Python37\lib\site-packages\sqlalchemy\engine\default.py", line 552, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "art_tag_item_id_tag_key"
DETAIL:  Key (item_id, tag)=(130436, #Sometag) already exists.
God-damnit-all commented 4 years ago

Oh, before I forget, I came across this the other day, apparently a lot of manga scrapers currently use it for getting around cloudflare. If you're ever looking into improving your WebRequest library it might be useful to look at.

https://github.com/venomous/cloudscraper

fake-name commented 4 years ago

> Oh, before I forget, I came across this the other day, apparently a lot of manga scrapers currently use it for getting around cloudflare. If you're ever looking into improving your WebRequest library it might be useful to look at.
>
> https://github.com/venomous/cloudscraper

You mean the one that's already integrated into it?

https://github.com/fake-name/WebRequest/blob/master/WebRequest/CloudscraperMixin.py

Heh, thanks, though!

God-damnit-all commented 4 years ago

Oh, huh. I didn't realize. I assume you had to mess with the code too much to use it as a submodule?

Also, is that other error message I posted harmless?

fake-name commented 4 years ago

> Oh, huh. I didn't realize. I assume you had to mess with the code too much to use it as a submodule?

I just straight up use it as a dependency. All that file does is handle integrating it into the normal anti-CF flow (there are similar modules for handling it via headless Chromium, and the now-deprecated PhantomJS).

> Also, is that other error message I posted harmless?

I believe so. I'll see about shutting it up too.
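One way to shut it up at the database level, as a sketch: make the tag insert idempotent so re-inserting an existing (item_id, tag) pair is a no-op. Demonstrated here with stdlib sqlite3 (`INSERT OR IGNORE`); the psycopg2/Postgres equivalent would be `INSERT ... ON CONFLICT DO NOTHING`. The table shape is inferred from the constraint name in the traceback, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE art_tag (item_id INTEGER, tag TEXT, UNIQUE (item_id, tag))"
)
conn.execute("INSERT INTO art_tag VALUES (130436, '#Sometag')")

# A plain second INSERT of the same key would raise IntegrityError
# (the sqlite analogue of psycopg2's UniqueViolation); OR IGNORE
# silently skips the duplicate instead.
conn.execute("INSERT OR IGNORE INTO art_tag VALUES (130436, '#Sometag')")

count = conn.execute("SELECT COUNT(*) FROM art_tag").fetchone()[0]
print(count)  # 1
```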

God-damnit-all commented 4 years ago

@fake-name Uhm, is this downloading retweets? I seem to have 6,692 folders in my Twitter directory...

fake-name commented 4 years ago

Uh, yes?