fake-name / xA-Scraper


NewGrounds embeds not captured #81

Open God-damnit-all opened 4 years ago

God-damnit-all commented 4 years ago

Just now noticed this, but only the initial pic in a 'series' post is captured; artists often include extra pictures in the submission's "description". (Unfortunately these extra pictures are of lower resolution than the initial one, so artists often ALSO include HQ links to them pointing to imgur.com or files.catbox.moe ...)

fake-name commented 4 years ago

Can you e-mail me an instance of this? I hadn't seen that particular gallery structure.

I'm not sure what to do about general embedded stuff (this is an issue with patreon too). Scraping things like youtube and imgur is a substantially complicated task, and it's a whole lot of work I'd like to avoid.

I've thought about trying to use something like JDownloader as an external tool for this sort of thing. Right now, I just ignore external links.

God-damnit-all commented 4 years ago

I emailed you an example.

> I've thought about trying to use something like JDownloader as an external tool for this sort of thing. Right now, I just ignore external links.

Are they included in the database? It would be nice to be able to comb the database for a certain type of link that I know a CLI tool or JDownloader can handle.

fake-name commented 4 years ago

I mean, I try to save the contents of any text description, so.... maybe?

A lot of it is hard because it's basically done with freeform text input.

It'd be a pretty easy bit of SQL to dump every description from a specific user to a csv file for further poking, if you want.

```sql
\copy (
    SELECT
        content,
        content_structured
    FROM
        art_item
    WHERE
        artist_id IN (
            SELECT
                id
            FROM
                scrape_targets
            WHERE
                artist_name = 'artist-name-here'
        )
)
TO '~/Downloads/export.csv' CSV HEADER
```

God-damnit-all commented 4 years ago

CSV files are definitely workable; I just have to do regex matches, really.

I assume there's no easy way to get a separate csv file for each individual artist without doing some Python scripting?
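(For reference, the regex-matching step could look something like the sketch below — a minimal Python pass over an `export.csv` produced by the `\copy` query above. The host list and the assumption that links can appear in any column are guesses, not anything the tool guarantees.)

```python
# Minimal sketch: pull imgur/catbox HQ links out of a CSV export.
# The CSV layout and the host list are assumptions based on this thread.
import csv
import re

# Hosts the thread mentions artists using for HQ versions.
LINK_RE = re.compile(r"https?://(?:i\.)?(?:imgur\.com|files\.catbox\.moe)/\S+")

def extract_links(csv_path):
    """Return every imgur/catbox URL found in any column of the CSV."""
    links = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            for cell in row:
                links.extend(LINK_RE.findall(cell))
    return links
```

The resulting list could then be fed straight to a downloader CLI or a JDownloader link-collector import.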

fake-name commented 4 years ago

> I assume there's no easy way to get a separate csv file for each individual artist without doing some Python scripting?

The above query is for a single artist?

You could either do python stuff, or use a bash script to dump to multiple files. The actual query can be one line, and you can pass a query and database to psql. Realistically, unless I was doing it regularly, I'd probably just hand-munge the queries in a bash script.

Possibly relevant: https://stackoverflow.com/questions/43295406/how-to-copy-to-multiple-csv-files-in-postgresql
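(A rough sketch of that "hand-munged" loop, in Python for illustration: it only *builds* the per-artist `psql` commands so you can eyeball them before running anything. The database name `xa_scraper` is a placeholder, the table/column names are taken from the query above, and note that `\copy` has to fit on a single line when passed to psql, so the query is collapsed here.)

```python
# Dry-run sketch: build one psql \copy command per artist.
# DB name, table names, and output layout are assumptions from this thread.
import shlex

def dump_commands(artists, out_dir="~/Downloads", db="xa_scraper"):
    """Return the psql invocations that would dump each artist to its own CSV."""
    cmds = []
    for name in artists:
        # \copy is a psql meta-command and must be on one line.
        query = (
            "\\copy (SELECT content, content_structured FROM art_item "
            "WHERE artist_id IN (SELECT id FROM scrape_targets "
            f"WHERE artist_name = '{name}')) "
            f"TO '{out_dir}/{name}.csv' CSV HEADER"
        )
        cmds.append(f"psql -d {db} -c {shlex.quote(query)}")
    return cmds

for cmd in dump_commands(["artist-a", "artist-b"]):
    print(cmd)
```

Swapping the `print` for `subprocess.run` would actually execute the dumps, once the names and database match your setup.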

God-damnit-all commented 4 years ago

> The above query is for a single artist?

Yes, I know, but doing hundreds of artists by hand isn't exactly ideal.

> Realistically, unless I was doing it regularly...

That was the idea. I'll probably automate it somehow, but a lot less competently.

> Possibly relevant: https://stackoverflow.com/questions/43295406/how-to-copy-to-multiple-csv-files-in-postgresql

Aha, perfect, a for-loop.

fake-name commented 4 years ago

Whoops, didn't mean to close the entire issue.

Also, now:

```python
print(" dump [export_path] [sitename]")
print("     Dump the database contents for users from a specific site to [export_path]")
```

God-damnit-all commented 4 years ago

Oh nice, thanks! I had a syntax error with the script you pasted above that I was going to ask you about, but this is a lot better. And since it's JSON, I can iterate over it more easily with jq.
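(For reference, combing the JSON dump could look like the sketch below — in Python rather than jq, and with the dump's structure assumed to be a list of objects carrying a `content` field; the real `dump` command's output format may differ, so adjust the key accordingly.)

```python
# Sketch: pull external links out of a JSON dump.
# The dump structure (list of objects with a "content" key) is an assumption.
import json
import re

LINK_RE = re.compile(r"https?://\S+")

def links_in_dump(path):
    """Return every URL found in the 'content' field of each dumped item."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    found = []
    for item in items:
        found.extend(LINK_RE.findall(item.get("content") or ""))
    return found
```

The jq equivalent would be along the lines of filtering `.[].content` and grepping the result.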

I did manage to learn how to properly get a shell script for git-bash from Git for Windows, though. Turns out I'm probably still better off using PowerShell, but it could come in handy later.