fake-name / xA-Scraper

Tweak dump JSON output #2 #83

Closed God-damnit-all closed 4 years ago

God-damnit-all commented 4 years ago

This was not missing from the original dump implementation, I separated it out so it would have its own key.

Nested arrays are very messy, and the fspath contains information that, unfortunately, isn't in any other database key (for NG at least), so I need to capture strings with regex patterns from it.

God-damnit-all commented 4 years ago
if ($PSVersionTable.PSVersion.Major -lt 6) {rm alias:curl}

$JSON = Get-Content -Path "$pwd/dump/ng-someuser-69-dump.json" -Raw | ConvertFrom-Json

ForEach ($art in $JSON) {
    ("$($art[3])" | sls 'href="(https?://files.catbox.moe/.*?)"' -AllMatches | % {
        [array]$dlList += New-Object PsObject -Property @{
            artId = $($art[2] | sls -Pattern $($art[0]+'\\(.*?)_')).matches.groups[1]
            dlUrl = [uri]"$($_.matches.groups[1])"
        }
    })
}

$dlList.ForEach({
    if (!(Test-Path $([String]$_.artId+'_'+[IO.Path]::GetFileName($_.dlUrl)))) {
        curl -sL "$([String]$_.dlUrl)" `
            -o $([String]$_.artId+'_'+[IO.Path]::GetFileName($_.dlUrl))
    }
})

A work-in-progress so you can see exactly why I need the key where it is now.

I could use jq, but because PowerShell is cross-platform, I'm trying to avoid having any dependencies (other than curl, which every OS has now) so everyone can make use of the finished script.

fake-name commented 4 years ago

You're like, 5 levels past the point of "you should be using a proper programming language".

I don't mean to be rude, but you're canonically doing it wrong. Use a tool that properly understands json.

fake-name commented 4 years ago

so everyone can make use of the finished script.

Literally none of my systems have powershell. Basically no one on linux uses it, aside from a few weird windows admins.

fake-name commented 4 years ago

Also, I'm still not sure how you're handling the cases where there is no art (where you'll fail with an index error).

Also, what about when there's more than one file? What, exactly, is in the filename you need? Is it common across all files in a post?

God-damnit-all commented 4 years ago

I don't mean to be rude, but you're canonically doing it wrong. Use a tool that properly understands json.

It does understand json, it's just a much bigger pain in the ass to do for loops within for loops within for loops to get to the proper depth.

Literally none of my systems have powershell. Basically no one on linux uses it, aside from a few weird windows admins.

It is much faster than bash and python for scripts; it uses object-based pipelines. And it's included by default on Kali.

Also, I'm still not sure how you're handling the cases where there is no art (where you'll fail with an index error).

PowerShell doesn't fail on it. The script works, try it if you're curious. It doesn't work without this particular PR though, I don't know why merging it is such a big deal.

Also, what about when there's more than one file? What, exactly, is in the filename you need? Is it common across all files in a post?

The art id before the first underscore.
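For illustration only, here is a minimal Python sketch of that extraction: take whatever sits between the artist's folder name and the first underscore of the file name. The function name `art_id_from_fspath` and the sample path are hypothetical, and the assumed layout (`<artist>` folder, then `<id>_<rest>`) is inferred from the regex in the PowerShell snippet above rather than confirmed against the actual dump.

```python
import re
from typing import Optional


def art_id_from_fspath(fspath: str, artist: str) -> Optional[str]:
    """Return the text between '<artist>' + path separator and the next underscore.

    Assumes (hypothetically) paths shaped like '...<artist>\\<id>_<title>.<ext>'.
    Returns None when the path does not match that shape.
    """
    # re.escape guards against artist names containing regex metacharacters.
    pattern = re.escape(artist) + r"[\\/](.*?)_"
    match = re.search(pattern, fspath)
    return match.group(1) if match else None


# Hypothetical example path, not taken from a real dump.
print(art_id_from_fspath(r"D:\dump\ng\someuser\123456_title.png", "someuser"))
```

Accepting both `\` and `/` as the separator keeps the sketch usable on paths exported from either Windows or Linux.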

fake-name commented 4 years ago

It does understand json, it's just a much bigger pain in the ass to do for loops within for loops within for loops to get to the proper depth.

My point stands. Powershell is bat files with programming bolted on. Python is programming with bat files bolted on.

It is much faster than bash and python for scripts;

For you

(assuming you're just invoking other scripts, and not doing substantial data manipulation (because apparently that's hard enough to be nigh on impossible to write properly))

it uses object-based pipelines

Python (and ruby, and c++, and every other OOP language) just flat out lets you use objects. You can put them in a pipeline if you want.

And it's included by default on Kali.

Does anyone actually use Kali? I mean, besides pentesters explicitly trying to break into Windows (which explains the windows tooling).

PowerShell doesn't fail on it.

No, the export will fail, as you're indexing [0] of an empty list.

list(pfile.fspath for pfile in post.files)[0] is [][0] for contexts where there are no post.files.

This is fairly common for people who post text-only content.
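As a sketch of one way to avoid that failure (not necessarily how the project ended up handling it), the first path can be fetched with a default instead of indexing `[0]`. The `post.files` and `pfile.fspath` names come from the line quoted above; the helper name `first_fspath` and the stand-in objects are assumptions made purely so the example runs on its own.

```python
from types import SimpleNamespace


def first_fspath(post):
    """Return the first file path of a post as a string, or None if it has no files."""
    # next() with a default never raises on an empty sequence, unlike list(...)[0].
    first = next((pfile.fspath for pfile in post.files), None)
    return None if first is None else str(first)


# Stand-ins for database rows; the real objects are SQLAlchemy models.
text_only = SimpleNamespace(files=[])
with_file = SimpleNamespace(files=[SimpleNamespace(fspath="artist/123_image.png")])

print(first_fspath(text_only))
print(first_fspath(with_file))
```

A text-only post then exports with an empty path value rather than aborting the whole dump.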

The art id before the first underscore.

Hmmm. Is that not in the URL? For which site?

God-damnit-all commented 4 years ago

My point stands.

Your point was that it doesn't properly understand Json, but it does.

For you.

I meant in terms of performance. It will be faster. It iterates over a lot of files and data very, very quickly. That's why it's getting a high adoption rate with people who have to handle huge file systems and parse very large files.

The condescension about my particular way of doing things really bugs me because it seems to me like you're the same way.

No, the export will fail, as you're indexing [0] of an empty list.

I guess it's failing silently then? I've already been using this code, I wouldn't have been able to use my script without it. I have an artist with no submissions in my newgrounds list and there was no error thrown. I wouldn't have pushed it if there were.

fake-name commented 4 years ago

The condescension about my particular way of doing things really bugs me because it seems to me like you're the same way.

That's not my intention. Python works well for me, but I wouldn't claim it's better for everyone. I think I can probably say with certainty that it's not the best way to do, well, anything, but it makes sense in my head, and I haven't come up with a better way (though I'm certainly open to alternatives).

I may indeed be somewhat derisive about PowerShell, but that's specifically about PowerShell only. To be fair, I'd probably feel similarly if you were trying to do this in bash, zsh, or another shell-first language.

Basically (and this is very much my opinion), you're trying to do somewhat complex stuff in a programming language where one of the primary design criteria is to also be usable as a shell. This means there are a LOT of design compromises they have to make, which in turn render it kind of miserable for actual, practical programming.

There are similar compromises for other things. Take Matlab, as an example. Its design is heavily driven by the need to match mathematical notation that's commonly used. As such, you can do super fancy math stuff with arrays and matrices, but it makes it somewhat miserable for doing actual substantial programming. Sure, it's technically object oriented, but writing object oriented code in it is an exercise in painful boilerplate and fiddly un-obvious syntax.

Basically, what I see is that you're struggling with the implementation of something that would be quite straightforward in any sane language (python, ruby, javascript, etc...), and yet refusing to use tools that are much more appropriate for the issue at hand.

I guess it's failing silently then? I've already been using this code, I wouldn't have been able to use my script without it.

For which site? If you don't have art items without files (e.g. DA, etc), it might be fine.

fake-name commented 4 years ago

Literally first item for twitter:

durr@newxad /m/S/xaDownloader> python3 -m manage dump /media/Storage/export twit
Setting up loggers....
done
Setup
initialized manager
Dumping contents for site twit to folder <somewhere>
Creating pool
INFO: Creating engine for process! Engine name: 'MainProcess-MainThread'
Found 201 items!
Artist posts:   0%|                                                                                                          | 0/13716 [00:00<?, ?it/s]
Artists:   0%|                                                                                                                 | 0/201 [00:02<?, ?it/s]
Main - CRITICAL - Uncaught exception!
Main - CRITICAL - Uncaught exception
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/Scripts/xaDownloader/manage/__main__.py", line 105, in <module>
    go()
  File "/media/Scripts/xaDownloader/manage/__main__.py", line 101, in go
    three_arg_go(sys.argv[1], sys.argv[2], sys.argv[3])
  File "/media/Scripts/xaDownloader/manage/__main__.py", line 63, in three_arg_go
    db_manage.export_db_contents(to_path=param_1, site_name=param_2)
  File "/media/Scripts/xaDownloader/manage/db_manage.py", line 304, in export_db_contents
    str(list(pfile.fspath for pfile in post.files)[0]),
IndexError: list index out of range

God-damnit-all commented 4 years ago

Basically (and this is very much my opinion), you're trying to do somewhat complex stuff in a programming language where one of the primary design criteria is to also be usable as a shell. This means there are a LOT of design compromises they have to make, which in turn render it kind of miserable for actual, practical programming.

The reason I decided to pursue this in PowerShell is that Python does not play well with anything outside of Python. In PowerShell, passing a huge list of URLs to lots of different programs is no issue whatsoever, and the only alternative is to support everything JDownloader can do within xA-Scraper, or bite the bullet and use something that operates within the shell.

PowerShell is actually very similar in a lot of ways to Python in terms of style and the way it handles variables and its use of objects.

Literally first item for twitter:

Yeah, I get it, but you didn't say 'it'll cause an index error' when you closed it, you said (and I'm paraphrasing here): "PowerShell is fucking shit and you're shit for using it." So, that caught my attention first and foremost.

I'm not looking forward to wrestling with sqlalchemy again. I have already developed a very deep loathing for InstrumentedLists.

fake-name commented 4 years ago

PowerShell is actually very similar in a lot of ways to Python in terms of style and the way it handles variables and its use of objects.

I mean, they're both programming languages. Written with text.

Considering what that article qualifies as commonalities, I think you'd be hard pressed to find a programming language that they'd claim isn't similar.

Yeah, I get it, but you didn't say 'it'll cause an index error' when you closed it, you said (and I'm paraphrasing here): "PowerShell is fucking shit and you're shit for using it."

I think this basically boils down to an ongoing issue entirely on my part, which is I'm fucking terrible at communicating online.

God-damnit-all commented 4 years ago

Considering what that article qualifies as commonalities, I think you'd be hard pressed to find a programming language that they'd claim isn't similar.

I suppose so, I was paying more attention to the actual similarities and not the more reaching comparisons. C#, which I probably have the most experience in, doesn't feel very much like Python or PowerShell at all, and Rust is practically a space language to me.

fake-name commented 4 years ago

The reason I decided to pursue this in PowerShell is because Python does not play well with anything outside of Python.

Uh, really? I mean, the few times I've needed to reach out for managing external tools, I've had no issues.

I suppose so, I was paying more attention to the actual similarities and not the more reaching comparisons. C#, which I probably have the most experience in, doesn't feel very much like Python or PowerShell at all, and Rust is practically a space language to me.

I can see that. To be more specific, I think it's more that it's very much like any other interpreted language. Rust is trying to be a better C++ (as is C#, really), and that's a very different world.


and the only alternative is to support everything JDownloader can do within xA-Scraper, or bite the bullet and use something that operates within the shell.

A python interface to jdownloader is actually something I've seriously been thinking about for a while. I spent some time at one point trying to figure out how to invoke java directly from python in hopes of using their url unshortening components, which was largely stymied by somewhat broken libraries and a lack of time on my part.

At this point, I think I'd probably just try to interact with the jdownloader web interface thing.

I'm curious what shell things you use on a regular basis? I've not personally found anything I generally want to do that can't as easily be done with python.

fake-name commented 4 years ago

Yeah, I get it, but you didn't say 'it'll cause an index error' when you closed it, you said (and I'm paraphrasing here): "PowerShell is fucking shit and you're shit for using it."

For some (not really excusing) context, I did say it'll cause an error on the original PR, before I reverted that line.

My initial reaction here was basically "why are you trying to put that (broken) thing back?"

God-damnit-all commented 4 years ago

Uh, really? I mean, the few times I've needed to reach out for managing external tools, I've had no issues.

Python has never had the best compatibility with Windows to begin with, I'm glad it's able to perform within its own bubble these days at all. Even still, I have to wrestle around with things like the virtual environments or portable installations, because Python updates roughly every other day and everything is broken now and oh pytorch requires this specific version of CUDA no one uses anymore and only 3.8 has this bugfix but if you upgrade you won't be able to use this package and everything is constantly changing syntax so you just have to put checks in checks in checks in checks and

Anyway that all aside, if you have had good experiences with desktop environments on Linux, I venture to say you're one of the lucky ones. Even the most hardcore Linux enthusiasts I've come across have to admit that anything involving a GUI is a roll of the dice and that the limitations of the kernel are a pain in the ass to deal with.

For some (not really excusing) context, I did say it'll cause an error on the original PR, before I reverted that line.

I wasn't really sure what it was in reference to, at the time. Looking back on it, it makes more sense now.

God-damnit-all commented 4 years ago

Here's a good example of something that is a complete pain in the ass to do in just about anything that isn't PowerShell:

Get-ChildItem -Recurse -Force -Directory | 
    Sort-Object -Property FullName -Descending |
    Where-Object { $($_ | Get-ChildItem -Force | Select-Object -First 1).Count -eq 0 } |
    Remove-Item -Force

This is removing empty directories (starting from the working directory) with recursion starting from the deepest directories and working outward. I've looked for similar solutions from other shell scripts and more often than not, their solution was to just run it multiple times until it gets them all. I've used it on my Linux server many times. It's very niche, but very much worth a small install with barely any dependencies.

I mean, I didn't try it on Python, but I'm always afraid if I let Python remove anything, a change in syntax in one of the libraries will delete System32.

fake-name commented 4 years ago

This is removing empty directories (starting from the working directory) with recursion starting from the deepest directories and working outward. I've looked for similar solutions from other shell scripts and more often than not, their solution was to just run it multiple times until it gets them all. I've used it on my Linux server many times. It's very niche, but very much worth a small install with barely any dependencies.

That sounds exactly like os.walk() with topdown = False.
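A rough sketch of what that could look like, assuming the goal matches the PowerShell pipeline above (remove empty directories below a starting point, deepest first, leaving the starting directory itself alone). The function name `remove_empty_dirs` is made up for the example.

```python
import os


def remove_empty_dirs(root):
    """Delete empty directories under root, deepest first; return the removed paths."""
    removed = []
    # topdown=False yields each directory only after all of its subdirectories.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue  # never remove the starting directory itself
        # Re-check on disk: the names os.walk collected may already be stale
        # because children were removed earlier in this same pass.
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```

Because children are visited before their parents, a directory that only contained empty directories is itself empty by the time it is checked, so a single pass is enough.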

I mean, I didn't try it on Python, but I'm always afraid if I let Python remove anything, a change in syntax in one of the libraries will delete System32.

That's probably about as likely as a change in powershell library syntax deleting System32. Lots of the system management shit in at least debian/ubuntu is written in python, so if something changed like that, it'd probably explode half the debian based boxes out there.

God-damnit-all commented 4 years ago

That sounds exactly like os.walk() with topdown = False.

Doesn't everyone use pathlib now because Python's built-in path libraries are really bad?

That's probably about as likely as a change in powershell deleting System32.

I was joking.

fake-name commented 4 years ago

I wasn't really sure what it was in reference to, at the time. Looking back on it, it makes more sense now.

Oh, I certainly was horribly unclear. Sorry about that.

Doesn't everyone use pathlib now because Python's built-in path libraries are really bad?

Beats me. I've never had problems with them, but I've also got a lot of habits that are rooted in semi-bad practice from the python 2.6 days, and are probably just flat out stupid at this point.

Also, apparently pathlib is now a built in library, so ¯\_(ツ)_/¯.

os.path is fine as long as you never need to worry about Windows, or about a system whose locale isn't UTF-8.

God-damnit-all commented 4 years ago

Huh, not sure when that happened, but that's good. It's definitely the most reliable thing I've used on Python. It Just Works.

fake-name commented 4 years ago

Python does have a pretty cool habit of co-opting libraries that are super nice as part of the stdlib.