fake-name / xA-Scraper


Restructure fetch-all & fix multiprocessing on Windows #56

Closed · God-damnit-all closed this 4 years ago

God-damnit-all commented 4 years ago

I don't like having these changes all in one commit, but they're all a bit dependent on one another.

The way this works: if a site is missing from your settings.py, it's simply not added to the JOBS array. The new 'for' loop in main.py recreates the current JOBS array format, but I did have to normalize the class names to avoid a ton of 'if' statements: https://github.com/herp-a-derp/xA-Scraper/blob/0b33aa9eb8207b4d172045ba2f09d91d89c2149e/main.py#L55-L62
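A minimal, self-contained sketch of that idea (every name here is a stand-in for illustration, not the actual xA-Scraper code; see the linked main.py for the real loop):

```python
# Sketch only: sites absent from settings never reach JOBS, and each plugin
# class is looked up by a normalized name instead of a chain of per-site
# 'if' statements.

class DaScraper:  # stand-ins for the real per-site plugin classes
    pass

class FaScraper:
    pass

# Stand-in for settings.py; "fa" is deliberately missing.
settings = {
    "da": {"username": "user", "password": "hunter2"},
}

SITE_KEYS = ["da", "fa"]

JOBS = []
for key in SITE_KEYS:
    if key not in settings:
        continue  # site not configured -> simply not added to JOBS
    # Normalize the key into a class name ("da" -> "DaScraper") so one
    # loop handles every site.
    JOBS.append((key, globals()[key.capitalize() + "Scraper"]))

print(JOBS)  # -> [('da', <class '__main__.DaScraper'>)]
```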

As for the Windows multiprocessing: Python on Linux defaults to the old 'fork' start method. Windows isn't compatible with that and has to use the newer 'spawn' method. (Eventually 'spawn' should be forced on Linux too for better compatibility, but that's for another day.)

There were two caveats. The easy one was simply calling multiprocessing.freeze_support() in __main__.py. That call does literally nothing on Linux, so an OS check isn't necessary.
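For reference, both standard-library calls involved would look roughly like this (a sketch; main() is a stand-in for the scraper's entry point):

```python
import multiprocessing

def main():
    ...  # stand-in for the scraper entry point

if __name__ == "__main__":
    # No-op on Linux; needed on Windows for frozen executables, and it's
    # harmless to call unconditionally.
    multiprocessing.freeze_support()
    # Windows always uses 'spawn'; forcing it on Linux as well would
    # surface spawn-only bugs early (the "another day" item above).
    multiprocessing.set_start_method("spawn")
    main()
```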

The real pain in the ass was this bit: https://github.com/herp-a-derp/xA-Scraper/blob/08975aee29470d062e319b262b225ac45d245db8/manage/scrape_manage.py#L18-L21

Having these lines auto-execute when the module is imported breaks multiprocessing on Windows (and is apparently frowned upon in Python 3 for other reasons in any case): under 'spawn', every child process re-imports the module, so the import-time code runs again in each worker. So now it either gets the namespace directly from flags.namespace (if, for instance, you ran fetch da) or the namespace is passed to it through do_fetch_all.
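The shape of the fix, sketched with stand-in bodies (flags, namespace, and do_fetch_all come from the PR description; get_namespace and everything else here is hypothetical):

```python
class flags:
    # Stand-in for the module holding the parsed CLI arguments.
    namespace = None

def get_namespace(namespace=None):
    # Nothing runs at import time any more: the namespace is either passed
    # in explicitly (the do_fetch_all path) or read from the parsed CLI
    # flags on demand (the 'fetch da' path).
    return namespace if namespace is not None else flags.namespace

def do_fetch_all(namespace):
    ns = get_namespace(namespace)
    ...  # kick off the per-site scrape jobs with ns
```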

It all seems to work, but there could be unintended consequences I'm not aware of; the code is difficult for me to navigate, and I'm quite new to Python. Hopefully it doesn't break anything on Linux.

herp-a-derp commented 4 years ago

The multiprocessing stuff is ooooooold, and frankly was a bad design in like, 2010 when I wrote it and had basically zero idea what I was doing.

I really want to get away from using the multiprocessing manager.

God-damnit-all commented 4 years ago

Well, this doesn't make it any worse? It's just becoming a hassle to run the scrapers independently.

Really, the multiprocessing changes are probably the most minor part of this PR; all that removing those couple of bits would do is make it impossible for Windows users to run it.

herp-a-derp commented 4 years ago

No, I mean going to just using threading only. We're not doing anything here that really needs to actually fork. The heavy lifting is basically just networking and LXML, and both of those release the GIL, so multiprocessing is kind of silly.
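A threading-only version would be fairly small. A sketch, assuming each scraper exposes some fetch-all entry point (run_scraper and the site keys are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def run_scraper(name):
    # Stand-in for one scraper module's fetch-all entry point. The real
    # work (network I/O, lxml parsing) releases the GIL, so threads run
    # it concurrently.
    return f"{name}: done"

if __name__ == "__main__":
    sites = ["da", "fa", "hf"]
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        for result in pool.map(run_scraper, sites):
            print(result)
```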

God-damnit-all commented 4 years ago

> No, I mean going to just using threading only. We're not doing anything here that really needs to actually fork. The heavy lifting is basically just networking and LXML, and both of those release the GIL, so multiprocessing is kind of silly.

Wouldn't going threading-only mean that if one module hits an error (often due to a backend change on one of the websites), it would crash the entire program instead of just that specific module?

fake-name commented 4 years ago

No, thread execution is logically independent: an unhandled exception only kills the thread it was raised in. Barring an interpreter crash (extremely unlikely; I've only had that happen when writing custom C++ extensions), it basically doesn't happen.
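A quick demonstration of that isolation (scrape and the site names are made up): an exception raised inside one job surfaces from its own future without touching the others.

```python
from concurrent.futures import ThreadPoolExecutor

def scrape(site):
    if site == "broken":
        # Simulate one module failing after a site backend change.
        raise RuntimeError(f"{site}: layout changed")
    return f"{site}: ok"

with ThreadPoolExecutor() as pool:
    futures = {pool.submit(scrape, s): s for s in ["da", "broken", "fa"]}
    for fut, site in futures.items():
        try:
            print(fut.result())
        except RuntimeError as exc:
            # Only this job failed; the other scrapers still completed.
            print(f"{site} failed: {exc}")
```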

fake-name commented 4 years ago

I got irritated with trying to manage a public and a private repo, so I merged them back. I'm not sure how this got closed.

fake-name commented 4 years ago

I'm doing some really horrible shit with git. Hold on.

God-damnit-all commented 4 years ago

It would probably be best if I resubmitted this PR on a fresh fork. I'll do that later tonight.

fake-name commented 4 years ago

I attached the public fork tree; I'm not sure why I can't reopen the issue.

fake-name commented 4 years ago

Sorry about all the fucking around.

God-damnit-all commented 4 years ago

It's fine. I think this time I'm just going to do a PR focused purely on the way it parses the settings; I think I know a better way to do it.