God-damnit-all closed this 4 years ago
The multiprocessing stuff is ooooooold, and frankly was a bad design in like, 2010 when I wrote it and had basically zero idea what I was doing.
I really want to get away from using the multiprocessing manager.
Well this doesn't make it any worse? It's just becoming a hassle to run the scrapers independently.
Really, the multiprocessing changes are probably the most minor part of this PR; all removing those couple of bits would do is make it impossible for Windows users to run the scrapers.
No, I mean going to just using threading only. We're not doing anything here that really needs to actually fork. The heavy lifting is basically just networking and LXML, and both of those release the GIL, so multiprocessing is kind of silly.
Wouldn't switching to threading only mean that if one module hits an error (often due to a backend change on one of the websites), it would crash the entire program instead of just that specific module?
No, the thread executions are logically independent. Barring an interpreter crash (extremely unlikely; I've only seen that happen when writing custom C++ extensions), an exception in one thread doesn't take down the others.
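To illustrate the point, here's a minimal sketch (the scraper names and functions are made up, not the real project code): each module runs in its own thread with a catch-all around its entry point, so a site breakage in one module leaves the others untouched.

```python
import threading

results = {}

def run_module(name, func):
    # Each scraper runs in its own thread. An unhandled exception
    # here is caught and recorded; it cannot kill the whole process.
    try:
        results[name] = func()
    except Exception as exc:
        results[name] = f"failed: {exc}"

def broken_scraper():
    raise RuntimeError("backend changed")  # simulated site breakage

def working_scraper():
    return "ok"

threads = [
    threading.Thread(target=run_module, args=("broken", broken_scraper)),
    threading.Thread(target=run_module, args=("working", working_scraper)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # the working module still completed
```

Even without the try/except, an uncaught exception in a non-main thread only prints a traceback and ends that thread; the rest of the interpreter keeps running.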
I got irritated with trying to manage a public and a private repo, so I merged them back. I'm not sure how this got closed.
I'm doing some really horrible shit with git. Hold on.
It would probably be best if I resubmitted this PR on a fresh fork. I'll do that later tonight.
I attached the public fork tree; I'm not sure why I can't reopen the issue.
Sorry about all the fucking around.
It's fine, I think this time I'm just going to do a PR focused purely on the way it parses the settings. I think I know a better way to do it.
I don't like having these changes all in one commit but they're all a bit dependent on one another.
The way this works: if a site is missing from your settings.py, it's simply not added to the JOBS array. The new 'for' loop in main.py recreates the current JOBS array format, but I did have to normalize the class names to avoid a ton of 'if' statements: https://github.com/herp-a-derp/xA-Scraper/blob/0b33aa9eb8207b4d172045ba2f09d91d89c2149e/main.py#L55-L62
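Roughly, the idea looks like this (a sketch with made-up names; `SCRAPER_CLASSES` and the settings dict here stand in for the real settings.py and scraper classes, which I haven't reproduced):

```python
# Map each settings key to a normalized scraper class name, so one
# loop can build JOBS instead of a chain of per-site 'if' statements.
SCRAPER_CLASSES = {
    "da": "DaScraper",
    "fa": "FaScraper",
}

# Stand-in for the sites the user actually enabled in settings.py.
settings = {"SITES": ["da"]}

JOBS = []
for key in settings["SITES"]:
    cls_name = SCRAPER_CLASSES.get(key)
    if cls_name is None:
        continue  # site missing from settings -> simply not added
    JOBS.append((key, cls_name))

print(JOBS)
```

The normalization is what makes the loop possible: once every scraper class follows the same naming scheme, a single lookup table replaces the conditionals.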
As for the Windows multiprocessing: Python on Linux defaults to the old 'fork' start method. Windows isn't compatible with that and uses the newer 'spawn' method instead. (Eventually 'spawn' should be forced on Linux too for better compatibility, but that's for another day.)
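For what it's worth, forcing 'spawn' on Linux is a one-liner; a minimal sketch of both calls mentioned below:

```python
import multiprocessing

# freeze_support() does literally nothing unless running inside a
# frozen Windows executable, so calling it unconditionally is safe.
multiprocessing.freeze_support()

# Forcing 'spawn' on Linux makes start-method behaviour match
# Windows, which has no fork() and always spawns fresh processes.
multiprocessing.set_start_method("spawn", force=True)
print(multiprocessing.get_start_method())  # -> spawn
```

Note that set_start_method() can normally only be called once per program, which is why force=True is used here for the demo.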
There were two caveats. The easy one was simply declaring
multiprocessing.freeze_support()
in __main__.py. This call does literally nothing on Linux, so an OS check isn't necessary. The real pain in the ass was this bit: https://github.com/herp-a-derp/xA-Scraper/blob/08975aee29470d062e319b262b225ac45d245db8/manage/scrape_manage.py#L18-L21
Having these lines auto-execute when the module is imported breaks multiprocessing on Windows (and is apparently frowned upon in Python 3 for other reasons anyway). So now it either gets the namespace directly from flags.namespace (if, for instance, you ran
fetch da
) or the namespace is passed in through do_fetch_all. It all seems to work, but there could be unintended consequences I'm not aware of; the code is difficult for me to navigate and I'm quite new to Python. Hopefully it doesn't break anything on Linux.
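The shape of the fix is roughly this (a hypothetical sketch, not the actual scrape_manage.py; the argument names and `build_namespace` helper are invented, only `do_fetch_all` comes from the real code): parsing moves out of module scope into a function, so importing the module, as 'spawn' does in every child process, no longer re-runs it.

```python
import argparse

def build_namespace(argv=None):
    # Parsing now happens only when explicitly requested, never as a
    # side effect of importing this module (which 'spawn' re-imports
    # in every child process on Windows).
    parser = argparse.ArgumentParser()
    parser.add_argument("command", nargs="?")
    parser.add_argument("target", nargs="?")
    return parser.parse_args(argv)

def do_fetch_all(namespace=None):
    # Accept the namespace from the caller, or build it from the
    # command-line flags (e.g. a 'fetch da' invocation) if none given.
    if namespace is None:
        namespace = build_namespace()
    return namespace

ns = do_fetch_all(build_namespace(["fetch", "da"]))
print(ns.command, ns.target)  # -> fetch da
```

With nothing executing at import time, the spawned children can import the module safely, which is exactly the constraint Windows multiprocessing imposes.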