kmike closed this issue 7 years ago
The current corpus is too big (about 1 GB compressed), unfortunately. I'll check if it can be made smaller, and I'll put it on S3 anyway. I'm re-training the model now; I'll push it to a branch at first.
Yeah, 1 GB is way too much. S3 may not be a good long-term solution because it costs money; maybe we could use http://academictorrents.com/ or something like that? Someone would still need to seed it, though.
Do you recall how long it takes to run a crawl and get a similar dataset?
Pushed the model in d066986
> Do you recall how long it takes to run a crawl and get a similar dataset?
The dataset is 117484 pages, so at 500 requests per minute the crawl should take only about 4 hours (117484 / 500 ≈ 235 minutes). But I have a note that crawling got much slower after some time due to scheduling issues which I never solved, so the actual time was more than a day, I think.
It currently works with scikit-learn 0.18+, although the model is still serialized with joblib; see issue #13 about that, and #12 about training a classifier from scratch.
@lopuhin if the problem with crawling speed is the usual "all requests returned by the scheduler are for the same domain, we hit downloader limits and do nothing", then something like https://github.com/TeamHG-Memex/linkdepth/blob/master/queues.py could help. To use it, set the 'scheduler_slot' request.meta key (like this: https://github.com/TeamHG-Memex/linkdepth/blob/b5c18819f61a25e586347c04c116bcabc44067af/linkdepth.py#L98) and tell Scrapy to use these custom queues:
SCHEDULER_PRIORITY_QUEUE='queues.RoundRobinPriorityQueue'
SCHEDULER_DISK_QUEUE='queues.DiskQueue'
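For illustration, here is a minimal sketch of how that wiring could look in a spider. It assumes queues.py from the linkdepth repo is copied into the project; the spider itself and its names are hypothetical:

```python
from urllib.parse import urlparse

import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider; the queue classes are assumed to come from
    # queues.py in the linkdepth repo, placed on the Python path.
    name = "example"
    custom_settings = {
        "SCHEDULER_PRIORITY_QUEUE": "queues.RoundRobinPriorityQueue",
        "SCHEDULER_DISK_QUEUE": "queues.DiskQueue",
    }

    def parse(self, response):
        for href in response.css("a::attr(href)").extract():
            request = response.follow(href, callback=self.parse)
            # Group requests by domain so the round-robin queue rotates
            # between domains instead of draining one domain at a time.
            request.meta["scheduler_slot"] = urlparse(request.url).netloc
            yield request
```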
Another option is to use frontera; it uses a thing called OverusedBuffer to fight this issue.
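For reference, a hedged sketch of pointing Scrapy at frontera, following frontera's documented Scrapy integration; the exact module paths may differ between frontera versions, so check the docs for the version in use:

```python
# settings.py - frontera's Scrapy integration; OverusedBuffer then
# handles requests that would exceed per-domain downloader limits.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}
```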
Yes, I think that was the problem. Thanks for the pointers!
For me, the model fails to load:
I think it makes sense to either upgrade the model to use scikit-learn 0.18.1, or to put the training corpus in the repository, so that the model can be re-built on the client.
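To illustrate the failure mode, a minimal sketch of loading a joblib-pickled model; the model path here is hypothetical, and the point is that pickles produced under one scikit-learn version often fail to unpickle under another:

```python
import joblib  # joblib.load is how scikit-learn models are typically deserialized
import sklearn

MODEL_PATH = "clf.joblib"  # hypothetical path to the serialized model

try:
    clf = joblib.load(MODEL_PATH)
except Exception as exc:
    # Internal class layouts change between scikit-learn releases, so a
    # pickle made under one version may not load under another.
    raise RuntimeError(
        "Failed to load model (running scikit-learn %s); it was likely "
        "serialized with a different version" % sklearn.__version__
    ) from exc
```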