soumendrak closed this issue 2 years ago
@soumendrak I have added more documentation at https://github.com/AI4Bharat/webcorpus/pull/5. I've asked some people who use webcorpus to review it before we merge it, but you can use it as a starting point. Let us know if anything is ambiguous or if you get errors while following it.
@gowtham1997 Thanks for adding the instructions. Can you please also explain how to crawl all the sources defined in the source CSV files with a single command?
Missed replying to this.
I've added get_crawl_commands.py and updated the documentation. The script gives you a list of crawling commands, one for every source, and you can run n sources at a time depending on how many cores your system has.
For a 2-core machine, we generally run 20 concurrent processes, but you can experiment with fewer or more.
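If it helps, here is a rough sketch of how the "n at a time" part could be done in Python. It is not part of webcorpus. It assumes you have saved the generated commands to a text file with one shell command per line; that file format, the file name `crawl_commands.txt`, and the helper names `run_one`, `run_commands` and `load_commands` are all assumptions made up for this illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(command):
    """Run a single shell command and return (command, exit code)."""
    completed = subprocess.run(command, shell=True)
    return command, completed.returncode


def run_commands(commands, max_concurrent=20):
    """Run the commands with at most max_concurrent active at once.

    Threads are enough here because each worker only waits on a
    child process; results come back in the same order as the input.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_one, commands))


def load_commands(path):
    """Read one command per line, skipping blank lines.

    Assumption: the crawl commands were saved to a plain text file.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

You would then call something like `run_commands(load_commands("crawl_commands.txt"), max_concurrent=20)` and check the returned exit codes for any source that failed. Lower `max_concurrent` if the machine starts to struggle.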
Please add an example to the README showing how to run the crawler on all the sources for a particular language. @gowtham1997, @divkakwani