divkakwani / webcorpus

Generate large textual corpora for almost any language by crawling the web

Add an example #19

Closed soumendrak closed 2 years ago

soumendrak commented 2 years ago

Please add an example on the README page on how to run the crawler with all the sources for a particular language. @gowtham1997, @divkakwani

gowtham1997 commented 2 years ago

@soumendrak I have added more documentation at https://github.com/AI4Bharat/webcorpus/pull/5. I've asked some people who use webcorpus to review it before we merge it, but you can use it as a starting point. Let us know if something is ambiguous or if you get errors while following it.

soumendrak commented 2 years ago

@gowtham1997 Thanks for adding the instructions. Can you please also add how to crawl all the sources defined in the source CSV files with a single command?

gowtham1997 commented 2 years ago

Missed replying to this.

I've added get_crawl_commands.py and updated the documentation. This gives you a list of crawl commands, one for every source, and you can run n of them at a time depending on how many cores your system has.

For a 2-core machine, we generally run 20 concurrent processes, but you can experiment with fewer or more.
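
Running n commands at a time can be sketched roughly as below. This is a minimal illustration, not part of webcorpus: it assumes the generated commands have been saved one per line in a text file, and the names `run_commands`, `load_commands`, and `max_concurrent` are hypothetical. The actual output format of get_crawl_commands.py is described in the updated documentation.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_commands(commands, max_concurrent=20):
    """Run shell commands, at most `max_concurrent` at a time.

    Returns a list of (command, exit_code) pairs in input order.
    """
    def run_one(command):
        # Each worker thread blocks on one child process, so the pool
        # size caps how many commands run at once.
        completed = subprocess.run(command, shell=True)
        return command, completed.returncode

    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_one, commands))


def load_commands(path):
    """Read one command per line, skipping blank lines (assumed format)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

With the commands saved to a file of your choosing, `run_commands(load_commands(path), max_concurrent=20)` would start up to 20 crawls at once and report each command's exit code, so failed sources can be rerun.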