divkakwani webcorpus issues - Githubissues

divkakwani / webcorpus

Generate large textual corpora for almost any language by crawling the web

Other

8 stars 11 forks

issues

boilerpipe3 is not being installed

#20 saikat2901 opened 1 year ago
0
Add an example

#19 soumendrak closed 2 years ago
3
Odia: More newspaper sources added

#18 soumendrak closed 2 years ago
1
generate jsons on the specified output path

#17 gowtham1997 closed 3 years ago
0
make art processor multi-process

#16 divkakwani closed 4 years ago
0
Scrapyd setup

#15 divkakwani closed 4 years ago
0
Distributed Setup

#14 divkakwani opened 4 years ago
3
Consolidate Data Sources

#13 divkakwani opened 4 years ago
0
Record Crawl Events in a File

#12 divkakwani opened 4 years ago
0
Create a Dashboard

#11 divkakwani opened 4 years ago
0
Other possible sources of code & data

#10 GokulNC closed 4 years ago
2
Use metadata derived from the sitemap

#9 GokulNC opened 5 years ago
1
Sitemap.xml might not cover all the articles

#8 GokulNC closed 4 years ago
1
Handle recursive sitemaps

#7 GokulNC opened 5 years ago
4
Handle robots.txt if sitemap.xml fails

#6 GokulNC closed 4 years ago
1
Split the datasets table in README into 2

#5 GokulNC closed 4 years ago
1
Unit tests

#4 GokulNC opened 5 years ago
1
Python2 compatibility

#3 GokulNC closed 4 years ago
1
Move all scrapy global settings to settings.py

#2 GokulNC opened 5 years ago
2
Dev

#1 divkakwani closed 5 years ago
0