divkakwani/webcorpus
Generate large textual corpora for almost any language by crawling the web
Other
8 stars, 11 forks
issues
#20 boilerpipe3 is not being installed (saikat2901, opened 1 year ago, 0 comments)
#19 Add an example (soumendrak, closed 2 years ago, 3 comments)
#18 Odia: More newspaper sources added (soumendrak, closed 2 years ago, 1 comment)
#17 generate jsons on the specified output path (gowtham1997, closed 3 years ago, 0 comments)
#16 make art processor multi-process (divkakwani, closed 4 years ago, 0 comments)
#15 Scrapyd setup (divkakwani, closed 4 years ago, 0 comments)
#14 Distributed Setup (divkakwani, opened 4 years ago, 3 comments)
#13 Consolidate Data Sources (divkakwani, opened 4 years ago, 0 comments)
#12 Record Crawl Events in a File (divkakwani, opened 4 years ago, 0 comments)
#11 Create a Dashboard (divkakwani, opened 4 years ago, 0 comments)
#10 Other possible sources of code & data (GokulNC, closed 4 years ago, 2 comments)
#9 Use metadata derived from the sitemap (GokulNC, opened 5 years ago, 1 comment)
#8 Sitemap.xml might not cover all the articles (GokulNC, closed 4 years ago, 1 comment)
#7 Handle recursive sitemaps (GokulNC, opened 5 years ago, 4 comments)
#6 Handle robots.txt if sitemap.xml fails (GokulNC, closed 4 years ago, 1 comment)
#5 Split the datasets table in README into 2 (GokulNC, closed 4 years ago, 1 comment)
#4 Unit tests (GokulNC, opened 5 years ago, 1 comment)
#3 Python2 compatibility (GokulNC, closed 4 years ago, 1 comment)
#2 Move all scrapy global settings to settings.py (GokulNC, opened 5 years ago, 2 comments)
#1 Dev (divkakwani, closed 5 years ago, 0 comments)