coherentdigital/coherencebot
Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0
0 stars, 0 forks
issues
#15 Allow CDN domain in seed configuration (PeterCiuffetti, opened 2 years ago, 0 comments)
#14 Update use of Collections API to receive a collection UUID instead of a Collection Slug (PeterCiuffetti, closed 3 years ago, 1 comment)
#13 Migrate document selection from ES Queries to CoherenceBot (PeterCiuffetti, closed 3 years ago, 1 comment)
#12 Develop a test tool for checking Org seed URLs and Published PDFs (PeterCiuffetti, closed 3 years ago, 1 comment)
#11 Export collection metadata to S3 directly from CoherenceBot (PeterCiuffetti, opened 3 years ago, 1 comment)
#10 Store the HTML page URL pointing at the PDF in the artifact metadata (PeterCiuffetti, closed 3 years ago, 2 comments)
#9 Optimize the role of HTML URLs in CoherenceBot (PeterCiuffetti, opened 3 years ago, 1 comment)
#8 Permit the management of Collection Configs and Seed URLs via a Commons API (PeterCiuffetti, closed 3 years ago, 5 comments)
#7 Research and Document Elastic Map Reduce sizing, cost of operation and other dev ops parameters (PeterCiuffetti, closed 3 years ago, 2 comments)
#6 Deploy multiple CoherenceBot clusters, one per major region (PeterCiuffetti, closed 3 years ago, 2 comments)
#5 Export runtime monitoring to a dashboard (PeterCiuffetti, opened 3 years ago, 1 comment)
#4 Export CoherenceBot crawl statistics to a dashboard (PeterCiuffetti, opened 3 years ago, 1 comment)
#3 Use date_published field instead of date_updated (avorio, closed 3 years ago, 2 comments)
#2 Weeeeiiiiirrrrdddd repetition in the title field (avorio, closed 2 years ago, 5 comments)
#1 Remove file extension and common prefixes from titles (avorio, opened 3 years ago, 1 comment)