arquivo pwa-technologies issues - Githubissues

arquivo / pwa-technologies

Arquivo.pt's main goal is to preserve and provide access to web content that is no longer available online. During the development of the PWA IR (information retrieval) system we faced limitations in search speed, result quality, scalability and usability. To cope with this, we modified the archive-access project (http://archive-access.sourceforge.net/) to support our web archive IR requirements. The code of Nutchwax, Nutch and Wayback was adapted to meet these requirements. Several optimizations were added, such as simplifications in the way document versions are searched, and several bottlenecks were resolved. The PWA search engine is a public service at http://archive.pt and a research platform for web archiving. Like its predecessor Nutch, it runs over Hadoop clusters for distributed computing following the map-reduce paradigm. Its major features include fast full-text search, URL search, phrase search, faceted search (date, format, site), and sorting by relevance and date. The PWA search engine is highly scalable and its architecture is flexible enough to allow different configurations to be deployed for different needs. Currently, it serves a full-text searchable archive collection of 180 million documents dating from 1996 to 2010.

http://www.arquivo.pt

GNU General Public License v3.0

41 stars 7 forks

issues

Monitor Arquivo.pt network

#1403 vitgou opened 5 days ago
0
link to arquivo.pt hardcoded on image search

#1402 VascoRatoFCCN closed 1 day ago
1
Add link to Catalog on sobre and www footer

#1401 dcgomes closed 1 day ago
0
Configure hadoop cluster

#1400 vitgou closed 3 weeks ago
0
Create public arquivo.pt/datasets

#1399 vitgou closed 3 weeks ago
0
Verify if autopatching is working

#1398 vitgou opened 3 weeks ago
0
Move Arquivo.pt to Let's Encrypt

#1397 vitgou closed 5 days ago
1
CompletePage doesn't work on development

#1396 VascoRatoFCCN opened 1 month ago
0
Remove webapp nutchwax support

#1395 VascoRatoFCCN opened 1 month ago
0
Add api parameter to webapp

#1394 VascoRatoFCCN closed 1 month ago
0
Add language to solr index

#1393 VascoRatoFCCN opened 1 month ago
0
Replace FCCN logo

#1392 VascoRatoFCCN closed 1 month ago
0
Add CAPTCHA - CitationSaver Service

#1391 PedroG1515 opened 2 months ago
0
Dev breaks if certificate is down

#1390 VascoRatoFCCN closed 2 months ago
1
CompletePage displaying accented characters as "�"

#1389 VascoRatoFCCN closed 1 month ago
1
Incomplete list of warcs when indexing for page-search

#1388 VascoRatoFCCN closed 1 day ago
1
Link broken

#1387 VascoRatoFCCN closed 3 months ago
0
"mp_" after date breaks link to web-archived version

#1386 dcgomes closed 2 months ago
2
Update p102.arquivo.pt

#1385 vitgou closed 1 month ago
1
Install browsertrix-crawler p54

#1384 vitgou closed 1 month ago
1
Code refactoring - Query Suggestion Service

#1383 PedroG1515 opened 4 months ago
0
Emails sent to the wrong address

#1382 vitgou opened 4 months ago
0
Consolidate ansible access across Arquivo.pt Infrastructure

#1381 vitgou opened 4 months ago
0
Add links to "Memória descritiva" on the 2018 Award Winners

#1380 dcgomes closed 1 month ago
1
Fix broken links on Work descriptions of 2019 Awards

#1379 dcgomes closed 1 month ago
1
Reviewing and managing machines and Ansible playbooks

#1378 PedroG1515 opened 5 months ago
0
Aggregate statistics to add to arquivo.pt/numbers

#1377 PedroG1515 opened 5 months ago
0
Review configurations Memorial

#1376 PedroG1515 closed 1 month ago
1
Review Uptime Robot and ICINGA alarms

#1375 PedroG1515 opened 5 months ago
0
Recover data from Awstats

#1374 PedroG1515 opened 5 months ago
0
Install 4 new DocumentServers

#1373 PedroG1515 closed 1 month ago
1
Add Ruffle to replay

#1372 VascoRatoFCCN closed 1 month ago
0
Move logs to a different hard drive

#1371 VascoRatoFCCN closed 5 days ago
1
Write documentation for inlinks in Wiki Github

#1370 VascoRatoFCCN opened 5 months ago
0
Create a "cite this page" link on our replay page.

#1369 VascoRatoFCCN opened 6 months ago
0
Block URL requests to textsearch using the q parameter

#1368 VascoRatoFCCN opened 6 months ago
1
SavePageNow does not seem to allow URLs containing "@" such as

#1367 dcgomes closed 3 months ago
1
SavePageNow flickers on https://www.hubermanlab.com/

#1366 dcgomes closed 5 months ago
1
Uptimerobot blocked by spellchecker (403 error)

#1365 VascoRatoFCCN closed 6 months ago
1
No text results for 2022 and onwards

#1364 itsgabrodri opened 7 months ago
3
Improve arquivo.pt accessibility

#1363 VascoRatoFCCN opened 8 months ago
0
Image search "did you mean..." redirecting to page search

#1362 VascoRatoFCCN closed 10 months ago
0
Automate image blocking

#1361 VascoRatoFCCN opened 1 year ago
0
Review image-search-api issues

#1360 VascoRatoFCCN opened 1 year ago
0
When uploading to CitationSaver a progress symbol should be displayed

#1359 dcgomes closed 1 year ago
0
Wrong label on OK button of CitationSaver UI

#1358 dcgomes closed 1 year ago
1
Update CDX filtering script to remove warc/revisits

#1357 VascoRatoFCCN closed 5 months ago
1
Create unit and functional tests for Arquivo404

#1356 VascoRatoFCCN closed 1 year ago
0
Add date range and ability to prioritize most recent memento on arquivo404

#1355 VascoRatoFCCN closed 1 year ago
1
char "~" is not accepted on Suggest form

#1354 dcgomes closed 1 year ago
2
