commoncrawl/ia-web-commons
Web archiving utility library
Apache License 2.0
9 stars
6 forks
issues
WAT extractor: add attributes of the <html> element as metadata
#35
sebastian-nagel
opened
1 month ago
0
Upgrade to a recent Hadoop version
#34
sebastian-nagel
closed
8 months ago
1
Reduce log level of two classes called by the WAT/WET extractor to avoid that log files are flooded with multiple log messages per WARC record
#33
sebastian-nagel
closed
8 months ago
0
WAT extractor: Overlong truncated HTTP request header line throws exception and loss of request record
#32
sebastian-nagel
opened
9 months ago
3
Percent-encoded ampersands (&) in URL query string canonicalized incorrectly
#31
tfmorris
opened
10 months ago
1
Improve www\d*. prefix handling
#30
tfmorris
opened
10 months ago
4
Removal of leading WWWnnn. in URL canonicalization is too aggressive
#29
tfmorris
opened
10 months ago
0
Fix URL canonicalization to handle non-UTF-8 encoded characters. Fixes #6
#28
tfmorris
opened
10 months ago
1
Non-ASCII/UTF-8 characters lost in WARC-Target-URI during WAT/WET extraction
#27
sebastian-nagel
opened
1 year ago
0
WET files may include binary content if HTTP Content-Type header erroneously indicates HTML
#26
sebastian-nagel
closed
1 year ago
1
Failed tests: testInterruptibility (org.archive.util.InterruptibleCharSequenceTest): exception not thrown
#25
cronopioelectronico
closed
3 years ago
4
WAT generator: do not fail on missing WARC-Filename in warcinfo record
#24
sebastian-nagel
closed
4 years ago
0
WAT generator: do not fail on missing WARC-Filename in warcinfo record
#23
sebastian-nagel
closed
4 years ago
0
WET extractor: add identified natural language of text content
#22
sebastian-nagel
closed
4 years ago
6
WAT extractor: WARC-Date to indicate capture time
#21
sebastian-nagel
closed
4 years ago
1
WAT: only unescape complete XML/HTML character entities (fixes #19)
#20
sebastian-nagel
closed
4 years ago
1
WAT: only unescape complete XML/HTML character entities
#19
sebastian-nagel
closed
4 years ago
2
WAT extraction: handle duplicate HTTP response headers
#18
cldellow
opened
4 years ago
0
Replace the org.json dependency by openjson
#17
sebastian-nagel
closed
4 years ago
2
replace org.json:json with AOSP json in MetaData
#16
cldellow
closed
4 years ago
7
WAT/WET generator performance improvements
#15
sebastian-nagel
closed
4 years ago
13
WAT: unescape XML/HTML character entities
#14
sebastian-nagel
closed
5 years ago
1
[WET] Missing spaces in parsed content
#13
pipldev
closed
6 years ago
1
Java stack overflow while matching cssUrlPattern
#12
sebastian-nagel
closed
7 years ago
1
[WAT extraction] Empty HTTP header fields are filled with value from preceding field
#11
sebastian-nagel
closed
7 years ago
1
[WAT] Add rel attribute to A@/href links
#10
sebastian-nagel
closed
6 years ago
1
Complete HTML link extraction to cover all element attributes of type URI
#9
sebastian-nagel
closed
7 years ago
0
Links in onClick property not captured in WAT 'Links' metadata
#8
e271828-
closed
6 years ago
6
data-href not captured in WAT 'Links' metadata
#7
e271828-
closed
7 years ago
4
WaybackURLKeyMaker to keep non-utf8 percent encodings
#6
sebastian-nagel
opened
7 years ago
1
URLParser fails if URL contains empty port
#5
sebastian-nagel
closed
7 years ago
0
Add encoding detection to WET text extraction
#4
sebastian-nagel
closed
7 years ago
1
Add attribute `property` of HTML meta elements
#3
sebastian-nagel
closed
7 years ago
0
WEATGenerator hanging while matching cssUrlPattern
#2
sebastian-nagel
closed
7 years ago
2
StringIndexOutOfBoundsException during WAT/WET generation
#1
sebastian-nagel
closed
8 years ago
1