commoncrawl / ia-web-commons

Web archiving utility library

Apache License 2.0

9 stars 6 forks

issues

WAT extractor: add attributes of the <html> element as metadata

#35 sebastian-nagel opened 1 month ago
0
Upgrade to a recent Hadoop version

#34 sebastian-nagel closed 8 months ago
1
Reduce log level of two classes called by the WAT/WET extractor to avoid that log files are flooded with multiple log messages per WARC record

#33 sebastian-nagel closed 8 months ago
0
WAT extractor: Overlong truncated HTTP request header line throws exception and loss of request record

#32 sebastian-nagel opened 9 months ago
3
Percent-encoded ampersands (&) in URL query string canonicalized incorrectly

#31 tfmorris opened 10 months ago
1
Improve www\d*. prefix handling

#30 tfmorris opened 10 months ago
4
Removal of leading WWWnnn. in URL canonicalization is too aggressive

#29 tfmorris opened 10 months ago
0
Fix URL canonicalization to handle non-UTF-8 encoded characters. Fixes #6

#28 tfmorris opened 10 months ago
1
Non-ASCII/UTF-8 characters lost in WARC-Target-URI during WAT/WET extraction

#27 sebastian-nagel opened 1 year ago
0
WET files may include binary content if HTTP Content-Type header erroneously indicates HTML

#26 sebastian-nagel closed 1 year ago
1
Failed tests: testInterruptibility (org.archive.util.InterruptibleCharSequenceTest): exception not thrown

#25 cronopioelectronico closed 3 years ago
4
WAT generator: do not fail on missing WARC-Filename in warcinfo record

#24 sebastian-nagel closed 4 years ago
0
WAT generator: do not fail on missing WARC-Filename in warcinfo record

#23 sebastian-nagel closed 4 years ago
0
WET extractor: add identified natural language of text content

#22 sebastian-nagel closed 4 years ago
6
WAT extractor: WARC-Date to indicate capture time

#21 sebastian-nagel closed 4 years ago
1
WAT: only unescape complete XML/HTML character entities (fixes #19)

#20 sebastian-nagel closed 4 years ago
1
WAT: only unescape complete XML/HTML character entities

#19 sebastian-nagel closed 4 years ago
2
WAT extraction: handle duplicate HTTP response headers

#18 cldellow opened 4 years ago
0
Replace the org.json dependency by openjson

#17 sebastian-nagel closed 4 years ago
2
replace org.json:json with AOSP json in MetaData

#16 cldellow closed 4 years ago
7
WAT/WET generator performance improvements

#15 sebastian-nagel closed 4 years ago
13
WAT: unescape XML/HTML character entities

#14 sebastian-nagel closed 5 years ago
1
[WET] Missing spaces in parsed content

#13 pipldev closed 6 years ago
1
Java stack overflow while matching cssUrlPattern

#12 sebastian-nagel closed 7 years ago
1
[WAT extraction] Empty HTTP header fields are filled with value from preceding field

#11 sebastian-nagel closed 7 years ago
1
[WAT] Add rel attribute to A@/href links

#10 sebastian-nagel closed 6 years ago
1
Complete HTML link extraction to cover all element attributes of type URI

#9 sebastian-nagel closed 7 years ago
0
Links in onClick property not captured in WAT 'Links' metadata

#8 e271828- closed 6 years ago
6
data-href not captured in WAT 'Links' metadata

#7 e271828- closed 7 years ago
4
WaybackURLKeyMaker to keep non-utf8 percent encodings

#6 sebastian-nagel opened 7 years ago
1
URLParser fails if URL contains empty port

#5 sebastian-nagel closed 7 years ago
0
Add encoding detection to WET text extraction

#4 sebastian-nagel closed 7 years ago
1
Add attribute `property` of HTML meta elements

#3 sebastian-nagel closed 7 years ago
0
WEATGenerator hanging while matching cssUrlPattern

#2 sebastian-nagel closed 7 years ago
2
StringIndexOutOfBoundsException during WAT/WET generation

#1 sebastian-nagel closed 8 years ago
1