archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0
Implement Python versions of Matchbox utilities
#408 · Closed
ruebot closed 4 years ago
ruebot commented 4 years ago

| RDD | Scala DF | Python DF |
|-----|----------|-----------|
| ComputeImageSize | | |
| ComputeMD5RDD | ComputeMD5DF | in progress |
| ComputeSHA1RDD | ComputeSHA1DF | in progress |
| DetectLanguageRDD | DetectLanguageDF | in progress |
| DetectMimeTypeTika | DetectMimeTypeTikaDF | |
| ExtractBoilerPipeTextRDD | ExtractBoilerPipeTextDF | |
| ExtractDateRDD | ExtractDateDF | |
| ExtractDomainRDD | ExtractDomainDF | :heavy_check_mark: |
| ExtractImageDetails | | |
| ExtractImageLinksRDD | ExtractImageLinksDF | |
| ExtractLinksRDD | ExtractLinksDF | |
| ExtractTextFromPDFs | - | |
| GetExtensionMimeRDD | GetExtensionMimeDF | |
| RemoveHTMLRDD | RemoveHTMLDF | :heavy_check_mark: |
| RemoveHTTPHeaderRDD | RemoveHTTPHeaderDF | :heavy_check_mark: |
| NERClassifier | - | |
| RemovePrefixWWW | RemovePrefixWWWDF | :heavy_check_mark: |

Stealing @SinghGursimran's very helpful tables here :smiley: