archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Implement Python versions of Matchbox utilities #408

Closed: ruebot closed this issue 4 years ago

ruebot commented 4 years ago
| RDD | Scala DF | Python DF |
|-----|----------|-----------|
| ComputeImageSize | | |
| ComputeMD5RDD | ComputeMD5DF | in progress |
| ComputeSHA1RDD | ComputeSHA1DF | in progress |
| DetectLanguageRDD | DetectLanguageDF | in progress |
| DetectMimeTypeTika | DetectMimeTypeTikaDF | |
| ExtractBoilerPipeTextRDD | ExtractBoilerPipeTextDF | |
| ExtractDateRDD | ExtractDateDF | |
| ExtractDomainRDD | ExtractDomainDF | :heavy_check_mark: |
| ExtractImageDetails | | |
| ExtractImageLinksRDD | ExtractImageLinksDF | |
| ExtractLinksRDD | ExtractLinksDF | |
| ExtractTextFromPDFs | - | |
| GetExtensionMimeRDD | GetExtensionMimeDF | |
| RemoveHTMLRDD | RemoveHTMLDF | :heavy_check_mark: |
| RemoveHTTPHeaderRDD | RemoveHTTPHeaderDF | :heavy_check_mark: |
| NERClassifier | - | |
| RemovePrefixWWW | RemovePrefixWWWDF | :heavy_check_mark: |

Stealing @SinghGursimran's very helpful tables here :smiley: