archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Remove ExtractImageDetailsDF.scala #464

Closed by ruebot 4 years ago

ruebot commented 4 years ago

ExtractImageDetailsDF.scala came in while we were working on #223. It was needed at the time, but it is now largely redundant with ImageInformationExtractor.scala.

  1. ImageInformationExtractor.scala is tied to the ImageInformationExtractor spark-submit job.
  2. ExtractImageDetailsDF is standalone, and is not documented anywhere other than in its doc comments.

The main difference between the two is that ExtractImageDetailsDF includes the bytes column.

I propose we remove this now, prior to the 1.0.0 release. If there is demand for binary extraction jobs in the future, we can add them, or just add a flag to the current jobs to include the bytes column :wink:
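To make the flag idea concrete, here is a minimal sketch of what an opt-in bytes column could look like. It is plain Scala with no Spark dependency, and everything in it is hypothetical: the object name `ImageColumnsSketch`, the `includeBytes` parameter, and every column name other than `bytes` are assumptions for illustration, not the toolkit's actual API.

```scala
// Hypothetical sketch only; this is not code from the toolkit.
object ImageColumnsSketch {
  // Columns assumed to be common to both extractors (illustrative names).
  val baseColumns: Seq[String] =
    Seq("url", "filename", "width", "height", "md5")

  // Returns the columns a job would select.
  // The binary payload is only appended when the caller opts in.
  def selectedColumns(includeBytes: Boolean = false): Seq[String] =
    if (includeBytes) baseColumns :+ "bytes" else baseColumns
}
```

With a default of `false`, existing jobs would keep their current output, and a caller wanting the binary data would pass `includeBytes = true`.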

@lintool @ianmilligan1 let me know if there is any strong disagreement with removing this.

ianmilligan1 commented 4 years ago

This rationale makes sense to me! I'll look at the PR and can merge it unless there are any strong objections from @lintool.