BrockDSL / AOYTK

All Our Yesterdays Toolkit
https://brockdsl.github.io/AOYTK
Creative Commons Zero v1.0 Universal

Remove boilerplate removes all text content #4

Open s-langdon opened 1 year ago

s-langdon commented 1 year ago

When creating derivatives with the Derivative Generator notebook, selecting "text content without boilerplate" removes all of the text content from the archives in both sample datasets ("ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc" and "ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc").

We need to determine whether this is caused by how the boilerplate detection methods behave on these particular datasets or by a bug in the AOYTK code.
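
One possible way to separate the two causes is to run the boilerplate-removal step over a handful of records and count how many non-empty pages come back empty. The sketch below is only an illustration: `summarize_boilerplate_removal` and the `strip_boilerplate` callable are hypothetical names, not part of AOYTK, and the callable would need to be replaced with whatever function the notebook actually applies.

```python
from typing import Callable, Dict, Iterable


def summarize_boilerplate_removal(
    pages: Iterable[str],
    strip_boilerplate: Callable[[str], str],
) -> Dict[str, int]:
    """Count how many non-empty pages become empty after boilerplate removal.

    `strip_boilerplate` is a stand-in (assumption) for the real removal step.
    """
    total = 0
    non_empty_before = 0
    emptied = 0
    for page in pages:
        total += 1
        if not page or not page.strip():
            # Page had no text to begin with, so it tells us nothing.
            continue
        non_empty_before += 1
        stripped = strip_boilerplate(page) or ""
        if not stripped.strip():
            emptied += 1
    return {
        "total": total,
        "non_empty_before": non_empty_before,
        "emptied": emptied,
    }
```

If `emptied` equals `non_empty_before` for both sample archives and also for a hand-written page that clearly contains article text, that would point toward a bug in how the step is called rather than toward the detector being too aggressive on these datasets.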