edgi-govdata-archiving / web-monitoring-db

An HTTP API for tracking and annotating changes to a set of web pages.
https://api.monitoring.envirodatagov.org/
GNU General Public License v3.0

Extract SURT into a separate gem #767

Open Mr0grog opened 3 years ago

Mr0grog commented 3 years ago

This project has a nearly complete Ruby port of the Internet Archive’s SURT Python package buried in the app/lib/ directory: https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/3bb7e8a8960af75f7d05be86f58a88d055cfc79e/app/lib/surt.rb#L3-L20

I wrote it because we needed URL canonicalization tools and none of the existing Ruby ones I could find quite met our needs. Having a method that roughly matched the Internet Archive’s was also advantageous, and nobody had written a Ruby port of SURT.
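For readers unfamiliar with SURT (Sort-friendly URI Reordering Transform): it rewrites a URL so the host labels come first in reversed order, which makes URLs from the same domain sort together. The snippet below is only a rough sketch of that core idea. `simple_surt` is a hypothetical helper written for illustration; it is not the code in `app/lib/surt.rb` and it skips most of the canonicalization rules that the Python package applies.

```ruby
require 'uri'

# Hypothetical helper for illustration only; not the project's Surt module.
# It handles plain http(s) URLs with a host and ignores ports, userinfo,
# fragments, and the other canonicalization rules a full port would apply.
def simple_surt(url)
  uri = URI.parse(url)

  # Lowercase the host, drop a leading "www.", and reverse the labels.
  host = uri.host.downcase.sub(/\Awww\./, '')
  reversed_host = host.split('.').reverse.join(',')

  # An empty path is treated as the root path.
  path = uri.path.empty? ? '/' : uri.path
  query = uri.query ? "?#{uri.query}" : ''

  "#{reversed_host})#{path}#{query}"
end

puts simple_surt('http://www.example.com/about?id=1')
# prints "com,example)/about?id=1"
```

The scheme is dropped in this sketch, so `http` and `https` versions of the same page produce the same key.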

Since we have generally been working to break the more reusable, abstract pieces out of the web monitoring projects, this is a good candidate for that on the Ruby side. It might be nice to extract it and publish it as a Ruby gem. (Gem name: SURT, repo name: edgi-govdata-archiving/ruby-surt)

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.