Closed ccstan99 closed 1 year ago
seems to simply be

```python
import re  # needed for the regex substitutions below


@staticmethod
def _normalize_url(url: str) -> str:
    # Strip the trailing '/'
    url = url.rstrip("/")
    # Use https consistently instead of http
    url = url.replace("http://", "https://")
    # Remove www
    url = url.replace("https://www.", "https://")
    # Remove a trailing index.html or index.htm
    url = re.sub(r'/index\.html?$', '', url)
    # Convert youtu.be links to youtube.com
    youtube_short_match = re.match(r'https://youtu\.be/([a-zA-Z0-9_-]+)', url)
    if youtube_short_match:
        video_id = youtube_short_match.group(1)
        url = f'https://youtube.com/watch?v={video_id}'
    # Additional rules for mirror domains can be added here
    # agisafetyfundamentals.com -> aisafetyfundamentals.com
    url = url.replace("https://agisafetyfundamentals.com", "https://aisafetyfundamentals.com")
    return url
```
and then we add `data['url'] = self._normalize_url(data.get('url') or '')` to make_data_entry. This does change the default 'url' value from None to '' again. If we wanted to avoid that, would a fix be to use `self._normalize_url(data.get('url') or '') or None` instead?
edit: tested this, and yes, `or None` can be appended to make empty strings evaluate to None: they're falsy, so the `or` chain falls through to its last element, which is None.
although if other examples come up, let me know
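The falsy-chain behaviour described above can be sketched as follows. `normalize_or_none` is a hypothetical helper, and the lambda is a stand-in for `self._normalize_url`, just to keep the example self-contained:

```python
def normalize_or_none(url):
    """Hypothetical wrapper: normalize the url, but map empty results back to None."""
    normalize = lambda u: u.rstrip("/")  # stand-in for self._normalize_url
    # An empty string is falsy, so `... or None` falls through to None
    return normalize(url or "") or None

print(normalize_or_none(None))                           # None
print(normalize_or_none(""))                             # None
print(normalize_or_none("https://openai.com/charter/"))  # https://openai.com/charter
```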
A thought I'm having now: this seems dangerously likely to make source_url and url differ in cases where they would be expected to be the same. I could check whether source_url exists and, if so, normalize it the same way as url, but the source_url is hard-coded in places, so that's a risk too.
Actually, lots of things that use url as a done_key might break if they are compared at the wrong times.
@mruwnik Should this be closed now that it's dealt with? I'm not sure, since we may want to improve the duplicate-url catching further in the future, in which case we might want to keep it open as an issue?
If you think there is still work to be done, then leave it open. If it's decided in the future to improve it, then a new issue can be added. So it's really up to you :D
ok, thanks! I will close it since the duplicates mentioned above have all been dealt with
Many forms of the same url aren't getting caught as duplicates. Might want to strip out some of these before generating hash_ids:

- trailing '/', e.g. https://openai.com/charter vs https://openai.com/charter/
- www, e.g. https://www.aisafetyfundamentals.com/governance-blog/ vs https://aisafetyfundamentals.com/governance-blog
- starting with http:// vs https://
- ending with index.htm or index.html, e.g. http://www.domain.com/ vs http://www.domain.com/index.html
- YouTube videos can be https://youtu.be/k_zz3239DA0 or https://www.youtube.com/watch?v=k_zz3239DA0
- then there are mirror domains, e.g. https://www.aisafetyfundamentals.com/ vs https://www.agisafetyfundamentals.com/