StampyAI / alignment-research-dataset

Stampy's copy of Alignment Research Dataset scraper
https://huggingface.co/datasets/StampyAI/alignment-research-dataset
MIT License

Improve catching duplicate urls #163

Closed: ccstan99 closed this issue 1 year ago

ccstan99 commented 1 year ago

Many forms of the same url aren't getting caught as duplicates. Might want to strip out some of these before generating hash_ids:
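
For instance (a hypothetical illustration, not the dataset's actual hash_id code), any id derived from the raw url string treats these as three distinct entries even though they point at the same page:

    import hashlib

    variants = [
        "http://www.example.com/post/",
        "https://example.com/post",
        "https://example.com/post/index.html",
    ]
    # Three different digests for what is really one page
    print({hashlib.sha1(u.encode()).hexdigest()[:8] for u in variants})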

Thomas-Lemoine commented 1 year ago

Seems to simply be:

    @staticmethod
    def _normalize_url(url: str) -> str:
        # ending '/'
        url = url.rstrip("/")

        # Remove http and use https consistently
        url = url.replace("http://", "https://")

        # Remove www
        url = url.replace("https://www.", "https://")

        # Remove index.html or index.htm
        url = re.sub(r'/index\.html?$', '', url)

        # Convert youtu.be links to youtube.com
        youtube_short_match = re.match(r'https://youtu\.be/([a-zA-Z0-9_-]+)', url)
        if youtube_short_match:
            video_id = youtube_short_match.group(1)
            url = f'https://youtube.com/watch?v={video_id}'

        # Additional rules for mirror domains can be added here

        # agisafetyfundamentals.com -> aisafetyfundamentals.com
        url = url.replace("https://agisafetyfundamentals.com", "https://aisafetyfundamentals.com")

        return url

and then we add data['url'] = self._normalize_url(data.get('url') or '') to make_data_entry. It does change the default 'url' value from None to '' again, though. Would a fix (if we wanted to avoid this) be to do self._normalize_url(data.get('url') or '') or None instead?

edit: tested this, and yeah, or None could be added to make empty strings evaluate to None: since '' is falsy, it falls through to the next element in the or chain, which ends with None.
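
A minimal sketch of how that chain behaves, assuming _normalize_url from the snippet above is in scope as a plain function and returns '' unchanged for an empty string:

    # Missing and empty urls both end up as None; real urls get normalized.
    for raw in (None, '', 'http://www.example.com/post/'):
        normalized = _normalize_url(raw or '') or None
        # None -> None, '' -> None,
        # 'http://www.example.com/post/' -> 'https://example.com/post'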

Thomas-Lemoine commented 1 year ago

although if other examples come up, let me know

Thomas-Lemoine commented 1 year ago

A thought I'm having now: this seems dangerously likely to make the source_url and url differ in cases where they'd be expected to be the same. I could check whether the source_url exists and, if so, alter it like I did with the url, but then again the source_url is hard-coded in places, so that's a risk too.
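
Something like this could work (just a sketch of the idea, not how make_data_entry is actually structured), only touching source_url when it is present:

    # Only normalize source_url when it's set, so absent / hard-coded ones stay as-is
    # and url and source_url remain comparable after normalization.
    if data.get('source_url'):
        data['source_url'] = self._normalize_url(data['source_url'])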

Thomas-Lemoine commented 1 year ago

Lots of things that use url as a done_key might break if they're compared at the wrong times, actually.
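
e.g. (hypothetical values) an item recorded as done under its raw url before this change would no longer be recognised afterwards:

    done_urls = {'http://www.example.com/post/'}                # stored before normalization
    incoming = _normalize_url('http://www.example.com/post/')   # 'https://example.com/post'
    incoming in done_urls                                        # False, so it gets re-processed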

Thomas-Lemoine commented 1 year ago

@mruwnik Should it be closed if it's dealt with now? I'm not sure since maybe we want to improve that duplicate-url-catching further in the future, in which case we might want to keep it as an issue?

mruwnik commented 1 year ago

If you think there is still work to be done, then leave it open. If it's decided in the future to improve it, then a new issue can be added. So it's really up to you :D

Thomas-Lemoine commented 1 year ago

ok, thanks! I will close it since the duplicates mentioned above have all been dealt with