new way to simplify dedup url

SaulLu commented 2 years ago

This PR proposes changing how the URL is simplified before it is hashed for deduplication.

The first modification is to keep the id in the query parameters.

More testing is needed to check whether other query parameters in the URLs are also needed to distinguish two examples. For example, in the lm_en_pseudocrawl-filtered_619_www_qut_edu_au dataset, I see URLs of the form https://www.qut.edu.au/study/unit?unitCode=ERB316. I don't know whether this example really overlaps with https://www.qut.edu.au/study/unit?unitCode=LLB346 (as it stands, this dedup assumes that they do overlap).
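
To make the concern concrete, here is a minimal sketch of the idea, not the code in this PR. The names `simplify_url`, `url_hash`, and `keep_params` are hypothetical, and the choice of MD5 as the hash is an assumption. The sketch drops the scheme, fragment, and every query parameter except those explicitly kept, then hashes the result:

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit


def simplify_url(url, keep_params=("id",)):
    """Hypothetical helper: reduce a URL to host + path + whitelisted query parameters."""
    parts = urlsplit(url)
    # Keep only the query parameters that are assumed to identify a distinct page.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep_params
    ]
    kept.sort()  # make the result independent of parameter order
    simplified = parts.netloc.lower() + parts.path
    if kept:
        simplified += "?" + urlencode(kept)
    return simplified


def url_hash(url, keep_params=("id",)):
    """Hypothetical helper: hash the simplified URL (MD5 is an assumption)."""
    return hashlib.md5(simplify_url(url, keep_params).encode("utf-8")).hexdigest()


a = "https://www.qut.edu.au/study/unit?unitCode=ERB316"
b = "https://www.qut.edu.au/study/unit?unitCode=LLB346"

# With only "id" kept, both URLs collapse to the same key and are treated as duplicates.
print(url_hash(a) == url_hash(b))
# Keeping "unitCode" as well would make them distinct.
print(url_hash(a, ("id", "unitCode")) == url_hash(b, ("id", "unitCode")))
```

In this sketch the two qut.edu.au URLs hash identically unless `unitCode` is added to the kept parameters, which is exactly the case that needs checking.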

thomasw21 commented 2 years ago

I'm planning to merge https://github.com/bigscience-workshop/catalogue_data/pull/58, which should make things easier for you.

SaulLu commented 2 years ago

@thomasw21, I've just rebased this PR to adapt it to the new API proposed in #58. I've also included your suggestion :smile:

bigscience-workshop / catalogue_data

new way to simplify dedup url #57