Closed SaulLu closed 2 years ago
I'm planning to merge https://github.com/bigscience-workshop/catalogue_data/pull/58, which should make things easier for you.
@thomasw21, I've just rebased this PR to adapt it to the new API proposed in #58. I've also included your suggestion :smile:
This PR proposes changing how the URL is simplified before it is hashed for deduplication.
The first modification is to keep the id in the query parameters.
More testing should be done to see whether other query parameters in the URLs are needed to distinguish two examples. For example, in the lm_en_pseudocrawl-filtered_619_www_qut_edu_au dataset, I see URLs of the form https://www.qut.edu.au/study/unit?unitCode=ERB316 . I don't know whether this page overlaps with https://www.qut.edu.au/study/unit?unitCode=LLB346 (as it stands, this dedup assumes that the two overlap).
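To make the question concrete, here is a minimal sketch of the idea, not the code in this PR. The function names `simplify_url` and `url_hash`, the `kept_params` argument, and the choice of MD5 are all assumptions for illustration; the real simplification may treat the scheme, host, or path differently.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit


def simplify_url(url, kept_params=("id",)):
    """Hypothetical helper: drop the scheme, the fragment, and every query
    parameter that is not listed in kept_params."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in kept_params
    ]
    simplified = parts.netloc.lower() + parts.path
    if kept:
        # Sort so that parameter order does not change the hash.
        simplified += "?" + urlencode(sorted(kept))
    return simplified


def url_hash(url, kept_params=("id",)):
    """Hypothetical helper: hash the simplified URL for deduplication."""
    return hashlib.md5(simplify_url(url, kept_params).encode("utf-8")).hexdigest()


a = "https://www.qut.edu.au/study/unit?unitCode=ERB316"
b = "https://www.qut.edu.au/study/unit?unitCode=LLB346"

# Keeping only "id": both URLs collapse to the same key and count as duplicates.
print(url_hash(a) == url_hash(b))

# Also keeping "unitCode": the two URLs stay distinct.
print(url_hash(a, ("id", "unitCode")) == url_hash(b, ("id", "unitCode")))
```

In this sketch, making the set of kept parameters configurable would let it be adjusted per dataset if testing shows that parameters such as unitCode do distinguish pages.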