bigscience-workshop / catalogue_data

Scripts to prepare catalogue data
Apache License 2.0

Add deduplication on url level #36

Closed thomasw21 closed 2 years ago

thomasw21 commented 2 years ago

I still need to figure out how to remove typical patterns, {url} vs {url}/commentaires for example.

TevenLeScao commented 2 years ago

Can we use eval on the meta field instead? After looking it up, it seems some datasets have an actual dict rather than a dict's string representation, but that's nothing we can't get around, e.g.:

try:
    # meta is stored as the string repr of a dict: parse it first
    print(eval(dataset[0]["meta"]).keys())
except TypeError:
    # meta is already a dict: eval() rejects non-string input
    print(dataset[0]["meta"].keys())

And then you can just do meta["url"] instead of the regex, which feels less brittle. I'd also suggest splitting on ? and doing url = meta["url"].split("?")[0], as I've seen URLs in the pseudocrawl that differ only by query-string artifacts, for example:

'https://www.mediapart.fr/journal/france/261017/sivens-les-chiffres-qui-montrent-une-justice-deux-vitesses?onglet=full'
'https://www.mediapart.fr/journal/france/261017/sivens-les-chiffres-qui-montrent-une-justice-deux-vitesses'
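
Putting both suggestions together, a minimal sketch of what the dedup pass could look like (normalize_url and dedup_by_url are hypothetical helper names, and ast.literal_eval stands in for eval as a safer way to parse a dict's string representation):

import ast

def normalize_url(record):
    """Return the record's URL with any query string stripped."""
    meta = record["meta"]
    if isinstance(meta, str):
        # meta stored as the string repr of a dict: parse it safely
        meta = ast.literal_eval(meta)
    return meta["url"].split("?")[0]

def dedup_by_url(records):
    """Keep only the first record seen for each normalized URL."""
    seen, kept = set(), []
    for record in records:
        url = normalize_url(record)
        if url not in seen:
            seen.add(url)
            kept.append(record)
    return kept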

thomasw21 commented 2 years ago

I'd be surprised if the pseudocrawl weren't consistent, i.e. half of the meta fields being dicts and the other half strings. The regex is applied to the url and does exactly what you describe. The reason I went with the weird regex pattern is that I wanted to try fixing urls that have a /commentaires suffix, for example, which seems to be essentially the same issue as in the Mediapart example (still haven't figured that one out yet).
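
For illustration, one form such a suffix rule could take (a sketch only, not the actual regex from this PR; /commentaires is the one pattern named in this thread):

import re

# Hypothetical normalization: treat {url} and {url}/commentaires as the same
# page. The PR's real pattern may differ; this covers only the example above.
COMMENT_SUFFIX = re.compile(r"/commentaires/?$")

def strip_comment_suffix(url: str) -> str:
    return COMMENT_SUFFIX.sub("", url)

# Both variants then normalize to the same dedup key:
# strip_comment_suffix("https://example.org/article/commentaires")
# -> "https://example.org/article"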