Closed thomasw21 closed 2 years ago
Can we use eval on the meta field instead? After looking it up, it seems some datasets have an actual dict rather than a dict's string representation, but that's nothing we can't get around with e.g.:
try:
    print(eval(dataset[0]["meta"]).keys())
except TypeError:
    print(dataset[0]["meta"].keys())
And then you can just do meta["url"] instead of the regex, which feels less brittle. I'd also suggest splitting on ? and doing url = meta["url"].split("?")[0], as I've seen URLs in the pseudocrawl that differ only by access artifacts, for example:
'https://www.mediapart.fr/journal/france/261017/sivens-les-chiffres-qui-montrent-une-justice-deux-vitesses?onglet=full'
'https://www.mediapart.fr/journal/france/261017/sivens-les-chiffres-qui-montrent-une-justice-deux-vitesses'
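To make the suggestion concrete, here is a minimal sketch of the two steps combined. The helper names parse_meta and canonical_url are made up for illustration, and it assumes the string form of meta is a Python literal (a dict's repr). It swaps eval for ast.literal_eval, which only evaluates literals and so is safer on crawled data; that swap is my own suggestion, not something already in the PR.

```python
import ast


def parse_meta(meta):
    """Return meta as a dict, whether it is stored as a dict or as a dict's string.

    Hypothetical helper: assumes the string form is a Python literal.
    """
    if isinstance(meta, dict):
        return meta
    # literal_eval parses Python literals only, so it will not run arbitrary code
    return ast.literal_eval(meta)


def canonical_url(meta):
    """Return the url from meta with everything after the first '?' dropped."""
    url = parse_meta(meta)["url"]
    return url.split("?")[0]
```

With this, both mediapart URLs above map to the same string, so they can be grouped or deduplicated on the result of canonical_url.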
I'd be surprised if the pseudocrawl weren't all consistent, i.e. if half were dicts and the other half strings.
The regex is applied on the url, and does exactly what you said. The reason I went with the weird regex pattern is that I wanted to try fixing urls that have a /commentaires suffix, for example, which basically seems to be the same as in the mediapart example (still haven't figured that one out yet).
I still need to figure out how to remove typical patterns, {url} vs {url}/commentaires for example.
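One possible direction for the {url} vs {url}/commentaires case, as a rough sketch only: keep an explicit list of suffixes known to point at the same page and strip them. KNOWN_SUFFIXES and strip_known_suffixes are hypothetical names, and the list holds only the one pattern mentioned here; whether this generalizes to other sites is an open question.

```python
# Hypothetical list of path suffixes assumed to lead to the same article
KNOWN_SUFFIXES = ("/commentaires",)


def strip_known_suffixes(url, suffixes=KNOWN_SUFFIXES):
    """Drop a trailing slash, then drop the first matching known suffix, if any."""
    # Note: the trailing slash is removed even when no suffix matches
    url = url.rstrip("/")
    for suffix in suffixes:
        if url.endswith(suffix):
            return url[: -len(suffix)]
    return url
```

An explicit list is less clever than a regex, but it makes it obvious which patterns are being collapsed and is easy to extend as new ones show up.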