Provide a cleaned URL

Some services include garbage in their URLs. For example, Instagram posts linked on the profile page have a useless taken-by parameter carrying the username, and Facebook recently (a few weeks ago) started including some parameters, most likely for tracking, in the post links: __xts__, __tn__, and eid. The Instagram case is not too problematic, since that parameter has a meaning and is constant. The Facebook one is definitely undesired, since those parameters are carried along when forwarding the link to another person or piece of software and can probably be used to identify the scraping user. When using snscrape for archival purposes, they can also make it very difficult to find the archived page later, because the same post may be stored under many different URLs.
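To make the idea concrete, here is a minimal sketch of what stripping such parameters could look like. It is not snscrape code: the function name clean_url, the GARBAGE_PARAMS table, and the example URLs are all hypothetical. It also assumes that a parameter may appear with an index suffix such as __xts__[0] and should then be dropped as well.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical per-service denylist; only the parameters named in this issue.
GARBAGE_PARAMS = {
    'instagram.com': {'taken-by'},
    'facebook.com': {'__xts__', '__tn__', 'eid'},
}


def clean_url(url):
    """Return url without the query parameters considered garbage for its host."""
    parts = urlsplit(url)
    host = parts.hostname or ''
    drop = set()
    for domain, params in GARBAGE_PARAMS.items():
        # Match the bare domain and any subdomain (www., m., ...).
        if host == domain or host.endswith('.' + domain):
            drop = params
            break
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        # Assumption: 'name[0]' style keys count as the parameter 'name'.
        if key.split('[')[0] not in drop
    ]
    # Note: urlencode re-encodes the remaining parameters, so their
    # percent-encoding may differ slightly from the original URL.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(clean_url('https://www.instagram.com/p/abc/?taken-by=someuser'))
print(clean_url('https://www.facebook.com/example/posts/1?__tn__=-R&eid=ARxyz'))
```

URLs from hosts that are not in the table pass through with their query intact, which seems like the safer default while the list of garbage parameters is still open.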

The Item instances yielded by the modules should carry both the original and the cleaned URL, and the CLI should provide options to print either or both of these variants. The default should presumably be to print the cleaned URL.
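A rough sketch of how this could fit together follows. The field name cleaned_url, the --url-variant option, and the helper functions are assumptions made up for illustration; they do not describe the existing Item class or CLI.

```python
import argparse
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical shape: the URL as scraped, plus the cleaned variant.
    url: str
    cleaned_url: str


def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical option; the cleaned URL is the default, as proposed above.
    parser.add_argument(
        '--url-variant',
        choices=('cleaned', 'original', 'both'),
        default='cleaned',
        help='which URL variant to print for each item',
    )
    return parser


def format_item(item, variant):
    """Return the output line for item according to the chosen variant."""
    if variant == 'original':
        return item.url
    if variant == 'both':
        # Tab-separated so the two variants stay easy to split apart again.
        return f'{item.cleaned_url}\t{item.url}'
    return item.cleaned_url


if __name__ == '__main__':
    args = build_parser().parse_args()
    example = Item(
        url='https://www.instagram.com/p/abc/?taken-by=someuser',
        cleaned_url='https://www.instagram.com/p/abc/',
    )
    print(format_item(example, args.url_variant))
```

Keeping both values on the item means the choice of what to print stays entirely in the CLI, and library users can pick whichever variant they need.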

What exactly counts as garbage still needs to be figured out, though. For example, photo posts on Facebook typically include a type=3 parameter, which I believe determines the way the picture is displayed. I'm not sure whether this should be stripped.

JustAnotherArchivist / snscrape

Provide a cleaned URL #10