JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

Provide a cleaned URL #10

Closed by JustAnotherArchivist 5 years ago

JustAnotherArchivist commented 6 years ago

Some services include garbage in their URLs. For example, Instagram posts linked on the profile page have a useless taken-by parameter carrying the username, and Facebook recently (a few weeks ago) started including some parameters, most likely for tracking, in the post links: __xts__, __tn__, and eid. The Instagram case is not too problematic, since that parameter has a meaning and is constant. The Facebook one is definitely undesired: those parameters are carried along when the link is forwarded to another person or piece of software, and they can probably be used to identify the scraping user. When using snscrape for archival purposes, the extra parameters can also make it very difficult to find the archived page later.

The Item instances yielded by the modules should carry both the original and the cleaned URL, and the CLI should provide options to print either or both of these variants. The default should presumably be to print the cleaned URL.
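To make the proposal concrete, here is a minimal sketch of an item that carries both URL variants, plus a CLI option that picks which one to print. The class layout, the field names original_url and clean_url, and the --url-format option are all hypothetical; none of them is snscrape's actual API.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical item shape; not snscrape's actual Item class."""
    original_url: str  # URL exactly as it appeared on the scraped page
    clean_url: str     # the same URL with the garbage removed


def format_item(item: Item, mode: str = "clean") -> str:
    """Return the text to print for one item in the given mode."""
    if mode == "clean":
        return item.clean_url
    if mode == "original":
        return item.original_url
    if mode == "both":
        # Tab-separated so the two variants are easy to split apart again.
        return f"{item.original_url}\t{item.clean_url}"
    raise ValueError(f"unknown mode: {mode!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Hypothetical option name; the cleaned URL is the default.
    parser.add_argument(
        "--url-format",
        choices=("clean", "original", "both"),
        default="clean",
    )
    return parser
```

With this layout, existing consumers that only want one URL per line keep working with the default, while archival tooling could request both variants and store the mapping.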

What exactly is garbage and what isn't still needs to be figured out though. For example, photo posts on Facebook typically include a type=3 parameter, which I believe determines the way the picture is displayed. I'm not sure if this should be stripped or not.

JustAnotherArchivist commented 5 years ago

According to my tests today, the type=3 parameter does not have any visual influence on the resulting page. Photo and video URLs on Facebook also have an extraneous path component, as in /username/photos/a.12345/67890/?... The a can take a number of different values, and the part after it can be ridiculously long rather than just a shortish number. The clean URL should probably be /username/photos/67890/.
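The cleaning rules discussed so far could be sketched roughly as follows, using only the standard library. The function name clean_facebook_url is hypothetical, the parameter list only covers what this issue mentions and is surely incomplete, and two things are assumptions: that the final path component is numeric, and that __xts__ may show up with a bracketed index such as __xts__[0].

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters named in this issue; an assumption, not an exhaustive list.
TRACKING_PARAMS = {"__xts__", "__tn__", "eid"}

# /username/photos/<extra>/<id>/ or /username/videos/<extra>/<id>/
# Assumes the final component is a plain number.
_EXTRA_COMPONENT = re.compile(r"^/([^/]+)/(photos|videos)/[^/]+/(\d+)/?$")


def _is_garbage(name: str, value: str) -> bool:
    # Treat "__xts__[0]" like "__xts__" (assumed bracketed form).
    if name.split("[", 1)[0] in TRACKING_PARAMS:
        return True
    # Only type=3 was tested, so only that exact pair is dropped.
    return name == "type" and value == "3"


def clean_facebook_url(url: str) -> str:
    """Drop the known garbage parameters and the extra path component."""
    parts = urlsplit(url)
    # Paths that do not match the pattern are left unchanged.
    path = _EXTRA_COMPONENT.sub(r"/\1/\2/\3/", parts.path)
    query = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not _is_garbage(name, value)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(query), parts.fragment)
    )
```

For example, clean_facebook_url("https://www.facebook.com/username/photos/a.12345/67890/?type=3&__tn__=-R") returns "https://www.facebook.com/username/photos/67890/". Unknown parameters are kept, which errs on the side of not breaking a link.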

In general, I think I'll go with the minimal URL that still yields essentially the same page.