Closed JustAnotherArchivist closed 5 years ago
According to my tests today, the `type=3` parameter does not have any visual influence on the resulting page.
Photo and video URLs on Facebook also have an extraneous path component like `/username/photos/a.12345/67890/?...` (instead of `a`, it can also have a number of different values, and the part after it can be ridiculously long rather than just a shortish number). The clean URL should probably be `/username/photos/67890/`.
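A rough sketch of the path cleanup described above, in Python. The regular expression assumes the path always has the shape `/<username>/photos/<something>/<id>/`, which is only a guess based on the example; the function name `clean_photo_url` is hypothetical and not part of snscrape. It also drops the query string entirely, which is an assumption, and whether Facebook actually serves the shortened URL is not checked here.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed path shape: /<username>/photos/<extra component>/<id>/
# The extra component (e.g. "a.12345") is matched loosely because its
# exact format is not known.
_PHOTO_PATH = re.compile(r'^/(?P<user>[^/]+)/photos/[^/]+/(?P<id>[^/]+)/?$')


def clean_photo_url(url):
    """Drop the extra path component and the query string from a photo URL.

    URLs that do not match the assumed shape are returned unchanged.
    """
    parts = urlsplit(url)
    match = _PHOTO_PATH.match(parts.path)
    if match is None:
        return url
    path = f"/{match['user']}/photos/{match['id']}/"
    # Query and fragment are discarded on purpose (assumption).
    return urlunsplit((parts.scheme, parts.netloc, path, '', ''))
```

Returning unmatched URLs untouched keeps the function safe to apply to every link, including ones that are already in the short form.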
In general, I think I'll go with the minimal URL that doesn't affect the page significantly.
Some services include garbage in their URLs. For example, Instagram posts linked on the profile page have a useless `taken-by` parameter carrying the username, and Facebook recently (a few weeks ago) started including some (most likely tracking) parameters `__xts__`, `__tn__`, and `eid` in the post links. While the Instagram case is not too problematic, since that parameter has a meaning and is constant, the Facebook one is definitely undesired since those parameters are carried along when forwarding a link to another person or software and can probably be used to identify the scraping user. When using snscrape for archival purposes, it can also make it very difficult to find the archived page later.

The `Item` instances yielded by the modules should carry both the original and the cleaned URL, and the CLI should provide options to print either or both of these variants. The default should presumably be to print the cleaned URL.

What exactly is garbage and what isn't still needs to be figured out though. For example, photo posts on Facebook typically include a `type=3` parameter, which I believe determines the way the picture is displayed. I'm not sure if this should be stripped or not.
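To make the proposal more concrete, here is a minimal sketch in Python of stripping the parameters named above while keeping both URL variants. Everything in it is illustrative: `TRACKING_PARAMS`, `clean_url`, `UrlItem`, and `make_item` are hypothetical names, not snscrape's actual classes or API. The blocklist only contains the parameters mentioned in this issue, and treating indexed keys such as `__xts__[0]` like their base name is an assumption. `type=3` is deliberately left alone since it is undecided whether it counts as garbage.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical blocklist built from the parameters named in this issue.
TRACKING_PARAMS = {'__xts__', '__tn__', 'eid', 'taken-by'}


def clean_url(url):
    """Return the URL with blocklisted query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        # Assumption: "__xts__[0]" should be treated like "__xts__".
        if key.split('[', 1)[0] not in TRACKING_PARAMS
    ]
    # Note: urlencode may re-encode the remaining parameters differently
    # from how they appeared in the original URL.
    return urlunsplit(parts._replace(query=urlencode(kept)))


@dataclass
class UrlItem:
    """Hypothetical item carrying both variants; not snscrape's Item class."""
    originalUrl: str
    cleanUrl: str


def make_item(url):
    return UrlItem(originalUrl=url, cleanUrl=clean_url(url))
```

A CLI could then print `cleanUrl` by default and `originalUrl` on request. A blocklist is used here rather than an allowlist so that unknown parameters, which may affect the page, are kept until someone decides they are garbage.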