make some deduplication earlier to reduce time spent collecting m3u8

Valentin-Metz / tum_video_scraper

Download and jumpcut lecture videos from https://live.rbg.tum.de/ and https://tum.cloud.panopto.eu/


make some deduplication earlier to reduce time spent collecting m3u8 #4

Closed atticus-sullivan closed 2 years ago

atticus-sullivan commented 2 years ago

As far as I can see, deduplication currently happens only after the m3u8 links have been collected. It would be more efficient to deduplicate the video_urls retrieved from the folder view, so that no video's m3u8 is searched for more than once.

atticus-sullivan commented 2 years ago

I'm not quite sure whether the later deduplication is still necessary, so I have left it in place for now.

Valentin-Metz commented 2 years ago

If I am not mistaken, converting this to a set would erase the order in which we collected the URLs. This would make the download order arbitrary.

atticus-sullivan commented 2 years ago

Yes, that's right. To me the download order doesn't matter, as I typically try to stay up to date with the videos, so most of the time there are only one or two videos to download.

The reason I want to avoid capturing the m3u8 URLs multiple times is that, for panopto and with #6, this takes quite a while (some sleeps).

Is there any reason why the download order would be important?

Note: I also implemented some caching for this (storing a map of video_id -> download url and title), which is not quite ready for a PR. Some lectures upload many videos to panopto at once and link them only week by week on moodle, so checking for and downloading new videos takes quite some time when the m3u8 URLs of all already downloaded videos have to be captured again.
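The caching idea described above could look roughly like the following. This is only a minimal sketch and not the code from the unfinished branch: the names `load_cache`, `save_cache`, `resolve` and the `capture` callback are hypothetical, and the JSON file layout is an assumption.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Tuple

# Assumed layout: video_id -> {"url": <m3u8 url>, "title": <title>}
Cache = Dict[str, Dict[str, str]]


def load_cache(path: Path) -> Cache:
    """Read the cache file, or start with an empty cache if it does not exist."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_cache(path: Path, cache: Cache) -> None:
    """Write the cache back to disk as JSON."""
    path.write_text(json.dumps(cache, indent=2), encoding="utf-8")


def resolve(
    video_id: str,
    cache: Cache,
    capture: Callable[[str], Tuple[str, str]],
) -> Tuple[str, str]:
    """Return (url, title) for video_id, calling the slow capture step only on a cache miss.

    `capture` is a hypothetical stand-in for whatever function opens the
    video page and extracts the m3u8 URL and title.
    """
    entry = cache.get(video_id)
    if entry is None:
        url, title = capture(video_id)
        entry = {"url": url, "title": title}
        cache[video_id] = entry
    return entry["url"], entry["title"]
```

Keying the cache on video_id rather than on the title keeps it safe when several videos share a title.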

Valentin-Metz commented 2 years ago

As some lecturers have published different lectures with identical titles, we require the order to either rename the duplicate or, as we do it now, number all lectures in sequence for them to be sortable in folders.
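To illustrate why the order matters, here is a small sketch of numbering lectures in sequence. It is an assumption about the general approach, not the project's actual naming code; `numbered_filenames`, the three-digit prefix and the `.mp4` suffix are hypothetical.

```python
from typing import List, Tuple


def numbered_filenames(lectures: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Map (title, url) pairs to (filename, url) pairs, prefixing a running number.

    The number comes from the position in the list, so the list must still be
    in the order in which the lectures were collected.
    """
    width = 3  # assumed zero-padding width
    return [
        (f"{index:0{width}d}_{title}.mp4", url)
        for index, (title, url) in enumerate(lectures, start=1)
    ]
```

Two lectures with the same title then end up as, for example, `001_Lecture.mp4` and `002_Lecture.mp4`, which stay distinct and sortable in a folder.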

Note that we must not at any point rely on a filename / title for separating playlist files or skip acquiring a URL based on a filename, as they are not unique.

This is why the de-duplication currently happens after we have acquired all the .m3u8 links, as these are required to identify unique lectures.

atticus-sullivan commented 2 years ago

> Note that we must not at any point rely on a filename / title for separating playlist files or skip acquiring a URL based on a filename, as they are not unique.

But I think the video_id should be unique, right? With the set, deduplication is done based on the video_ids.

But I agree this doesn't solve the issue regarding enumerating files with the same title in the right order.

Valentin-Metz commented 2 years ago

Deduplication has been optimized in the latest release. URLs are now deduplicated in an order-preserving way before they are accessed.
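The release's code is not shown in this thread. One common way to deduplicate while keeping the first-seen order in Python is `dict.fromkeys`, sketched below; the function name `dedupe_keep_order` is hypothetical.

```python
from typing import Iterable, List


def dedupe_keep_order(urls: Iterable[str]) -> List[str]:
    """Drop repeated URLs while keeping the order of first appearance.

    Relies on dicts preserving insertion order (guaranteed since Python 3.7),
    unlike a plain set, whose iteration order is not defined.
    """
    return list(dict.fromkeys(urls))
```

Running this on the list of video URLs before any page is opened means each video is visited once, and the sequence numbers assigned later still follow the collection order.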