Eltaurus-Lt / CourseDump2022

Google Chrome extension to download Memrise courses as csv files

Downloaded courses missing content #39

Closed anki-sync closed 3 weeks ago

anki-sync commented 8 months ago

First of all, I love this extension; the batch downloader is fantastic. I was able to pull a large number of courses (almost 500) with very little effort. They were all from a single publisher, so it was just a matter of extracting all the course names from the main page.

But that's when the issues appeared. I found 10 courses lacking any content: blank CSVs, no media. This is pretty concerning, as I don't know what caused it. No errors or other messages appeared.

I went ahead and re-downloaded those 10 courses, and 9 of them came through with content. Finally, I turned off batch mode and downloaded the last one a third time, and it came through with content too.

But now I am concerned I may be missing content. If the entire content is missing, it's easy to discover. But if only some content is missing, it would be hard to know. Is there any way to go through and verify the courses? I would like to know if I ended up with everything, including all cards and all media files.

Eltaurus-Lt commented 8 months ago

Can you share the contents of your queue.txt file?

anki-sync commented 8 months ago

Sure. queue.txt is the second one I used, consisting of the 10 courses that did not download: queue.txt

queue-full.txt is the full set that I used to download everything: queue-full.txt

anki-sync commented 8 months ago

I suppose I should also mention that I did a git pull before starting, to be sure I was running the latest code. I just ran it again and it's still the latest version.

Eltaurus-Lt commented 2 months ago

The issues you were experiencing are likely due to fetch request timeouts, which is why they didn't behave consistently and became less frequent when fewer courses were queued at the same time. This is somewhat to be expected from the old script, since it tried to retrieve the data for all courses in the list simultaneously, so courses in large queues had to compete with each other for bandwidth.

It took a while, but I just finished making a new version that completely reworks the scanning part of the script. All courses in a batch are distributed between a limited number of scanning threads now, properly spreading the load over time.
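To illustrate the idea of spreading a batch over a limited number of scanning threads, here is a minimal Python sketch. It is not the extension's actual code (which runs as JavaScript in the browser); `scan_all`, `scan_course`, `max_workers` and `fake_scan` are hypothetical names chosen for this example, and the worker count of 3 is an arbitrary assumption.

```python
import asyncio


async def scan_all(course_urls, scan_course, max_workers=3):
    """Scan every course with at most `max_workers` requests in flight."""
    queue = asyncio.Queue()
    for url in course_urls:
        queue.put_nowait(url)

    results = {}

    async def worker():
        # Each worker takes the next course only after finishing its
        # previous one, so the load is spread over time.
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results[url] = await scan_course(url)

    await asyncio.gather(*(worker() for _ in range(max_workers)))
    return results


async def fake_scan(url):
    # Hypothetical stand-in for the real network request.
    await asyncio.sleep(0.01)
    return f"scanned {url}"


if __name__ == "__main__":
    urls = [f"course-{i}" for i in range(10)]
    print(asyncio.run(scan_all(urls, fake_scan, max_workers=3)))
```

With this layout, a queue of 500 courses never has more than `max_workers` requests competing for bandwidth at once, which is the property that should make timeouts less likely than when everything is fetched simultaneously.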

I suggest you try it out if you still have something to download from Memrise, or want to make a more robust backup than the previous one.

> Is there any way to go through and verify the courses? I would like to know if I ended up with everything, including all cards and all media files.

In the new version, the total count of learnable words is appended to the end of the csv filename (and also written inside info.md if metadata is downloaded). If you've started the course on Memrise, or the course is not split into levels, the counter appears as a plain number in brackets without any markings. If there is a discrepancy between the number of items the script managed to scrape and the number indicated in the metadata, the counter shows both numbers, e.g. "(123 of 456)", and the script also throws a warning during download. Memrise doesn't show the total number of words for courses that have levels and haven't been started yet, so in that case the script has no way to independently verify whether all words have been downloaded. The counter then shows a figure based on the total number of retrieved items, prefixed with "~" to indicate an estimate. The easiest way to check whether it is correct is to start the respective course on Memrise and compare against the number of words displayed for the course.
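For a large batch, the three counter forms described above could be checked with a small script rather than by eye. The Python sketch below is only an illustration: it assumes the counter sits in round brackets immediately before the `.csv` extension, e.g. `Course (123).csv`, `Course (123 of 456).csv` or `Course (~123).csv`. The exact filename layout, the `read_counter` helper and its status labels are assumptions made for this example, not something the extension provides.

```python
import re

# Assumed filename layout: "<course name> (<counter>).csv"
COUNTER_RE = re.compile(r"\((~?)(\d+)(?: of (\d+))?\)\.csv$")


def read_counter(filename):
    """Classify the item counter at the end of a csv filename."""
    match = COUNTER_RE.search(filename)
    if match is None:
        return {"status": "no counter"}
    estimate, scraped, expected = match.groups()
    if expected is not None:
        # "(123 of 456)": fewer items scraped than the metadata indicates.
        return {"status": "incomplete", "scraped": int(scraped),
                "expected": int(expected)}
    if estimate:
        # "(~123)": Memrise gave no total, so the count is only an estimate.
        return {"status": "estimate", "scraped": int(scraped)}
    # "(123)": plain number, matches the total reported by Memrise.
    return {"status": "verified", "scraped": int(scraped)}
```

Running `read_counter` over every csv in the download folder and listing the files whose status is `incomplete` or `estimate` would give a short list of courses worth re-downloading or checking by hand.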

As for media files, the total count for them is appended to the end of the _media folder's name. If everything goes right, it should be equal to the number of files inside that folder.
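The media check can be automated in the same spirit. The following Python sketch assumes that the last number in the folder name is the expected file count (for example a folder named `Course_media (12)`) and that media files sit directly inside that folder; both the naming pattern and the `check_media_folder` helper are assumptions for illustration only.

```python
import re
from pathlib import Path


def check_media_folder(folder):
    """Return (expected, actual) file counts for a _media folder.

    `expected` is the last number found in the folder name; `actual` is
    the number of files directly inside the folder. Returns None when
    the folder name contains no number.
    """
    folder = Path(folder)
    match = re.search(r"(\d+)\D*$", folder.name)
    if match is None:
        return None
    expected = int(match.group(1))
    actual = sum(1 for entry in folder.iterdir() if entry.is_file())
    return expected, actual
```

A folder where the two numbers differ is a candidate for a repeated download.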