Miserlou / SoundScrape

SoundCloud (and Bandcamp and Mixcloud) downloader in Python.
MIT License
1.42k stars 146 forks source link

Pauses between downloads (soundcloud) #101

Closed Lanchon closed 8 years ago

Lanchon commented 8 years ago

hi,

way older versions of soundscrape used to download one song right after the other. at some point in time this changed, and in my setup download pauses for a long time (tens of minutes) after downloading each song from soundcloud and before starting the next one.

is this a rate limiting behavior that soundcloud requires? or is metadata download (ie: artwork, etc) failing in my setup for some reason (eg: host-based ad blocker)? or is something else fishy in my setup?

thank you!

keywords: pause, delay, pauses, delays

Miserlou commented 8 years ago

Hm, that's interesting. I don't get that with mine. There is no deliberate delaying in the code.

Can you give the artist that you had the trouble with?

Lanchon commented 8 years ago

hi, thanks!

for instance: soundscrape -f https://soundcloud.com/anjunadeep/sets/the-anjunadeep-edition

output:

Track already downloaded: The Anjunadeep Edition 105 With James Grant
Track already downloaded: The Anjunadeep Edition 104 with Ryan Davis
Track already downloaded: The Anjunadeep Edition 103 With John Monkman
Track already downloaded: The Anjunadeep Edition 102 With Esteble
Track already downloaded: The Anjunadeep Edition 101 with Luca Bacchetti
Track already downloaded: The Anjunadeep Edition 100 (Part Five) with Daniel Curpen - Live from London
Track already downloaded: The Anjunadeep Edition 100 (Part Four) with Cubicolor - Live from London
Track already downloaded: The Anjunadeep Edition 100 (Part Three) with Jody Wisternoff - Live From London
Track already downloaded: The Anjunadeep Edition 100 (Part Two) with Dom Donnelly - Live From London
Track already downloaded: The Anjunadeep Edition 100 (Part One) with Yotto - Live from London
Track already downloaded: The Anjunadeep Edition 99 With Vincenzo
Track already downloaded: The Anjunadeep Edition 98 With Matthias Vogt
Track already downloaded: The Anjunadeep Edition 97 With Atish
Track already downloaded: The Anjunadeep Edition 96 With James Grant (Live from Anjunadeep in Miami @ Cafeina)
Track already downloaded: The Anjunadeep Edition 95 With Jody Wisternoff
Track already downloaded: The Anjunadeep Edition 94 With Hisham Zahran
Downloading: The Anjunadeep Edition 93 With Cubicolor
[################################] 57611/57611 - 00:01:03
Downloading: The Anjunadeep Edition 92 With Theo Kottis
[################################] 65907/65907 - 00:01:14
Downloading: The Anjunadeep Edition 91 With Heliotype
[################################] 63076/63076 - 00:01:09
Downloading: The Anjunadeep Edition 90 With James Grant
[################################] 113441/113441 - 00:02:04
Downloading: The Anjunadeep Edition 89 With Öona Dahl
[################################] 57080/57080 - 00:01:03
Downloading: The Anjunadeep Edition 88 With Gab Rhome
[################################] 56244/56244 - 00:01:02

there is maybe a 15 to 30 minute delay after each [################################] xxxxx/xxxxx - xx:xx:xx message and before the next Downloading: xxx message. no delay if track already downloaded. no other messages produced.

Miserlou commented 8 years ago

I cannot reproduce this, unfortunately. Those tracks all download fine, one after another.

What operating system/python version are you on? Have you tried

pip install soundscrape --upgrade

?

Lanchon commented 8 years ago

hmmm...

Linux [...] 4.4.0-21-generic #37~14.04.1-Ubuntu SMP Wed Apr 20 16:33:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

already on latest...

$ sudo pip install soundscrape --upgrade
Requirement already up-to-date: soundscrape in /usr/local/lib/python2.7/dist-packages
Cleaning up...

my /etc/host* files do not show any sign of alteration from official mint.

Lanchon commented 8 years ago

i'm not familiar with python, which one is this package running on?

$ python
python            python2.7-config  python3.4         python-config
python2           python2-config    python3.4m        
python2.7         python3           python3m  
$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Lanchon commented 8 years ago

while soundscrape is hung sleeping...

$ ps -A | grep soundscrape
19166 pts/5    00:01:17 soundscrape
$ pstree 19166
soundscrape
$ sudo netstat --all --program | grep 19166
$

no output produced.

Lanchon commented 8 years ago

more info while hung...

$ top -p 19166
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
19166 rod       20   0  164396  59436  40892 D   0.0  0.4   1:29.25 soundscrape

and i found this online "The processes blocked on IO are the ones marked as D in the status column (S column in top)."

$ ps -p 19166 -o wchan
WCHAN
balance_dirty_pages.isra.23

but balance_dirty_pages is part of the kernel, is it not? wow...

Lanchon commented 8 years ago
$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
...
 2  0 2749396 11265420 554052 567312    0    0     0     0 1552 13514  5  2 92  0  0
 0  0 2749396 11261948 555884 569104    0    0     0    10 1658 13651  7  2 91  0  0
 1  0 2749380 11257980 557748 571008    8    0     8 11454 1744 14251  7  3 90  0  0
 1  0 2749380 11254508 559628 572896    0    0     0     0 1722 14069  6  3 91  0  0
 0  0 2749380 11250788 561388 574856    0    0     0     0 1625 13081  6  3 92  0  0
 2  0 2749380 11247192 563400 576620    0    0     0  2376 1713 14667  6  2 91  0  0
 0  0 2749380 11244048 565088 578324    0    0     0     0 1503 12504  6  2 92  0  0
 1  0 2749380 11239948 566832 580080    0    0     0  2212 1667 13200  7  3 91  0  0
 0  0 2749380 11236536 568656 581824    0    0     0     0 1578 13558  6  2 92  0  0  <-- soundscrape was downloading up to this point
 0  1 2749380 11235204 569044 582544    0    0     0    10  822 8247  3  1 87 10  0  <-- soundscrape started waiting from this point onwards
 0  1 2749380 11230716 569044 582544    0    0     0     0  869 2328  4  1 84 11  0
 1  1 2749380 11230824 569044 582544    0    0     0     0  614 1263  2  1 86 12  0
 0  1 2749380 11230824 569044 582544    0    0     0     0  585 1224  2  1 86 12  0
 0  1 2749380 11230748 569048 582540    0    0     0    14  637 1354  2  1 85 12  0
 0  1 2749380 11230740 569048 582544    0    0     0     0  695 1593  2  1 86 11  0
 0  1 2749380 11230632 569048 582544    0    0     0     0  887 1751  3  1 83 13  0
 0  1 2749368 11230500 569048 582544    6    0     6    40  679 1643  4  1 83 12  0
 0  2 2749360 11208304 569112 588892    4    0  3506  1296  856 2065  3  2 84 11  0

...

the wa (Wait IO) column jumped from 0 to 10+... http://www.chileoffshore.com/en/interesting-articles/126-linux-wait-io-problem

Lanchon commented 8 years ago
$ ps auxf | grep soundscrape
rod      19166  0.6  0.4 136824 70932 pts/5    D+   13:30   1:42  |           \_ /usr/bin/python /usr/local/bin/soundscrape -f https://soundcloud.com/anjunadeep/sets/the-anjunadeep-edition

state D+: D = uninterruptable sleep, and + = in foreground process group. https://www.redhat.com/archives/redhat-install-list/2009-April/msg00007.html

this seems to be a kernel issue, what do you think?

Miserlou commented 8 years ago

Oh man.. you've gone to deep. I have no idea if any of that is related or not.

Is your system full? Can you give me a df /?

Lanchon commented 8 years ago

lol, no my drives are fine! :)

i'm sure this is a kernel issue. and because it started one day out of the blue, i'm pretty sure that if i downgrade my kernel this will stop. i'm not using the recommended linux mint kernel, i'm using one of the latest, i don't remember why. i could just try, but my up times rank in months and i hate to reboot my machine, lol

Miserlou commented 8 years ago

There used to be something called Splice that would let you change kernels without rebooting. I don't know if that still exists though.

Alternately: no gods, no masters, no uptimes.

Lanchon commented 8 years ago

what i need to do -because i KNOW (lol) downgrade will work- is to enable tracing in python so that we can see which statement is triggering the issue.

although this is not a bug in your software, maybe that statement could be changed as a workaround for this, without no important negative impact.

Lanchon commented 8 years ago

so i had a little time... let me first say that i have zero knowledge of python.

i ran this download command to produce a python trace to 'log'. i let it run through the first download, and when finished the script hung, as it always does. i pressed CTRL-C; the ^C code echoed on the screen but the process continued to be hung (because linux process state is 'uninterruptible sleep'). then waited 15 minutes for the process to end. this is the output:

$ python -m trace -t /usr/local/bin/soundscrape -f sashaofficial >log
[################################] 57298/57298 - 00:01:08
^CTraceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/trace.py", line 819, in <module>
    main()
  File "/usr/lib/python2.7/trace.py", line 807, in main
    t.runctx(code, globs, globs)
  File "/usr/lib/python2.7/trace.py", line 513, in runctx
    exec cmd in globals, locals
  File "/usr/local/bin/soundscrape", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/soundscrape/soundscrape.py", line 85, in main
    process_soundcloud(vargs)
  File "/usr/local/lib/python2.7/dist-packages/soundscrape/soundscrape.py", line 272, in process_soundcloud
    id3_extras=id3_extras)
  File "/usr/local/lib/python2.7/dist-packages/soundscrape/soundscrape.py", line 368, in download_tracks
    artwork_url=track['artwork_url'])
  File "/usr/local/lib/python2.7/dist-packages/soundscrape/soundscrape.py", line 956, in tag_file
    audio.save()
  File "/usr/local/lib/python2.7/dist-packages/mutagen/_util.py", line 98, in wrapper
    return func(self, h, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/mutagen/_file.py", line 132, in save
    return self.tags.save(filething, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/mutagen/easyid3.py", line 183, in save
    self.__id3.save(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/mutagen/_util.py", line 123, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/mutagen/_util.py", line 98, in wrapper
    return func(self, h, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/mutagen/id3/__init__.py", line 591, in save
    insert_bytes(f, new_size - old_size, old_size)
  File "/usr/local/lib/python2.7/dist-packages/mutagen/_util.py", line 511, in insert_bytes
    file_map.move(offset + size, offset, movesize)
KeyboardInterrupt

i have tried this several times and the stack trace always ends in the same file_map.move call.

the produced log is 450+ MB in size, so i did:

$ tail -n2200 log >log-tail
$ grep soundscrape log-tail 
soundscrape.py(921):             if chunk:  # filter out keep-alive new chunks
soundscrape.py(922):                 f.write(chunk)
soundscrape.py(923):                 f.flush()
soundscrape.py(920):         for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length / 1024) + 1):
soundscrape.py(921):             if chunk:  # filter out keep-alive new chunks
soundscrape.py(922):                 f.write(chunk)
soundscrape.py(923):                 f.flush()
soundscrape.py(920):         for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length / 1024) + 1):
soundscrape.py(921):             if chunk:  # filter out keep-alive new chunks
soundscrape.py(922):                 f.write(chunk)
soundscrape.py(923):                 f.flush()
soundscrape.py(920):         for chunk in progress.bar(r.iter_content(chunk_size=1024), expected_size=(total_length / 1024) + 1):
soundscrape.py(925):     return path
soundscrape.py(362):                 tag_file(path,
soundscrape.py(363):                          artist=track['user']['username'],
soundscrape.py(364):                          title=track['title'],
soundscrape.py(365):                          year=track['release_year'],
soundscrape.py(366):                          genre=track['genre'],
soundscrape.py(367):                          album=id3_extras.get('album', None),
soundscrape.py(368):                          artwork_url=track['artwork_url'])
 --- modulename: soundscrape, funcname: tag_file
soundscrape.py(943):     try:
soundscrape.py(944):         audio = EasyMP3(filename)
soundscrape.py(945):         audio.tags = None
soundscrape.py(946):         audio["artist"] = artist
soundscrape.py(947):         audio["title"] = title
soundscrape.py(948):         if year:
soundscrape.py(950):         if album:
soundscrape.py(952):         if track_number:
soundscrape.py(954):         if genre:
soundscrape.py(955):             audio["genre"] = genre
soundscrape.py(956):         audio.save()
soundscrape.py(989):     except Exception as e:
soundscrape.py(370):         except Exception as e:

the un-grepped 'log-tail' file is available here.

the issue happens in mutagen's EasyID3 library: soundscrape always hangs in the soundscrape.py(956): audio.save() line. furthermore, the problem seems to be this line in mutagen's lib.

ID3v2 tags (but not v1 tags) are stored at the beginning of the file, and when you tag an untagged file you have to 'scroll' all the file contents backwards to make space at the beginning.

i am completely guessing here, but this file_map.move call could be scrolling the complete file backwards, and for whatever reason my kernel is doing this in an extremely inefficient fashion. (it used to not be this way with older kernels.) i have an SSD and 100MB read/writes take fractions of a second, not 15+ minutes.

this looks like a problem in mutagen's ID3 library, or python, or the linux kernel. definitely not an issue of soundcrape. do you have contact with mutagen? they might be very interested in taking over this issue now.

for reference:

$ uname -a
Linux [...] 4.4.0-21-generic #37~14.04.1-Ubuntu SMP Wed Apr 20 16:33:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Lanchon commented 8 years ago

it looks like python uses C's memmove function to implement mmap's move method: https://github.com/python/cpython/blob/master/Modules/mmapmodule.c#L684

this used to work fine but for some reason became horribly inefficient on recent linux kernels (with ubuntu's C lib).

Miserlou commented 8 years ago

This is so much deeper than I ever expected to go with a SoundCloud downloader! I think having mutagen be fail-safe using temporary files would be a doubly good improvement.

Lanchon commented 8 years ago

ok, so mystery solved by the @lazka in just a few minutes, lol!

first, a relevant correction: i was downloading to a fuse-mounted NTFS partition on a rotary harddrive, not to the ext4 SSD as i previously thought i was.

this is a known regression of the 4.2+ kernels (i can confirm regression exists on 4.4) involving mmap writes on fuse-mounted partitions. i was using kernel 3.19 before, which is not affected.

Lanchon commented 8 years ago

the kernel commit fixing this issue and explaining it in depth is here: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=4bc9468f1680e799e3036a6e816ed9ecfc7d98a3