bibanon / tubeup

Use yt-dlp to download video and upload to the Internet Archive with metadata.
https://pypi.python.org/pypi/tubeup/
GNU General Public License v3.0

Tubeup constant S3 overload #82

Closed ghost closed 5 years ago

ghost commented 5 years ago

On some YouTube videos, tubeup appears to retry the upload constantly. Left to itself, this can go on seemingly indefinitely: the video is uploaded in full on every try, but then fails with:

warning: s3 is overloaded, sleeping for 30 seconds and retrying. 9001 retries left.

brandongalbraith commented 5 years ago

This is expected behavior based on Internet Archive capacity and any service disruptions.

ghost commented 5 years ago

This is expected behavior based on Internet Archive capacity and any service disruptions.

That doesn't seem to be the case. It can seemingly go on retrying indefinitely (for a week or more), even though several other tubeup uploads from the same computer can be started and completed successfully in the meantime.

vxbinaca commented 5 years ago

Your assumption is wrong. Tubeup is acting exactly as it should and not killing your upload. It's waiting in line like I designed it to.

Be patient. Excessive S3 errors should be reported to the Internet Archive, which we are not a part of.

Closing because this isn't a problem with Tubeup, but a safety feature working exactly as it should.

ghost commented 5 years ago

Every time "s3 is overloaded" appears at the end of a video upload, it restarts the upload, only to fail again at 100%. Maybe a check could be implemented? Or adding more delay than the 30 seconds for each failed retry?

"killing your upload": it does not, as you say. But it does upload the files over and over again in such situations.
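
The "more delay for each failed retry" idea is usually called exponential backoff. A minimal sketch of such a schedule follows; the function name, the doubling factor, and the one-hour cap are illustrative assumptions, not anything tubeup or internetarchive is known to implement.

```python
def backoff_delays(base_seconds=30, factor=2, cap_seconds=3600, attempts=8):
    """Return the sleep (in seconds) before each retry.

    Starts at base_seconds, multiplies by factor after every failure,
    and never exceeds cap_seconds. All defaults are hypothetical.
    """
    delays = []
    delay = base_seconds
    for _ in range(attempts):
        delays.append(min(delay, cap_seconds))
        delay *= factor
    return delays


print(backoff_delays())
# -> [30, 60, 120, 240, 480, 960, 1920, 3600]
```

With a schedule like this, a file that keeps failing is re-sent far less often than with a fixed 30-second sleep, at the cost of reacting more slowly once the service recovers.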

vxbinaca commented 5 years ago

9001 retries every 9001 seconds isn't high enough for you?

ghost commented 5 years ago

9001 retries every 9001 seconds isn't high enough for you?

warning: s3 is overloaded, sleeping for **30 seconds** and retrying. 9001 retries left.

That's 30 seconds, before another retry

vxbinaca commented 5 years ago

That's baked into internetarchive, and I don't believe it can be configured upward. Again, email IA staff about excessive S3 waits. We're not staff; we can't fix the underlying problems at IA.

vxbinaca commented 5 years ago

@DuckHP 3 days is what my cocktail napkin math says is the maximum before Tubeup fails. Did your upload fail or were you just annoyed it was taking so long?
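
The napkin math can be reproduced from the two numbers in the log line (9001 retries, 30 seconds of sleep each). This sketch counts only the time spent sleeping; it assumes every retry is used and ignores however long each upload attempt itself takes.

```python
from datetime import timedelta

RETRIES = 9001      # from the log line: "9001 retries left"
SLEEP_SECONDS = 30  # from the log line: "sleeping for 30 seconds"

# Total time spent sleeping if every single retry is consumed.
total_sleep = timedelta(seconds=RETRIES * SLEEP_SECONDS)
total_sleep_days = total_sleep.total_seconds() / 86400

print(total_sleep)                 # -> 3 days, 3:00:30
print(round(total_sleep_days, 2))  # -> 3.13
```

Any time spent re-sending data between sleeps comes on top of this figure, so a run that re-uploads a large file on each attempt could last well beyond three days.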

ghost commented 5 years ago

@DuckHP 3 days is what my cocktail napkin math says is the maximum before Tubeup fails. Did your upload fail or were you just annoyed it was taking so long?

Not so much annoyed, since I use a shell script to semi-automate/queue tubeup. The issue is the instances where it uploads a video, fails, uploads the same entire video file again after 30 seconds, and fails again, repeatedly. When a video is 1GB+, that feels like quite a lot of traffic being generated (IMO) unnecessarily.

And in my experience, 3 days is not the limit of how long such a loop of retries can go on, but I'll try running time tubeup <url> and check over the next few days.

Maybe a "busyness" check could be done beforehand with a small dummy file (e.g. an ffmpeg-trimmed snippet of the first 5 seconds of the video) to see if S3 is currently overloaded, and then replace it with the full video if/when it's not?
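
The control flow behind this suggestion can be sketched as follows. Both callables are hypothetical stand-ins supplied by the caller (for example, a probe that sends the small dummy file and reports whether it was rejected); neither is a function from tubeup or internetarchive. Whether a passing probe actually predicts that the full upload will succeed is not established anywhere in this thread.

```python
import time


def upload_when_ready(is_overloaded, upload_full,
                      sleep_seconds=30, max_checks=9001, sleep=time.sleep):
    """Poll a cheap overload check; send the large file only once it clears.

    is_overloaded: hypothetical callable returning True while the service
                   rejects uploads (e.g. a tiny dummy-file probe).
    upload_full:   hypothetical callable that performs the real upload.
    Returns True if the full upload was attempted, False if checks ran out.
    """
    for _ in range(max_checks):
        if not is_overloaded():
            upload_full()
            return True
        sleep(sleep_seconds)
    return False


# Usage with fakes: overloaded twice, then clear. No real waiting happens.
probe_results = iter([True, True, False])
sent = []
ok = upload_when_ready(lambda: next(probe_results),
                       lambda: sent.append("video.mp4"),
                       sleep=lambda seconds: None)
print(ok, sent)  # -> True ['video.mp4']
```

The point of the design is that only the cheap probe is repeated while the service is busy; the large file is sent once.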

vxbinaca commented 5 years ago

No, we don't need to add code, because it can just wait in line.

It's not failing to upload; you're just failing to wait for it to upload. It's failed when it returns you to prompt.

ghost commented 5 years ago

No, we don't need to add code, because it can just wait in line.

It's not failing to upload; you're just failing to wait for it to upload. It's failed when it returns you to prompt.

I don't sit around and wait. But, I would think that a continuous re-uploading of data of relatively large size would be considered unnecessary network traffic, poor resource usage, and an ineffective way of doing uploads.

Anyway, if it's considered a non-issue, then it's a non-issue on my part.

brandongalbraith commented 5 years ago

This is the preferred behavior per the Internet Archive, as that’s how their internetarchive module (which tubeup relies on) is configured. You might have better luck engaging Internet Archive staff directly to discuss their backoff and retry preferences if you believe there’s room for improvement.

vxbinaca commented 5 years ago

I don't sit around and wait.

You're going to have to start. The alternative is I remove the waits, the upload fails silently and it fills your disk with failed uploads. Then you would rightfully be making issues about a problem in Tubeup.

I saw all this years ago and planned accordingly. I stabilized downloads and uploads to get rid of problems you don't see anymore in Tubeup like downloads snapping off or upload failure that returns to prompt.

But, I would think that a continuous re-uploading of data of relatively large size would be considered unnecessary network traffic, poor resource usage, and an ineffective way of doing uploads.

You would be wrong. Preservation is the goal, not network efficiency; therefore it was configured to sit in line and wait its turn. I can't raise the waits further; IIRC it breaks Tubeup. I tried. 9001 is cute because it's a reference to the DBZ meme.

If you don't want to impact your home internet connection then rent a Linux VPS. That's how I do my constant ripping.

In the future I will be closing s3 complaint issues on sight.

prof-frink commented 5 years ago

Just a thought, seeing as I was having a similar issue not too long ago. In the end, I waited over 3 days and then killed it since I recall you saying that it shouldn't take any longer than that before it returns to prompt, yet it never did return in my case (for some mysterious reason).

At any rate, I decided to try using my other account (my archivist account) to see if it made any difference. Lo and behold, I've since uploaded almost 2000 videos with nary an issue. The only time I ever had an s3 issue, it was like wertercatt's issue #37, where it just stated:

s3 is overloaded, sleeping for 30 seconds and retrying. 9001 retries left.
s3 is overloaded, sleeping for 30 seconds and retrying. 9001 retries left.
s3 is overloaded, sleeping for 30 seconds and retrying. 9001 retries left.

over and over without the "uploading [x]" part (i.e., it never tried to upload anything). After three or four lines of that, it uploaded without issue. (Note, in DuckHP's case, it was uploading the video file over and over; for me, it was the annotations file. This appears to be because the file it chooses to upload first appears to be random for each setup. For instance, on my VPS, it always tries to upload the files as annotations/jpg/description/video/info.json, whereas on my home setup, it's annotations/description/info.json/jpg/video. If I destroy my VPS image and create a new one, the order changes again.)

I also have two other 'member' level accounts, and on all three member-level accounts, uploading is hit or miss, but on my archivist account, it works just fine. I noticed that your account, @vxbinaca, is an archivist account, and it appears that wertercatt's (powerKitten [@powerkitten]) is also an archivist account. DuckHP's (which I'm guessing is ola_norsk [@duckhuntpr0]), is not.

Could this just be pure coincidence? Sure, but I'm beginning to think it may not be. Of course, I have no idea what this would have to do with your tool per se (probably nothing), but DuckHP may want to try having a collection made at the IA (thus becoming an archivist) and see if that makes any difference.

Sorry for the lengthy post!

ghost commented 5 years ago

@vxbinaca @prof-frink For what it's worth, here's what I captured of a tubeup run [finally] failing after constant S3 overload (not the full-length buffer, unfortunately). It's from my $ time tubeup <url> test, as per https://github.com/bibanon/tubeup/issues/82#issuecomment-447577125

http://paste.ubuntu.com/p/hq3njm5kHY/

However, during the timeframe of that test, note that i've successfully tubeup'ed numerous videos with no problem whatsoever.

ghost commented 5 years ago

but DuckHP may want to try having a collection made at the IA (thus becoming an archivist) and see if that makes any difference.

@prof-frink I'll try to look into that. Thanks for the tip. I'd dread to realize there's any sort of non-disclosed 'embargo' against certain (fully legal) content, certain topics or certain authors etc. etc. going on over at IA

vxbinaca commented 5 years ago

I don't think it's account-based but that's plausible. That's a question you should be asking Archive.org not us.

On my home machine and VPS it uploads json first, then thumbnail, then video, then XML. But the order doesn't matter only that it uploads.

I don't hate to burst you two's bubble, but if IA is discriminating against your uploads then it'd be much more overt. Like getting your account locked - which has happened to me twice - or getting the entire creator darked, or a stern email like I've also gotten a few times. Why do something contrived like mask it as a network error on one particular video?

The Internet Archive does not discriminate against (for example) right-wing content. Period. It contains the majority of the works of FBI informant and Vice-interviewed idiot Christopher Cantwell. There are numerous grabs of Stormfront in the Wayback Machine. I manage a rather timely mirror of Sargon Of Akkad (who isn't right-wing at all). I really resent it when people accuse IA of censoring endangered content based on politics.

One more thing: This entire issue is based on leaps in logic and filling in the blanks that a few of you are doing. Perhaps base your questions on observable evidence and use logic.

ghost commented 5 years ago

But the order doesn't matter only that it uploads.

@vxbinaca In the case of ~1GB+ data/video files being continuously re-uploaded with a 30s delay between each re-upload (in some cases for 3+ days, and at times needlessly and futilely if it eventually fails after 3 days), I'd argue that it is indeed a matter that should be looked into.

And i'm absolutely sure you'll agree, that it's an extremely ineffective way of going about the problem.

vxbinaca commented 5 years ago

It's not about files being over 1 gig; I've had 30 gigabyte video uploads go up smoothly. So that's not actually what happens. What happens is a chunk fails and it attempts to re-upload the chunk.

The alternative is to remove the safety features, let it fail, and not preserve the video. Take it up with IA if one particular item is having an issue. The script already works the best it can; it's the service (Archive.org) that's having an issue.

Take up S3 failures with IA. Not us. Tubeup cannot change to fix your problem and will not because it's already configured pretty well.

Take it up with Archive.org. There is no fix we can offer. I'm repeating myself here over and over and I'm tired of it.

ghost commented 5 years ago

Tubeup cannot change to fix your problem and will not because it's already configured pretty well.

No offence intended, and i think tubeup is awesome; But, would e.g implementing preliminary dummy-files, instead of pumping out gigabytes of bandwidth, be an impossibility?

Anywho, I consider this Issue settled, for now. I'll fix it my damn self, if even possible..

It's opensource after all.. ;) God Jul :beers: :christmas_tree:

vxbinaca commented 5 years ago

Your suggestion entails making Tubeup more complex than it needs to be. I'm actually trying to go through with @brandongalbraith and reduce complexity.

Re-run your problem rip later after rebooting your VPS, or hash out what the problem is with IA.

In the meantime, this document may aid in your search for the problem

vxbinaca commented 5 years ago

But PR away. Do understand that a PR isn't a substitute for hashing out what is going on with IA with their staff, and please, for my own edification, I'd appreciate a summary of what the hell is actually happening on IA's end. I'd really like to know why a few edge cases are having trouble with particular items uploading.

ghost commented 5 years ago

But PR away. Do understand that a PR isn't a substitute for hashing out what is going on with IA with their staff, and please, for my own edification, I'd appreciate a summary of what the hell is actually happening on IA's end.

I seem to have lost the thread of all this. I don't know what PR means in this regard, nor do I have any insight into what is going on with IA with their staff.

If it was directed to @prof-frink, maybe an additional ticket would be more suitable. In any case, I feel I've gotten my initial issue answered to an extent I'm fairly comfortable with.

Happy holidays! :smile: :beer:

vxbinaca commented 5 years ago

PR = pull request, a contribution of code.

prof-frink commented 5 years ago

I don't hate to burst you two's bubble, but if IA is discriminating against your uploads then it'd be much more overt.

Lol, you ain't bursting my bubble. Like I said, I've been able to upload videos since then without issue, regardless of content. But yes, in the beginning, I thought they may have been throttling my account based on what I was uploading. After all, they were making some of my items login-only, so it only seemed natural to assume that maybe they were censoring me in some other way. When you're in this business of preserving censored material, you naturally tend to become a bit paranoid, I think.

I really resent it when people accuse IA of censoring endangered content based on politics.

See my above comment on some of my items being made login-only (which is most definitely a form of censorship). But blocking access outright? No, so far I haven't seen evidence of that, with the exception of the William Luther Pierce audio collection, which was only removed based on a copyright claim by his estate. (I think the IA should have put up a bit of a fight, like they usually do in such cases, or made them locked but preserved, rather than deleting them outright, but eh, I'll give them a pass for that since at least there was a legitimate reason for removal.) So that makes them better than any of these so-called 'free-speech' platforms like Bitchute, I'll admit.

One more thing: This entire issue is based on leaps in logic and filling in the blanks that a few of you are doing. Perhaps base your questions on observable evidence and use logic.

I simply made an observation. Like I said before, I'm not a programmer, so I don't know the ins and outs of all of this. I just thought this might help you or the individual in question to pinpoint the cause of the problem since it keeps reoccurring. No need to get 'short,' in your own words. I support you and your efforts, I really do, and I hope that someday (once I teach myself some Python coding), I can actually be of some assistance.

Oh yes, and Merry Christmas!

prof-frink commented 5 years ago

I'll also add that I've had content be blocked from uploading to the IA and accused of being 'spam,' but if I remove certain words from the metadata (particularly the 'J' word), it uploads just fine and then you can add whatever metadata you want. Don't want to burst your bubble, but the IA isn't some censorship-free utopia. It IS, however, still the most censorship-free place on the live web to post things, which is why I still support them wholeheartedly (for example, donating money to them), even if I'm occasionally critical of them.

antonizoon commented 5 years ago

In the end, the fact is that on the modern cloud, despite all the freewheeling made possible by advertising patronage and what has been considered dereliction of duty, storage is not unlimited. Storage providers have the legal right to remove data at will rather than keep it, the patrons as stakeholders reserve the right to ensure their funds do not go to those they are against, and we really have (or, if you look at it honestly, deserve) very little say in it, because we didn't pay a cent for the bandwidth and storage we are consuming. The law is on their side on this; as the ACLU states, "The First Amendment does not protect freedom when other actors (firms, people, organizations) are the sources of restrictions." The question of morality is another factor, but property rights essentially trump free speech here, as they are not public spaces or government entities that you have a stake in through taxes.

I will tell you like it is. The only way you are going to have full say over what is shared on the internet is if you pay for or self-host every cent of bandwidth and storage used. Whether this is with capital paid to server providers or with capital goods such as servers, hard drives, or p2p IPFS nodes, you will have to start running your own large data storage nodes, like I have done.

You may have seen the clearest case of this in YouTube's removals, but this is growing to be the case across the internet, which has started to become "Too Big to Fail". This in turn has impacted the Internet Archive's ability to continue to archive, because if you just do the math there is no way a significant fraction of the content of a single modern web service can fit in there. Multiple times I have seen uploaders like DKL3 upload junk that just takes up space, such as thousands of hours of Chinese sanic video game livestreams, without thought as to its notability or how it misallocates resources from other areas. This is what they generally consider to be spam and not worthy of archival. But there are other factors, such as legal requirements which at least restrict what can be displayed on the Wayback Machine, as well as restrictions by those who fund the organization and allow it to operate in their borders. For example, corporations consistently demand DMCA removal of content on the Internet Archive, and if they are verifiable rightsholders there is nothing the Archive can do to refuse without jeopardizing the operation of the rest of the site. So in many cases the people who made the videos themselves want the videos blocked from there, or at least their patron corporation does, primarily because they will not accumulate advertising revenue on there. Many times these requests are also kept confidential due to legal requirements, so we are never going to get a full explanation, nor do we have any right to know.

This part is a big deal and no coincidence at all, primarily because most of the content being reuploaded was already removed from YouTube for such reason, so the powers that be are not pleased to see it pop up again and send the same request to the Internet Archive for removal.

(As a website operator myself, when you get these DMCA, COPPA, national security letters, or other types of legal requests you are threatened with lawsuits or shutdown or civil penalties if you fail to comply, so negligence is not an option, it has to be done to save the rest of the site and yourself. And while the discretion of the patrons or the bandwidth providers may not be a legal matter it is a practical matter of losing that funding...)

In the case of DKL3 he was forced to quit and he has since begun using his own resources to archive his eclectic and honestly incomprehensible content. Will the stuff he saved be valued far in the future? We cannot claim to know, but Archival is always going to be about curation, and it will be influenced by the requirements of the patrons that paid, not by the public who freeload.

You are going to have to take up the financial and management burden yourself and not depend on others to store your content if you believe that the powers that be are not going to keep your content there. The IPFS protocol, now enhanced with Infura and Cloudflare HTTP entry nodes, allows you and your friends to "seed" web content hashes as if they were torrents (meaning people have to keep seeding for them to stay). Property rights are very secure in the US with few exceptions, and that is the sort of thing where free speech applies, so be one like I am.

prof-frink commented 5 years ago

Wasn't aware of DKL3's case (though I noticed he hadn't uploaded anything lately), but in my case, it was just short videos or documents being uploaded using the HTML5 uploader where the description or subject might contain the word 'j-w' or one of its derivatives (not slurs like h-, k-, etc.), and it would be blocked as spam, but if I changed it to 'puppies, kittens, love', it uploaded just fine and then I could change the metadata afterwards. This happened several times.

But you're right about the censorship issues (though for some reason I thought I had heard that the IA was not subject to DMCA takedowns, but perhaps I misunderstood the matter), which is why I'm not sure if the IA is necessarily a permanent solution for archival of this material. But I regard it as at least a semi-permanent one, and hopefully increasing storage capacity will render local storage of all this material possible by the time this becomes an issue (or if it becomes one). At least that's my hope. In the meantime, I have to be rather selective in what I can store locally. I could well be wrong, but I think it may still be a while before some sort of blockchain solution becomes truly feasible, at least on a scale necessary to handle all of this material (but honestly, I'm not really up on all of that).

vxbinaca commented 5 years ago

Librarians curate shit all the time. I trash items out of "ytpmv-mad" because they're pointless letsplays or DDR sessions or not YTPMV. This is normal. I move things around or add collection tags as needed.

@DuckHP have you emailed staff yet? Answer yes or no. If yes, reply with a summary of what they said. Any answer from anyone else will result in a lock of this thread and auto close/lock of any other S3 threads. Any answer from @DuckHP that's not "yes" with a summary of the exchange will result in a close/lock with close/locks of any future S3 error issues.

So let's hear what staff said.

Edit: an acceptable answer is "I emailed staff and am waiting for a reply".

rudolphos commented 5 years ago

I've been having the same problems. There are certain videos and channels that can't be uploaded; it seems to be mainly videos that are over 200 MB. It's very rare that it works out and finally uploads. There are a few videos that I've been trying to upload for a few months now. They're just piling up (100 GB now), and when I'm unable to upload I just cancel and try again a week later.

EDIT: even ~40 MB video is impossible to upload.

vxbinaca commented 5 years ago

But I just uploaded a multiple gigabyte stream today.

Congratulations @rudolphos you just got all further S3 timeout issues automatically closed forever since no one does what I tell them to. There will be no discussions on them. There will be no debate. I've time and time again said what the problem and path to a solution is and no one follows it. So that's it.

On top of that, Rudolphos, you're on my shit list for issue submissions, because you try to mask your items and new account from IA staff, and telling us there's a problem without letting us examine and reproduce it wastes our time. I don't care what you do or if you evade bans, but you and the rest of your friends who hang out in chat, who are all coincidentally in this thread with your 100 percent incorrect observations, are wasting my time.

So that's it.