PRX / www.radiotopia.fm

Radiotopia website
https://www.radiotopia.fm
Repository size #1

Closed farski closed 9 years ago

farski commented 9 years ago

This repo has grown to over 1 GB, which is a little crazy considering the site itself is ~10 MB and hardly changes. I think it's worth figuring out a better way to handle the audio files. I had originally checked in some that were edited, website-specific versions; that was probably a mistake to begin with. More files have been checked in over time, though, and at this point even if they are just for the website we should find somewhere else to manage them.

@chrisrhoden @debenedictis any thoughts?

debenedictis commented 9 years ago

I can update the audio links to point to the versions in the published feeds. Then I can remove the audio from radiotopia.fm.

Would that help?

debenedictis commented 9 years ago

Most of the audio is already coming from our CDN. The exceptions are the two podcasts that don't use our CDN, Love+Radio and The Truth. I can remove about 500MB of the audio (leaving about 70MB of audio).

@farski Is that what you would like?

farski commented 9 years ago

Most of the audio in the site right now already seems to be coming right from show-specific CDN URLs, which is fine. It wouldn't even be bad if the files were all coming from our S3 like they were originally. The location of the files doesn't really matter, as long as they aren't getting checked into git.

Deleting the files at this point wouldn't help the repo size since all the history would obviously still be there. The data would need to get removed from the git history, which isn't something I'm too familiar with. https://help.github.com/articles/remove-sensitive-data/

I would say if we're going to do the work to fix this there should be no audio left in the repo. It's fine to leave it on S3, but it should be managed outside of git.
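To see why deleting files in a new commit does not shrink the repo, it can help to list the largest blobs that are still reachable from history. The following is a minimal sketch, not something anyone in this thread ran; the function name `largest_blobs` is made up for illustration, and it assumes it is run from inside the repository.

```shell
#!/bin/sh
# List the N largest blobs reachable from any ref (size in bytes, then path).
# Files deleted in later commits still show up, because history keeps them.
# Hypothetical helper name; N defaults to 10.
largest_blobs() {
  n="${1:-10}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |   # keep blobs only, drop the type column
    sort -k1,1nr |           # largest first
    head -n "$n"
}
```

Running `largest_blobs 20` alongside `git count-objects -vH` (which reports the on-disk size of the object database) should show whether audio is what accounts for the bulk of the repo.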

debenedictis commented 9 years ago

And why remove it from the repo? Do we pay for that storage? If so, how much per GB/year?

cqr commented 9 years ago

It is more about not managing them in the repo at all. Git doesn't care if you remove the files because it needs to keep them around in the history.

We can prune them from the repo but it would mean as well that we need a solid way to manage these sorts of things going forward.

We do not really pay for storage (github has a soft cap) but I do know from insider information that 1G is the upper bound of what they like to host. There are practical git reasons for this (git seems to hit a wall in terms of performance at about this point) but also the fact that GitHub then needs to serve 1G of bandwidth every time there is a new checkout.

This is all part of a larger conversation we need to have about where to keep resources of this kind in general. I honestly do not know of any projects or services that exist to solve this problem, and if there isn't one then we will need to come up with something manual (like just keeping track of CDN URLs and manually uploading the things that aren't already there).

Thoughts?

farski commented 9 years ago

Mostly a matter of convenience, but also convention and the fact that github really doesn't like it. As it is now, if someone wanted to check out the repo to fix a typo or something they'd need to download over a gig of data. By comparison, prx.org is a huge app and has seven years' worth of commits and is only ~600 MB. It's not really common for large media files to get checked in, and github can start rejecting them: https://help.github.com/articles/working-with-large-files/

debenedictis commented 9 years ago

I've put "*.mp3" in my .gitignore file; but that will only stop the problem from recurring. Removing the files from the history is tricky. So far this is the best article I've found on that:

git: forever remove files or folders from history
http://dound.com/2009/04/git-forever-remove-files-or-folders-from-history/

As you may have surmised, I only know the most basic git commands/conventions. I can put together a plan for removing the mp3 files from the history, but I would need someone to review it.
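One detail worth noting about the .gitignore change: an ignore rule has no effect on files that are already tracked, so they also need to be removed from the index. A rough sketch of both steps follows, assuming it is run from the repository root; `ignore_mp3` is a hypothetical helper name, not an existing command.

```shell
#!/bin/sh
# Ignore mp3 files going forward and stop tracking the ones already committed.
# Hypothetical helper; run from the repository root.
ignore_mp3() {
  # Append the ignore rule.
  printf '%s\n' '*.mp3' >> .gitignore
  # Untrack existing mp3 files. --cached leaves the copies on disk alone,
  # and --ignore-unmatch keeps this from failing when none are tracked.
  git rm -r -q --cached --ignore-unmatch -- '*.mp3'
  git add .gitignore
}
```

After committing the result, new checkouts no longer contain the mp3 files, although every earlier commit still does.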

farski commented 9 years ago

Fixing the size, either by pruning the files from the existing repo or just creating a new repo (I put the site in git originally because it's the easiest way we have to centrally own files, not because the history is particularly useful to us), should be pretty easy.

As Chris mentioned the thing we need to figure out is how to manage these files going forward. If we can eliminate the need to host any just for the site and can switch them all to external sources that seems like an easy solution. If we know that at some point (now or in the future) there will be audio files that exist just for the site then...I don't know. We either could just have some very strong policy (copies on S3 and the office projects NAS) or investigate something else.

It's hard to imagine these files ever being something we (PRX proper) are producing, so they really shouldn't ever be something we are solely responsible for anyway.

debenedictis commented 9 years ago

@farski This may seem unrelated, but it is not. Can I change autoLoad to false in main.js?

One reason I do not grab the audio from the exact URL in the feed is that I do not want the autoLoad for all the audio on the page to artificially inflate the stats. If I can turn off autoLoad then I have no concern that using the audio URLs from the feed will unduly inflate the stats.

farski commented 9 years ago

I don't have much say over that. They are autoloaded so people don't have to wait, which seems like a good user experience. It's also why all the files were originally hosted by us; we could guarantee performance without impacting the numbers.

farski commented 9 years ago

As far as managing audio files, they can either stay where they are in the radiotopia bucket, and the bucket and repo would just stay out of sync on purpose, or we could move the files to another bucket keeping the radiotopia bucket in sync with the repo, but meaning there are two places to think about for this site.

I would lean towards the latter, but I really don't touch this property anymore, so I think you can make the call. As long as it's documented in this repo's Readme we should be fine.

kookster commented 9 years ago

Agreed with @farski on S3 storage. There's probably less chance of the files getting wiped out if they are under media, and perhaps an easier workflow if they are under the same radiotopia bucket. Hard to say which is better, so I leave it to you, @debenedictis, since you are managing it (though I can make the choice if that helps).

I really have no idea how much this would inflate metrics anyway, but at least for podtrac files, you can get the url podtrac is redirecting to instead of the podtrac url, and avoid inflation.

Like @farski I err on the side of better user experience, and would prefer we host as little as possible, so to me that means autoload should be on, and when possible we should use files from the URLs where they are already hosted (without podtrac redirects).
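As an illustration of getting the underlying URL, here is a small sketch that assumes podtrac URLs embed the target after a `/redirect.mp3/` path segment (for example `http://www.podtrac.com/pts/redirect.mp3/<host>/<path>`). That URL shape and the helper name `strip_podtrac` are assumptions, not something confirmed in this thread.

```shell
#!/bin/sh
# Print the URL a podtrac link points at, without making any request.
# Assumes the target host and path follow "/redirect.mp3/" in the URL;
# anything that doesn't match is printed unchanged.
strip_podtrac() {
  case "$1" in
    *"/redirect.mp3/"*) printf 'http://%s\n' "${1#*/redirect.mp3/}" ;;
    *) printf '%s\n' "$1" ;;
  esac
}
```

If the URL shape turns out to differ, following the redirect with an HTTP client and recording the final location would be the more robust option.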

debenedictis commented 9 years ago

I think that for most cases we can just reference the audio with the URL from the feed. All of the requests I've had to update the audio on Radiotopia.fm have been to do so with specific episodes from existing feeds.

I'll test turning off autoLoad to determine how much the user experience degrades.

debenedictis commented 9 years ago

@kookster I can remove all the audio files from Radiotopia.fm and just use the feed URLs (without the Podtrac tracking). This is a problem for The Truth and Love+Radio since they use a stats system where they get credit even if I bypass Podtrac. I can work around that though.

One reason to turn off autoLoad, though, is so that listens that do occur on Radiotopia.fm can be credited to the shows that are listened to.

kookster commented 9 years ago

I realize the soundcloud stats will be affected; I think that is a fine trade-off for a better experience, which I believe will be there when autoLoad is on.

debenedictis commented 9 years ago

@farski I am thinking of removing the mp3 files with the BFG Repo-Cleaner: http://rtyley.github.io/bfg-repo-cleaner/

Do you have any questions or concerns regarding that?
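For anyone reviewing the plan, a BFG run might look roughly like the sketch below. It is untested against this repo; the helper name, the jar path, and the `repo-mirror.git` / `repo-backup.git` directory names are all placeholders. By default BFG leaves the contents of the latest commit alone, so the mp3 files would need to be deleted in an ordinary commit first.

```shell
#!/bin/sh
# Hypothetical helper: $1 = path to the BFG jar, $2 = repository URL.
# Works on a fresh mirror clone and keeps an untouched copy as a backup.
bfg_strip_mp3() {
  jar="$1"
  url="$2"
  git clone --mirror "$url" repo-mirror.git &&
  cp -R repo-mirror.git repo-backup.git &&
  # Remove every file matching *.mp3 from all commits except the latest.
  java -jar "$jar" --delete-files '*.mp3' repo-mirror.git &&
  (
    cd repo-mirror.git &&
    # Expire old references and repack so the space is actually reclaimed.
    git reflog expire --expire=now --all &&
    git gc --prune=now --aggressive
  )
  # After reviewing the result: (cd repo-mirror.git && git push)
}
```

Because every commit ID changes, existing clones would need to be re-cloned rather than pulled.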

farski commented 9 years ago

I'm not familiar with it. As long as you have a backup of the current repo there's not really any risk, though.

cqr commented 9 years ago

I know how to use filter-branch pretty well, so if this doesn't work (though it looks perfect for the job) I can take care of this very quickly.

debenedictis commented 9 years ago

@farski @chrisrhoden thank you

kookster commented 9 years ago

@debenedictis I know that popup archive used that same tool for a similar purpose, so I believe it should work.

farski commented 9 years ago

I ran git filter-branch --tree-filter 'rm -rf audio' HEAD and it seemed to have the desired effect. audio/ will be ignored going forward, so this should be all set.
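For reference, a fuller variant of that rewrite is sketched below. It is not the exact command that was run: it uses an index filter (usually faster than a tree filter), covers every branch and tag rather than only HEAD, and then drops the backup refs that filter-branch leaves behind so the old objects can be garbage collected. The function name `purge_audio_dir` is made up, and it assumes a clean working tree and a backup of the repo.

```shell
#!/bin/sh
# Hypothetical helper: remove the audio/ directory from all of history.
# Run from the repository root with a clean working tree.
purge_audio_dir() {
  # Rewrite every ref, dropping audio/ from each commit's index.
  git filter-branch --force --prune-empty \
    --index-filter 'git rm -r -q --cached --ignore-unmatch audio' \
    --tag-name-filter cat -- --all &&
  # filter-branch keeps the pre-rewrite commits under refs/original/.
  git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do git update-ref -d "$ref"; done &&
  # Expire the reflog and repack so the old blobs are really deleted.
  git reflog expire --expire=now --all &&
  git gc --prune=now --quiet
}
```

The rewritten history then has to be force-pushed, and anyone with an existing clone should re-clone, since a normal pull would merge the old commits back in.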