academictorrents / academictorrents-docs

https://academictorrents.com/docs

Longterm Usage & Costs Statistics #31

Open sneakers-the-rat opened 2 years ago

sneakers-the-rat commented 2 years ago

Hello academictorrents! love what y'all do. Can't tell if this is the right place to ask, but thought i'd give it a shot.

Writing about using p2p for scientific data, and I wonder if you have any longterm download statistics as well as estimates for the costs of maintaining the site? I see from your 990 form that you have <$50k annual revenue, but wanted to know if you could give me any more specific numbers :)

ieee8023 commented 2 years ago

> Can't tell if this is the right place to ask

It is the right place!

> I wonder if you have any longterm download statistics

There is a live "Site Statistics" table on this page: https://academictorrents.com/give/ I can run some queries if you have specific stats you want. The total data ever downloaded has been 13.70PB.

> estimates for the costs of maintaining the site

There was once a cost around $300/year for web hosting but the server is now donated by a university (the OSU OSL). The file hosting that we manage is donated by various seedbox companies, an ISP, and a university (logos on home page). If my spreadsheets are right we have 51.2TB across the donated servers and some have upload caps at 15TB. You may be interested in this feed to monitor some of the seedboxes (the ones that run Transmission) https://academictorrents.com/stats/seedboxes.json which provides upload speeds every hour. I haven't done the math to compare with AWS to see what it is all worth. But I think what we manage is small compared to the community hosted data.

The main cost is development, maintenance, and optimization work, which I do for free. Another time burden is the fake DMCA requests that get sent to Google: I constantly have to submit counter notifications manually through a Google web form (people just spam every site that matches *torrents.com). In terms of revenue, there are ads that earn about $10 a day, and those funds are saved in the non-profit to ensure that the site operates for the next 100+ years.
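For anyone who wants to do the AWS comparison mentioned above, here is a rough sketch of the arithmetic. The 13.70PB total and the $10/day ad figure come from this thread; the per-GB price is a placeholder assumption, not a quoted AWS rate, and real pricing is tiered, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-the-envelope egress cost for the data served so far.
TOTAL_DOWNLOADED_PB = 13.70        # from the thread
ASSUMED_PRICE_PER_GB_USD = 0.05    # hypothetical placeholder, NOT a quoted rate


def egress_cost_usd(petabytes: float, price_per_gb: float) -> float:
    """Cost of transferring `petabytes` at a flat per-GB price (decimal units)."""
    gigabytes = petabytes * 1_000_000  # 1 PB = 10^6 GB
    return gigabytes * price_per_gb


def annual_ad_revenue_usd(per_day: float = 10.0) -> float:
    """Annualize the roughly $10/day ad revenue mentioned in the thread."""
    return per_day * 365


if __name__ == "__main__":
    cost = egress_cost_usd(TOTAL_DOWNLOADED_PB, ASSUMED_PRICE_PER_GB_USD)
    print(f"Estimated egress cost at the assumed rate: ${cost:,.0f}")
    print(f"Ad revenue per year: ${annual_ad_revenue_usd():,.0f}")
```

Swap in a current per-GB figure for `ASSUMED_PRICE_PER_GB_USD` before citing any number.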

sneakers-the-rat commented 2 years ago

that's extremely helpful, thanks. I'll get back to you with some comparison estimates of AWS costs. good lord why do we think we need to use AWS.

I think the stats that come to mind would be # of downloads and data transferred aggregated by something even as crude as by year, as I imagine it has changed over the lifetime of the tracker.

It also would be nice to know the spread of bandwidth and seeds across torrents: e.g. is bandwidth concentrated on a few popular torrents vs spread more evenly. that, and longevity: like how many torrents die because they lose all their seeds. Is there a kind way for me to scrape that? don't want to ask you to do the work of getting that data if it's not easy to access, but also don't want to just hammer the site doing an HTTP scrape if that would be disruptive.

ieee8023 commented 2 years ago

Here are the download stats per day. This is tracked based on a client reporting "completed" and then the file is assumed fully downloaded. But clients sometimes report this if only downloading a subset and sometimes don't report at all.

torrent_daily_stats.csv
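A minimal sketch of rolling the daily counts up by year, as asked for earlier in the thread. The column names `date` and `completed`, and the ISO date format, are assumptions; the actual header of torrent_daily_stats.csv is not shown here, so adjust them to match the file.

```python
import csv
import io
from collections import defaultdict


def downloads_per_year(csv_text: str,
                       date_col: str = "date",
                       count_col: str = "completed") -> dict:
    """Sum daily 'completed' counts into per-year totals.

    Column names are hypothetical; dates are assumed to start with
    a four-digit year (e.g. 2020-01-31).
    """
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = row[date_col][:4]
        totals[year] += int(row[count_col])
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    with open("torrent_daily_stats.csv", newline="") as f:
        for year, total in downloads_per_year(f.read()).items():
            print(year, total)
```

Since clients sometimes report "completed" for partial downloads or not at all, the yearly totals inherit the same uncertainty.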

And here is a query with downloads per torrent, size, and the current seeders and leechers. This is also tracked based on a client reporting "completed".

times_completed.csv
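One way to look at the concentration and longevity questions with this file is sketched below. The column names `times_completed` and `seeders` are assumptions about the CSV header, and "zero current seeders" is only a rough proxy for a dead torrent, since it is a snapshot rather than a history.

```python
import csv
import io


def top_share(counts, fraction=0.1):
    """Share of all downloads captured by the top `fraction` of torrents."""
    ranked = sorted(counts, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    k = max(1, int(len(ranked) * fraction))  # always include at least one torrent
    return sum(ranked[:k]) / total


def summarize(csv_text, count_col="times_completed", seeders_col="seeders"):
    """Concentration and zero-seeder counts; column names are hypothetical."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = [int(r[count_col]) for r in rows]
    dead = sum(1 for r in rows if int(r[seeders_col]) == 0)
    return {
        "torrents": len(rows),
        "top_10pct_share": top_share(counts, 0.1),
        "zero_seeder_torrents": dead,
    }


if __name__ == "__main__":
    with open("times_completed.csv", newline="") as f:
        print(summarize(f.read()))
```

A `top_10pct_share` near 1.0 would mean a few popular torrents account for almost all completed downloads; a value near 0.1 would mean downloads are spread evenly.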

sneakers-the-rat commented 2 years ago

you're the best, thanks. anything I should read or cite aside from the paper? writing about p2p and other digital infra for science, and I imagine you've got some stories and a point of view or two :p

ieee8023 commented 2 years ago

Here are some papers that may be relevant:

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010071 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3314880/ http://www.sciencedirect.com/science/article/pii/S0140366410002148

And here is a presentation about AT from 2018 which may give some context: https://docs.google.com/presentation/d/140qqf-xR4wFXAYdANewU6GPidpG2lO1jpyymid-H5KU/edit

sneakers-the-rat commented 2 years ago

Thank you, love the presentation, I had made an interactive version of the slides 8-16 here: https://jon-e.net/infrastructure-presentation/?slideIndex=9&stepIndex=1 (click on the various nodes to simulate a download)

I hadn't seen the python package, going to add that to this paper because it's definitely relevant to integrating data storage/sharing with experimental and analysis tools

ieee8023 commented 2 years ago

Looks great!

Ya the python package has some issues with large files though and it is not actively maintained right now. A benefit of using BitTorrent is that there are plenty of high performance download clients like aria2c which can handle TB scale datasets with ease.
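As a rough illustration of driving aria2c from an analysis script, here is a small wrapper. It assumes aria2c is installed and on the PATH; `dataset.torrent` and the `data` directory are hypothetical names, and the wrapper itself is not part of any Academic Torrents tooling.

```python
import shutil
import subprocess


def aria2c_command(torrent: str, out_dir: str = ".", seed_minutes: int = 0) -> list:
    """Build an aria2c invocation for a .torrent file or magnet link.

    --dir sets the download directory; --seed-time is how many minutes
    to keep seeding after the download completes.
    """
    return [
        "aria2c",
        f"--dir={out_dir}",
        f"--seed-time={seed_minutes}",
        torrent,
    ]


def download(torrent: str, out_dir: str = ".", seed_minutes: int = 0) -> None:
    """Run aria2c and raise if it is missing or exits with an error."""
    if shutil.which("aria2c") is None:
        raise RuntimeError("aria2c not found on PATH")
    subprocess.run(aria2c_command(torrent, out_dir, seed_minutes), check=True)


if __name__ == "__main__":
    # Hypothetical file name; replace with a real .torrent path or magnet link.
    download("dataset.torrent", out_dir="data", seed_minutes=60)
```

Setting `seed_minutes` above zero keeps the client seeding for a while after the download finishes, which helps the swarm.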