WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

The Cultural Broadcasting Archive #1604

Open zackkrida opened 2 years ago

zackkrida commented 2 years ago

The Cultural Broadcasting Archive of Austria reached out to us about inclusion in Openverse. We're waiting for more technical detail but here's what they've provided so far.

Source Site

https://cba.fro.at/explore

Value Provided

120,000 German-language podcasts

Licenses Provided

All CC licenses are in use.

Implementation

thomasdiesenreiter commented 1 year ago

Hi there, which technical details would you need from us to get included?

zackkrida commented 1 year ago

Hi @thomasdiesenreiter. The first thing to know: is there an API or other mechanism to get the data for all items in CBA? If there's no API, we could also work something out with, for example, bulk files containing the metadata about works in CBA that we can access periodically.

thomasdiesenreiter commented 1 year ago

Hi there, the CBA is basically a heavily expanded WordPress site. We use a modified custom post type (CPT) structure with stations < podcasts < posts < attachments. For audio attachments we create additional metafiles such as waveforms.

The easiest way at the moment is to use the REST API. We have some custom routes that pull together all the relevant information for the media files. I can try to put together detailed documentation.
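To illustrate the REST approach, here is a minimal sketch of paging through a WordPress-style collection endpoint. It assumes the standard WordPress core conventions (`page` and `per_page` query parameters, and an `X-WP-TotalPages` response header); the custom CBA routes are not documented in this thread and may behave differently. The names `build_page_url`, `fetch_json_page`, and `iter_records` are hypothetical helpers, not part of any existing codebase.

```python
import json
import urllib.request
from typing import Callable, Iterator, List, Tuple
from urllib.parse import urlencode

Page = Tuple[List[dict], int]  # (records on this page, total number of pages)


def build_page_url(base_url: str, page: int, per_page: int = 100) -> str:
    """Append WordPress-style pagination parameters to a collection URL."""
    return f"{base_url}?{urlencode({'page': page, 'per_page': per_page})}"


def fetch_json_page(url: str) -> Page:
    """Fetch one page over HTTP and read the total page count from the headers.

    Assumption: the endpoint sends X-WP-TotalPages, as WordPress core routes do.
    """
    with urllib.request.urlopen(url) as response:
        total_pages = int(response.headers.get("X-WP-TotalPages", "1"))
        return json.load(response), total_pages


def iter_records(
    fetch_page: Callable[[str], Page], base_url: str, per_page: int = 100
) -> Iterator[dict]:
    """Yield every record in the collection, requesting one page at a time."""
    page = 1
    while True:
        records, total_pages = fetch_page(build_page_url(base_url, page, per_page))
        yield from records
        # Stop on the last page, or early if a page unexpectedly comes back empty.
        if not records or page >= total_pages:
            break
        page += 1
```

Passing the fetcher in as an argument keeps the paging logic testable without network access. A real run might look like `iter_records(fetch_json_page, "https://example.org/wp-json/wp/v2/posts")`, where the URL is a placeholder rather than an actual CBA route.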

One disclaimer: some files contain copyrighted music. We have the proper licenses for playout, but we are nonetheless working on providing automatically cut versions without copyrighted music (including an editor for end users to identify which files are copyrighted). I suppose that would be the version that could be added to Openverse?

Also: I haven't had time to check whether you restrict the kinds of CC licenses you accept. Do you have any restrictions here? We give our users the freedom to select the CC license they would like to use.

thomasdiesenreiter commented 1 year ago

Also a small correction: we currently host 138,000 broadcasts, of which around 80% are in German and the other 20% are in 48 other languages.

zackkrida commented 1 year ago

Thanks, Thomas! We include all of the CC licenses. We also already ingest some sources using WP REST API endpoints, so we have experience there too.

The copyrighted music piece does sound important. It is likely we would have to wait for that to be completed. Or, are broadcasts with copyrighted music labeled as such? If they are, we could just exclude them from our pipeline.

thomasdiesenreiter commented 1 year ago

Good to know, thanks!

Yes, broadcasts with copyrighted music are labeled and can be filtered. Maybe it is best, then, to wait until we have finished removing the copyrighted music from the broadcasts, and then get back to you.
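For reference, the interim option discussed above (excluding labeled broadcasts in the pipeline) could be sketched roughly as follows. The field name `has_copyrighted_music` is purely hypothetical, since this thread does not say how the label is exposed in the CBA metadata.

```python
from typing import Iterable, Iterator

# Hypothetical field name: the actual label used by CBA is not given in this thread.
COPYRIGHT_FLAG = "has_copyrighted_music"


def exclude_copyrighted(
    records: Iterable[dict], flag: str = COPYRIGHT_FLAG
) -> Iterator[dict]:
    """Yield only records explicitly labeled as free of copyrighted music."""
    for record in records:
        # A missing label is treated as "unknown" and skipped, so that only
        # broadcasts positively marked as clean pass through.
        if record.get(flag) is False:
            yield record
```

Keeping only records whose flag is explicitly `False` is a deliberately conservative choice: an unlabeled broadcast is dropped rather than assumed to be clean.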