At https://archive.org/details/audio users can set CC licenses when uploading their work. I've found examples of CC0 and PDM while browsing the collection at random. It is unclear how many objects the collection holds, and whether there's an API that exposes the necessary data.
More ticket work is required to see if there's a path forward here.
Checklist to complete before beginning development
No development should be done on a Provider API Script until the following info is gathered:
[ ] Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
[ ] Verify the API provides license info (license type and version; license URL provides both, and is preferred)
[ ] Verify the API provides stable direct links to individual works.
[ ] Verify the API provides a stable landing page URL to individual works.
[ ] Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
[ ] Attach example responses to API queries that have the relevant info.
General Recommendations for implementation
The script should be in the src/cc_catalog_airflow/dags/provider_api_scripts/ directory.
The script should have a test suite in the same directory.
The script must use the ImageStore class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/storage/image.py).
The script should use the DelayedRequester class (Import this from
src/cc_catalog_airflow/dags/provider_api_scripts/common/requester.py).
The script must not use anything from
src/cc_catalog_airflow/dags/provider_api_scripts/modules/etlMods.py, since
that module is deprecated.
If the provider API can be queried by 'upload date' or something similar,
the script should take a --date parameter when run as a script, giving the
date for which we should collect images. The form should be YYYY-MM-DD (so
the script can be run via python my_favorite_provider.py --date 2018-01-01).
The script must provide a main function that takes the same parameters as
the CLI. In our example from above, we'd then have a main function
my_favorite_provider.main(date). Calling main should do the same thing as
running the script from the CLI.
The script must conform to PEP8. Please use pycodestyle (available via
pip install pycodestyle) to check for compliance.
The script should use small, testable functions.
The test suite for the script may break PEP8 rules regarding long lines where
appropriate (e.g., long strings for testing).
Examples of other Provider API Scripts
For example Provider API Scripts and accompanying test suites, please see
src/cc_catalog_airflow/dags/provider_api_scripts/flickr.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_flickr.py, or
src/cc_catalog_airflow/dags/provider_api_scripts/wikimedia_commons.py and
src/cc_catalog_airflow/dags/provider_api_scripts/test_wikimedia_commons.py.
Issue author amartya-dev commented on Thu Mar 19 2020:
Information gathered about the API
The following is the information about the API:
[x] Verify there is a way to retrieve the entire relevant portion of the provider's collection in a systematic way via their API.
Yes, the files can be fetched systematically once the number of entries per page has been chosen. The API also supports pagination: the page parameter in the API request selects which page of results to return.
The endpoint is: https://archive.org/advancedsearch.php.
Documentation for the API: https://blog.archive.org/developers/
The other official documentation: https://archive.org/services/docs/api/
The second set of documentation describes a command-line tool and a Python wrapper, both of which can be used after obtaining API credentials from the Internet Archive.
[x] Verify the API provides license info (license type and version; license URL provides both, and is preferred)
The API provides the license URL with the licenseurl key in the response JSON.
[x] Verify the API provides stable direct links to individual works.
The API does not provide the links directly, but they can be formed from the identifier together with the metadata returned by a separate endpoint.
[x] Verify the API provides a stable landing page URL to individual works.
The API provides a stable landing page URL for the works: https://archive.org/details/ followed by the item's identifier.
[x] Note other info the API provides, such as thumbnails, dimensions, attribution info (required if non-CC0 licenses will be kept), title, description, other meta data, tags, etc.
[x] Attach example responses to API queries that have the relevant info.
Example response:
{'responseHeader': {'status': 0,
                    'QTime': 788,
                    'params': {'query': 'mediatype:audio',
                               'qin': 'mediatype:audio',
                               'fields': 'identifier,title,mediatype,collection,licenseurl,date',
                               'wt': 'json',
                               'rows': '2',
                               'start': 0}},
 'response': {'numFound': 9441652,
              'start': 0,
              'docs': [{'collection': ['audio_sermons', 'audio_religion'],
                        'identifier': 'JesusTheRescuer',
                        'licenseurl': 'http://creativecommons.org/licenses/by-nc-nd/3.0/',
                        'mediatype': 'audio',
                        'title': 'Jesus, The Rescuer'},
                       {'collection': ['audio_sermons', 'audio_religion'],
                        'date': '2015-07-05T00:00:00Z',
                        'identifier': 'July52015EveningSermon',
                        'mediatype': 'audio',
                        'title': 'What to Do When the Foundations Are Destroyed'}]}}
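To make the findings above concrete, here is a rough sketch of how one page of results might be requested and how each doc in a response like the one above could be turned into license and landing page info. Only the endpoint, the page parameter, the licenseurl key, and the landing page prefix come from this comment; the other query parameter names (q, fl[], rows, output) are my recollection of the advanced search form and should be checked against the documentation. The function names are hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode, urlparse

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"
LANDING_PREFIX = "https://archive.org/details/"
FIELDS = ["identifier", "title", "mediatype", "collection",
          "licenseurl", "date"]
CC_HOSTS = {"creativecommons.org", "www.creativecommons.org"}


def build_search_url(page, rows=100):
    """Build the URL for one page of audio results (pages start at 1)."""
    # Parameter names other than "page" are assumptions of this sketch.
    params = [("q", "mediatype:audio")]
    params += [("fl[]", field) for field in FIELDS]
    params += [("rows", rows), ("page", page), ("output", "json")]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def parse_license_url(license_url):
    """Return (license, version) for a CC license URL, or None."""
    parsed = urlparse(license_url)
    parts = [part for part in parsed.path.split("/") if part]
    if parsed.netloc not in CC_HOSTS or len(parts) < 3:
        return None
    if parts[0] == "licenses":
        # e.g. /licenses/by-nc-nd/3.0/
        return parts[1], parts[2]
    if parts[0] == "publicdomain":
        # e.g. /publicdomain/zero/1.0/ (CC0) or /publicdomain/mark/1.0/ (PDM)
        names = {"zero": "cc0", "mark": "pdm"}
        if parts[1] in names:
            return names[parts[1]], parts[2]
    return None


def extract_record(doc):
    """Return the fields of interest, or None without a usable license."""
    license_info = parse_license_url(doc.get("licenseurl", ""))
    if license_info is None:
        return None
    license_, version = license_info
    return {
        "identifier": doc["identifier"],
        "title": doc.get("title"),
        "license": license_,
        "license_version": version,
        "landing_url": LANDING_PREFIX + doc["identifier"],
    }
```

Applied to the example response, the first doc would yield a by-nc-nd 3.0 record with landing page https://archive.org/details/JesusTheRescuer, while the second doc has no licenseurl and would be skipped. A real script would presumably fetch the URL through DelayedRequester and increment page until a page comes back with no docs.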
This issue has been migrated from the CC Search Catalog repository
Provider API Endpoint / Documentation
Provider description
https://archive.org/details/audio
Licenses Provided
Provider API Technical info
mathemancer commented on Tue Mar 24 2020: