mathemancer closed 4 years ago
Note that the reason for not using standard quantile normalization is to preserve some of the idiosyncrasies of the original distributions, while still ending up with a somewhat comparable standardized popularity.
Fixes
Fixes #431 by @kgodey
Description
This PR adds 3 Apache Airflow DAGs, defined at
src/cc_catalog_airflow/dags/recreate_image_popularity_calculation.py
src/cc_catalog_airflow/dags/refresh_all_image_popularity_data.py
src/cc_catalog_airflow/dags/refresh_image_view_data.py
These DAGs define a "pipeline" in the Upstream DB that culminates in the PostgreSQL materialized view `image_view`. This new view has a column, `standardized_popularity`, containing a float between 0 and 1 representing the popularity of a given image within a given `provider` (the domain from which the image and its metadata were collected).
Technical details
The `standardized_popularity` value is calculated from a raw 'popularity metric' such as views, comments, likes, etc. The ones we currently use are:
- `views` in the case of Flickr
- `global_usage_count` in the case of Wikimedia Commons (this is the number of 'wikiverse' pages that use an image)

We map the raw values to the interval `[0, 1)` via a process that allows us to pin a given percentile value within the real distribution (over the values taken by the raw metric for a given provider) to a desired value in the interval. The current implementation pins the 85th percentile value to 0.85.

To calculate the standardized value in `[0, 1)` for a given metric, we do the following:
1. Find the 85th percentile value (`V`) within the distribution of the raw metric. In this implementation, the percentile value is stored for each `provider, metric` pair in the materialized view `image_popularity_constants` in the `value` column.
2. Calculate the constant `C = (1 - 0.85) / 0.85 * V`. This is also stored for each `provider, metric` pair in `image_popularity_constants` in the `constant` column.
3. For a given raw metric value `x`, calculate the standardized popularity via `P = x / (x + C)`. In this implementation, this is stored in the `standardized_popularity` column of the `image_view` materialized view.

With this choice of `C`, a raw value of `x = V` gives `P = V / (V + C) = 0.85`, so the 85th percentile value lands exactly on 0.85.

Related to the PR: #426
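The three standardization steps under Technical details can be sketched in plain Python. This is only an illustration of the arithmetic, not the SQL used by the materialized views: the function names and the sample `views` data are hypothetical, and the interpolating percentile is an assumption (chosen to resemble PostgreSQL's `percentile_cont`), since the description above does not say which percentile function is used.

```python
import math

# From the PR description: the 85th percentile value is pinned to 0.85.
PINNED_PERCENTILE = 0.85


def percentile_value(raw_values, fraction=PINNED_PERCENTILE):
    """Step 1: find V, the value at the given percentile of the raw metric.

    Assumption: linear interpolation between neighbouring values; the real
    pipeline may compute the percentile differently.
    """
    ordered = sorted(raw_values)
    if not ordered:
        raise ValueError("need at least one raw value")
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def popularity_constant(v, fraction=PINNED_PERCENTILE):
    """Step 2: C = (1 - 0.85) / 0.85 * V."""
    return (1 - fraction) / fraction * v


def standardized_popularity(x, constant):
    """Step 3: P = x / (x + C), which lies in [0, 1) for x >= 0 and C > 0."""
    return x / (x + constant)


# Hypothetical raw `views` counts for a single provider.
sample_views = [0, 3, 5, 8, 13, 21, 34, 55, 89, 144, 2000]

v = percentile_value(sample_views)
c = popularity_constant(v)
for views in sample_views:
    print(views, round(standardized_popularity(views, c), 3))
```

Note that the sketch does not handle the degenerate case where `V` (and therefore `C`) is 0, in which `x = 0` would divide by zero. How the actual views treat that case is not stated here.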
Tests
There are a number of tests for the new functionality.
Also, the reviewer can follow the README to bring up the dev environment, turn on the `tsv_to_postgres_loader` DAG, then run the `flicker_ingestion` and `wikimedia_commons_ingestion` DAGs for a while. Once data has been collected, run the `recreate_image_popularity_calculation` DAG, and then query `image_view` in the local DB to see the results.
Screenshots
Checklist
- [X] My pull request has a descriptive title (not a vague title like `Update index.md`).
- [X] My pull request targets the *default* branch of the repository (`main` or `master`).
- [X] My commit messages follow [best practices][best_practices].
- [X] My code follows the established code style of the repository.
- [X] I added tests for the changes I made (if applicable).
- [ ] ~I added or updated documentation (if applicable).~
- [X] I tried running the project locally and verified that there are no visible errors.

[best_practices]: https://gist.github.com/robertpainsi/b632364184e70900af4ab688decf6f53

## Developer Certificate of Origin
```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.
```