creativecommons / quantifying

quantify the size and diversity of the commons--the collection of works that are openly licensed or in the public domain

[Feature] Automate Data Gathering and Analysis/Rendering #22

Closed TimidRobot closed 1 month ago

TimidRobot commented 1 year ago

Problem

The focus of this project is on handling data in a way that is reproducible and update-able.

Description

Alternatives

Do everything manually 😩

Additional context

(Suggestions welcomed! Please comment if you have any relevant links to share.)

TimidRobot commented 1 year ago

Also see @Bransthre 's 2022-11-26 blog post about this issue:

samadpls commented 1 year ago

Hello @TimidRobot ! I'm interested in contributing to this project as part of GSoC 2023. After reading the problem and description, I think I have a good understanding of the goals of the project and the criteria for success. However, I would appreciate more information on some of the specific questions raised in the description.

First, regarding the frequency of data gathering and analysis, are there any constraints or limitations that need to be taken into account? For example, are there certain times of day when data should be collected, or are there restrictions on the number of requests that can be made to an API?

Second, regarding the strategy for handling data over multiple days, can you provide more information on what kind of data we'll be working with and what the expected volume is? This will help determine what kind of storage and naming conventions we should use.

Finally, regarding the format and storage of the data, what are the specific requirements or preferences for how the data should be formatted and stored? Should we prioritize readability or efficiency, or is there some other consideration to take into account?

Thank you for your time and guidance! I'm excited to work on this project and look forward to hearing back from you.

Kd-Here commented 1 year ago

I'm interested in this project. Let us know once we have a mentor.

TimidRobot commented 1 year ago

@samadpls

First, regarding the frequency of data gathering and analysis, are there any constraints or limitations that need to be taken into account? For example, are there certain times of day when data should be collected, or are there restrictions on the number of requests that can be made to an API?

Different APIs have different limits on queries per day. (Adding this information to the README or creating a dedicated sources markdown document would be helpful--see #37).

Second, regarding the strategy for handling data over multiple days, can you provide more information on what kind of data we'll be working with and what the expected volume is? This will help determine what kind of storage and naming conventions we should use.

See the existing CSV files and scripts.

Finally, regarding the format and storage of the data, what are the specific requirements or preferences for how the data should be formatted and stored? Should we prioritize readability or efficiency, or is there some other consideration to take into account?

This is an unanswered question. However, any proposed solutions should be compared against CSVs for readability/interoperability and SQLite for efficiency.

HoneyTyagii commented 1 year ago

@TimidRobot Greetings, I stumbled upon your GitHub repository, and I'm interested in contributing.

TimidRobot commented 1 year ago

@HoneyTyagii Welcome! Please see Contribution Guidelines — Creative Commons Open Source.

HoneyTyagii commented 1 year ago

@HoneyTyagii Welcome! Please see Contribution Guidelines — Creative Commons Open Source.

@TimidRobot Thanks for getting back to me! I really appreciate the prompt response and the link to the contribution guidelines. I'll make sure to read through them thoroughly before submitting any contributions. If I have any questions, I'll reach out to you for further assistance. Thanks again!

satyampsoni commented 1 year ago

Hello @TimidRobot! I am interested in contributing to this project in GSoC 2023. I have tried to understand the project, and I would appreciate your corrections wherever I am mistaken. Here is the summary I have written based on my understanding.

A general overview of the steps involved in the Code-base (GitHub):

  1. Data collection: The code collects data from various sources, such as the Creative Commons search engine, the Flickr API, and Wikimedia Commons.

  2. Data cleaning: The collected data is cleaned and standardized to remove duplicates, missing values, and other errors.

  3. Data analysis: The cleaned data is analyzed using statistical methods and machine learning algorithms to identify patterns and trends in the data.

  4. Report generation: Based on the analysis, reports are generated using Python libraries such as Matplotlib and Pandas. The reports include visualizations and tables that summarize the data and provide insights into the impact of Creative Commons licenses.

  5. Automation: To ensure that the reports are never more than three months out of date, the code-base uses automation techniques, such as GitHub Actions, to periodically run the data collection, cleaning, analysis, and report generation steps.

Any further assistance will be highly appreciated.

Paulooh007 commented 1 year ago

Hello @TimidRobot! I am interested in contributing to this project in GSoC 2023. I have tried to understand the project, and I would appreciate your corrections wherever I am mistaken. Here is the summary I have written based on my understanding.

  • The Quantifying the Commons project is an initiative by Creative Commons to measure the impact of Creative Commons licenses on the sharing and reuse of creative works
  • The main objective of the project is to automate the process of data gathering and reporting so that the reports are never more than three months out of date

A general overview of the steps involved in the Code-base (GitHub):

  1. Data collection: The code collects data from various sources, such as the Creative Commons search engine, the Flickr API, and Wikimedia Commons.
  2. Data cleaning: The collected data is cleaned and standardized to remove duplicates, missing values, and other errors.
  3. Data analysis: The cleaned data is analyzed using statistical methods and machine learning algorithms to identify patterns and trends in the data.
  4. Report generation: Based on the analysis, reports are generated using Python libraries such as Matplotlib and Pandas. The reports include visualizations and tables that summarize the data and provide insights into the impact of Creative Commons licenses.
  5. Automation: To ensure that the reports are never more than three months out of date, the code-base uses automation techniques, such as GitHub Actions, to periodically run the data collection, cleaning, analysis, and report generation steps.

Any further assistance will be highly appreciated.

Hi @satyampsoni, from my understanding you've correctly summarised the project, except that the automation hasn't been implemented yet. I personally found the article series by @Bransthre quite helpful; it explains the whole development process. @TimidRobot already shared a part of it.

satyampsoni commented 1 year ago

Thanks for sharing the blog, @Paulooh007! I am checking it out, and if I need any help I'll reach out to you.

satyampsoni commented 1 year ago

In sources.md only 8 data sources are listed, while the article series covers 9 sources. The DeviantArt data source is not present there. @TimidRobot @Paulooh007 do you know the reason, or was it left out by mistake?

Paulooh007 commented 1 year ago

In sources.md only 8 data sources are listed, while the article series covers 9 sources. The DeviantArt data source is not present there. @TimidRobot @Paulooh007 do you know the reason, or was it left out by mistake?

Both the Google Custom Search and DeviantArt scripts use the same data source: the Custom Search JSON API, which performs a Google Search with the arguments provided in the API call.

So for DeviantArt, we're limiting the scope of the search by setting the relatedSite query parameter to deviantart.com. This explains why we have only 8 sources. See line 65 of deviantart_scratcher.py:

(
    "https://customsearch.googleapis.com/customsearch/v1"
    f"?key={api_key}&cx={PSE_KEY}"
    "&q=_&relatedSite=deviantart.com"
    f'&linkSite=creativecommons.org{license.replace("/", "%2F")}'
)
satyampsoni commented 1 year ago

Oh! I see.

Saigenix commented 5 months ago

Hello, I would like to work on this feature, and I think it is also included in GSoC 2024. When should I start contributing or analyzing? Can you please elaborate, @TimidRobot?

TimidRobot commented 5 months ago

@Saigenix welcome!

Please see Contribution Guidelines — Creative Commons Open Source for how we manage issues and PRs (we generally don't assign issues prior to resolution).

Also, this issue largely duplicates the GSoC 2024 Automating Quantifying the Commons project. You may find #39 more helpful:

Saigenix commented 5 months ago

Thank you @TimidRobot

Darylgolden commented 5 months ago

Hi @TimidRobot!

As mentioned in the Slack, I'm interested in working on this project.

What is the strategy for gathering data over multiple days (due to query limits)?

A GitHub Action can be scheduled to run at fixed times each day. By storing data about the last successful run, we can run each task only when it is sufficiently outdated, with exponential backoff, for instance.

Please start with the assumption that each combination of source and stage will require its own script to be executed 1+ times by GitHub Actions.

That's certainly possible, and is probably the simplest solution to get a minimal working product. However, it might then be more challenging to implement the scheduling logic. I think it would be difficult to do directly in GitHub Actions, so it's probably best to use a helper script, but at that point we may as well convert the scripts into classes with methods and run them as a unified program anyway. All of the scripts seem to simply define a few constants and functions, then run a few functions such as set_up_data_file() and record_all_licenses(), so I don't think it would be complicated to package them into classes. This approach also helps with code deduplication; common logic can be implemented in a base class which the others inherit from.
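For illustration, a rough sketch of what that base class might look like. The class itself is hypothetical; only set_up_data_file() and record_all_licenses() echo names from the existing scripts, and the CSV column names are illustrative:

# Hypothetical sketch (not repository code): a shared base class whose
# subclasses would implement only the source-specific query logic.
import csv
from abc import ABC, abstractmethod


class Scratcher(ABC):
    """Common logic shared by the per-source scripts."""

    def __init__(self, data_path, license_paths):
        self.data_path = data_path
        self.license_paths = license_paths

    def set_up_data_file(self):
        # Write the CSV header once (column names here are illustrative).
        with open(self.data_path, "w", newline="") as file:
            csv.writer(file).writerow(["LICENSE TYPE", "Document Count"])

    @abstractmethod
    def query_license(self, license_path):
        """Return the document count for one license (source-specific)."""

    def record_all_licenses(self):
        # Shared loop: each subclass only supplies query_license().
        with open(self.data_path, "a", newline="") as file:
            writer = csv.writer(file)
            for license_path in self.license_paths:
                writer.writerow([license_path, self.query_license(license_path)])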

Storing the data in the repository has issues, but it is also simple and free.

One concern I have about this approach is that if the automation scripts were to run regularly (e.g. daily), it would result in a lot of commits to the repository, which could make the commit history hard to navigate. Though I suppose if you are willing to live with this, then there isn't much of a downside. Another option is to commit the data to another branch, like GitHub Pages does.

What is the strategy for ensuring automated updates do not result in broken/incomplete state if they don't complete successfully?

I think we should start by splitting each task into many small subtasks, each one being able to run and update data independently. For example, vimeo_scratcher.py queries 8 different licenses, with each query being able to run independently. Then each subtask writes data only if it successfully completes. This would work best with a data format that allows each entry to be updated independently and asynchronously, which is why I think something like an SQL database would be ideal.
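As a rough illustration (the table and column names below are made up, not anything in the repository), each subtask could record its own result independently, and only after its query succeeds:

# Hypothetical sketch (table and column names are made up): each subtask
# records its own (source, license) row only after its query succeeds.
import sqlite3
from datetime import datetime, timezone


def record_result(db_path, source, license_path, count):
    with sqlite3.connect(db_path) as connection:
        connection.execute(
            """CREATE TABLE IF NOT EXISTS counts (
                   source TEXT,
                   license TEXT,
                   count INTEGER,
                   updated TEXT,
                   PRIMARY KEY (source, license)
               )"""
        )
        connection.execute(
            "INSERT OR REPLACE INTO counts VALUES (?, ?, ?, ?)",
            (source, license_path, count,
             datetime.now(timezone.utc).isoformat()),
        )


# A subtask whose query fails simply never calls record_result(), so the
# previous row for that (source, license) pair is left untouched.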

TimidRobot commented 4 months ago

@Darylgolden

Please start with the assumption that each combination of source and stage will require its own script to be executed 1+ times by GitHub Actions.

That's certainly possible, and is probably the simplest solution to get a minimal working product. However, it might then be more challenging to implement the scheduling logic. I think it would be difficult to do directly in GitHub Actions, so it's probably best to use a helper script

Remember that the goal is a complete report every quarter--every three months. Handling state will be a primary concern. Each query will need to be scheduled to run multiple times for both large data sets (ex. to work with daily query limits) and for redundancy. I usually prefer shared libraries instead of a single launcher/helper script.

but at that point we may as well convert the scripts into classes with methods and run it as a unified program anyways. All of the scripts seem to simply define a few constants and functions, then run a few functions such as set_up_data_file() and record_all_licenses(), so I don't think it would be complicated to package them into classes. This approach also helps code deduplication; common logic can be implemented in a base class which the others inherit from.

I suspect the data available from the various sources is too different to benefit from unification. I prefer to avoid classes until their complexity and obfuscation are clearly worth it. That may be the case here, but everyone deserves to know my biases.

Storing the data in the repository has issues, but it is also simple and free.

One concern I have about this approach is that if the automation scripts were to run regularly (e.g. daily), it would result in a lot of commits to the repository, which could make the commit history hard to navigate. Though I suppose if you are willing to live with this, then there isn't much of a downside. Another option is to commit the data to another branch, like GitHub Pages does.

At a quarterly cadence, I don't expect it to be too noisy. I don't like long lived special purpose branches. I think they end up hiding information. If it became an issue, a separate repository is also an option.

What is the strategy for ensuring automated updates do not result in broken/incomplete state if they don't complete successfully?

I think we should start by splitting each task into many small subtasks, each one being able to run and update data independently. For example, vimeo_scratcher.py queries 8 different licenses, with each query being able to run independently. Then each subtask writes data only if it successfully completes. This would work best with a data format that allows each entry to be updated independently and asynchronously, which is why I think something like an SQL database would be ideal.

If each query stores its data in a separate file (ex. CSV), then they can be updated independently and asynchronously. I lean towards plaintext because it prioritizes visibility, human interaction, and broad compatibility.
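For example, a minimal sketch of the per-query file idea (the file layout, helper name, and column names here are illustrative only, not the repository's current convention):

# Hypothetical sketch (file layout and column names are illustrative):
# each query writes its own small CSV, so queries run on different days
# never overwrite each other's successful results.
import csv
from pathlib import Path


def write_raw_csv(interval_dir, query_name, rows):
    out_path = Path(interval_dir) / f"{query_name}.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(["LICENSE TYPE", "Document Count"])
        writer.writerows(rows)


# ex. write_raw_csv("data/raw/2024Q3/flickr", "by-nc-sa", [["by-nc-sa", 12345]])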


In general, I encourage everyone to pursue the simplest and most boring technologies for this project. It isn't a technical demo nor a technology learning project. The easier it is to engage with and to maintain, the longer it will benefit the community. I still find PEP 20 – The Zen of Python | peps.python.org to be helpful and instructive.

Darylgolden commented 4 months ago

Thank you @TimidRobot for the reply!

In general, I encourage everyone to pursue the simplest and most boring technologies for this project. It isn't a technical demo nor a technology learning project.

I would like to clarify that I did not propose my implementation with the intent of making it a technical demo or technology learning project, but rather because it was what I initially thought was the simplest and most maintainable design for the project. I have worked on projects with convoluted and unmaintainable code, and I have read the Zen of Python, so I definitely understand the importance of simple and boring code. My instinct for clean design clearly differs from yours, and while I'm of course happy to go with whatever design you think suits this project best, I think I would be doing a disservice if I did not at least try to propose alternative designs and compare their merits. That being said, I do see now the benefits of using a shared library design over helper scripts/OOP and am happy to pursue this design instead.

I think we would need to add three fields for each of the data files: time_of_last_successful_update, time_of_last_failed_update, and exponential_backoff_factor. The exponential_backoff_factor field would start off at 0, increasing by 1 with each failure and resetting to 0 with each success. The script would try to update a data file only if the current time is more than $2^\text{exponential backoff factor}$ days since the last update. This logic can then be implemented in a library that is used in each of the scripts.
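A minimal sketch of that library logic, assuming the three fields live in a small JSON sidecar file next to each data file (the storage format and helper names are assumptions for illustration):

# Hypothetical sketch of the proposed fields (keeping them in a JSON sidecar
# file per data file is an assumption made only for this illustration).
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path


def should_update(state_path):
    path = Path(state_path)
    if not path.exists():
        return True
    state = json.loads(path.read_text())
    timestamps = [t for t in (state.get("time_of_last_successful_update"),
                              state.get("time_of_last_failed_update")) if t]
    if not timestamps:
        return True
    # Update only if more than 2**factor days have passed since the last attempt.
    wait = timedelta(days=2 ** state.get("exponential_backoff_factor", 0))
    last_attempt = datetime.fromisoformat(max(timestamps))
    return datetime.now(timezone.utc) - last_attempt > wait


def record_attempt(state_path, succeeded):
    path = Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {}
    now = datetime.now(timezone.utc).isoformat()
    if succeeded:
        state["time_of_last_successful_update"] = now
        state["exponential_backoff_factor"] = 0      # reset on success
    else:
        state["time_of_last_failed_update"] = now    # increment on failure
        state["exponential_backoff_factor"] = state.get(
            "exponential_backoff_factor", 0) + 1
    path.write_text(json.dumps(state, indent=2))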

What do you think of this design? If you're happy with it, should I start drafting a proposal?

TimidRobot commented 4 months ago

I think we would need to add three fields for each of the data files: time_of_last_successful_update, time_of_last_failed_update, and exponential_backoff_factor. The exponential_backoff_factor field would start off at 0, increasing by 1 with each failure and resetting to 0 with each success. The script would try to update a data file only if the current time is more than $2^\text{exponential backoff factor}$ days since the last update.

State management depends on the architecture of the entire process. For example, if there are separate phases for querying data and processing data, then there is no need to update queried data. Instead, each query can write to a separate file (all of which would be combined during the processing phase).

For example, potential logic of a query script that is run every day:

  1. Exit if the raw data is complete for this interval
  2. Read state (ex. set a of z) if there are raw data files from a previous run during this interval
    • set size might depend on daily query limits
  3. Query source for current chunk (ex. set b of z) with exponential backoff
  4. Write raw data file on success

For example, potential logic of a processing script that is run every day:

  1. Exit if the processed data is complete for this interval
  2. Exit unless the raw data is complete for this interval
  3. Read data files (ex. a through z) and combine & process data
  4. Write data file on success

This is not how it must be done, merely a way that I can imagine it. Some complexities are worth taking on within the context of the total plan.
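For illustration only, a very rough sketch of the state handling in the query-script outline above (the interval label, directory layout, set naming, and helper names are all assumptions, not repository code):

# Hypothetical sketch of steps 1-2 of the query-script outline (interval
# label, directory layout, and set naming are assumptions, not repo code).
from datetime import date
from pathlib import Path

SET_NAMES = [chr(c) for c in range(ord("a"), ord("z") + 1)]  # sets "a".."z"


def quarter_label(today=None):
    # The interval label, ex. "2024Q3" -- reports are quarterly.
    today = today or date.today()
    return f"{today.year}Q{(today.month - 1) // 3 + 1}"


def next_set(raw_dir, source):
    # 1. Return None (exit) if the raw data is complete for this interval.
    # 2. Otherwise, read state from the files left by previous runs and
    #    return the first set that still has no raw data file.
    interval_dir = Path(raw_dir) / quarter_label() / source
    done = {path.stem for path in interval_dir.glob("*.csv")}
    return next((name for name in SET_NAMES if name not in done), None)


# Steps 3-4 would then query the source for just that one set (with
# exponential backoff) and write interval_dir / f"{set_name}.csv" only on
# success. The processing script mirrors this: exit early unless every set's
# raw file exists, then combine them and write the processed file for the
# quarter.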

wulfeniite commented 4 months ago

(Suggestions welcomed! Please comment if you have any relevant links to share.)

Hi @TimidRobot! What's your opinion on integrating something like OpenTelemetry here?

The JSON file we'll obtain would have standardized data, and we can further work on visualization using in-built tools.

Darylgolden commented 3 months ago

(Suggestions welcomed! Please comment if you have any relevant links to share.)

Hi @TimidRobot! What's your opinion on integrating something like OpenTelemetry here?

The JSON file we'll obtain would have standardized data, and we can further work on visualization using in-built tools.

I'm not @TimidRobot, but it seems like OpenTelemetry is mainly used for collecting data from your own applications and not retrieving data from APIs. Do you have an example of it doing the latter?

Darylgolden commented 3 months ago

What is the strategy for gathering data over multiple days (due to query limits)?

Has there been a case where query limits have actually been hit? Because looking at sources.md, the limits seem much more than enough for our purposes. If that's the case, maybe the simplest solution of just running all the scripts on a schedule is the best.

TimidRobot commented 3 months ago

What is the strategy for gathering data over multiple days (due to query limits)?

Has there been a case where query limits have actually been hit? Because looking at sources.md, the limits seem much more than enough for our purposes. If that's the case, maybe the simplest solution of just running all the scripts on a schedule is the best.

Yes, the Google Custom Search JSON API. See:

Paulooh007 commented 3 months ago

What is the strategy for gathering data over multiple days (due to query limits)?

Has there been a case where query limits have actually been hit? Because looking at sources.md, the limits seem much more than enough for our purposes. If that's the case, maybe the simplest solution of just running all the scripts on a schedule is the best.

Hi, have you tried running google_scratcher.py? The script queries for licenses in legal-tool-paths.txt and also for all languages in google_lang.txt.

TimidRobot commented 1 month ago

This issue became a GSoC 2024 project.