Clarify/Refactor Archives Logic

maxachis commented 1 month ago

The Archives resource and its attendant functionality is unclear to me on several levels:

I do not know where its endpoint is being called from
I do not fully understand under what conditions a data source is to be archived
The logic for determining what is archived seems awkward, and I'm not sure why a simple flag of "archived" in each row would not suffice:

def archives_get_results(conn: PgConnection) -> list[tuple[Any, ...]]:
    """
    Pulls data sources for the automatic archives script that performs caching

    :param conn: A psycopg2 connection object to a PostgreSQL database.
    :return: A list of dictionaries representing the rows matching the query conditions.
    """
    cursor = conn.cursor()
    sql_query = """
    SELECT
        airtable_uid,
        source_url,
        update_frequency,
        last_cached,
        broken_source_url_as_of
    FROM
        data_sources
    WHERE 
        (last_cached IS NULL OR update_frequency IS NOT NULL) AND broken_source_url_as_of IS NULL AND url_status <> 'broken' AND source_url IS NOT NULL
    """
    cursor.execute(sql_query)

    return cursor.fetchall()

I'd like to figure out the answer to these questions, and possibly refactor the Archives resource for improved clarity, or update the documentation.

josh-chamberlain commented 3 weeks ago

@maxachis I think this is coming from/used by code in the automatic-archives repo: https://github.com/Police-Data-Accessibility-Project/automatic-archives

maxachis commented 1 week ago

@josh-chamberlain Related to what is discussed in this issue, I think setting up an archived_data_sources table in the database will help clarify logic and keep data_sources from being too populated.

Also (and possibly relevant to the above point), I would like to know by what conditions we determine whether a data source should be archived, or if every data source should be archived.

EvilDrPurple commented 1 week ago

@maxachis

Also (and possibly relevant to the above point), I would like to know by what conditions we determine whether a data source should be archived, or if every data source should be archived.

If I'm following the archive logic correctly in the SQL query, in plain english: A source is archived if: The source was never archived before OR the source is updated regularly AND the source url is not broken AND the source url is not null

So we basically don't want to waste time archiving sources that are never updated and have already been archived or are broken

josh-chamberlain commented 1 week ago

@EvilDrPurple nailed it! want to work on this issue / the other one linked? this is not related to v2, but it's still pretty important that it works correctly and clearly. One day we will get more sophisticated than internet archive.

@maxachis agreed. it was built by someone without a ton of experience, so it makes sense that there would be room for improvement and clarity! at most, data_sources would be linked to archived_data_sources so we know which sources we're talking about, and can answer questions about when a source was last archived and stuff.

I'd consider renaming this issue to something like "separate archives logic from data sources app". Ideally, our API would just exist, and automatic-archives could do everything it needs to do with the API. automatic-archives is an example of the kind of project we would love for other people to build ~independently using our API.

EvilDrPurple commented 1 week ago

@EvilDrPurple nailed it! want to work on this issue / the other one linked? this is not related to v2, but it's still pretty important that it works correctly and clearly. One day we will get more sophisticated than internet archive.

@josh-chamberlain yeah sure! I believe I may need access to the database but I can start on it

Police-Data-Accessibility-Project / data-sources-app

Clarify/Refactor Archives Logic #309