[ORG] API endpoint design

davidpomerenke commented 3 months ago

How do we structure the API, what parameters and outputs should it have?

davidpomerenke commented 3 months ago

My first thoughts on this.

Endpoints:

events

- parameters
  - `type` (req.): `protest` (and in the future potentially also other event types)
  - `source` (req.): `acled` / `agence_france_presse` / `german_protest_registrations`
  - `start_date` (req.)
  - `end_date` (req.)
  - `topic` (opt.): `climate_change` / `animal_welfare` / `artificial intelligence` / `antiracism` / ...: these would automatically set relevant filters for keywords and protest organizations
  - `keywords` (opt.): filter by keywords (perhaps even allowing for boolean queries)
  - `organizations` (opt.): filter by protest group(s)
- return value: list of events with properties:
  - `id`: unique id across all sources, perhaps based on hashing
  - `date`
  - `organization`
  - `topic`
  - `description`
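The "unique id across all sources, perhaps based on hashing" idea could look roughly like the sketch below. The function name `event_id`, the choice of hashed fields, and the 16-character truncation are assumptions for illustration; the thread does not specify any of them.

```python
import hashlib
from datetime import date


def event_id(source: str, event_date: date, organization: str, description: str) -> str:
    """Derive a deterministic id from fields that identify an event.

    Assumption: these four fields are enough to tell events apart.
    """
    # Including the source keeps identical records from different sources distinct.
    # The unit separator avoids collisions between e.g. ("ab", "c") and ("a", "bc").
    key = "\x1f".join([source, event_date.isoformat(), organization, description])
    # Truncated to 16 hex characters for readability (an arbitrary choice).
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Because the id depends only on the event's own fields, re-importing the same source yields the same ids, which is convenient for caching and for referencing events via `protest_ids` later on.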

trends

- parameters:
  - `type` (req.): `keywords` / `topics` / `sentiments`
  - `source` (req.): `news_online/mediacloud` / `news_print/genios` / `news_print/dereko` / `social_media/twitter` / `speech/bundestag` / ...
  - and, similar as above:
    - `start_date`
    - `end_date`
    - `topic`: this would automatically set relevant keywords
  - if the `type` is `keywords`:
    - `keywords` (req.): specify the keywords for the search query for the trend
  - ~~`organizations`~~: unlike protest events, trends are not directly related to the different organizations; one can still do a keyword search for the organization names; but having an `organizations` field here would not give proper credit to the more general discourse that may be impacted by protest groups but where they are not explicitly mentioned; tbd!
- return value: list of entries with fields:
  - `date`: this runs through every date from `start_date` to `end_date`; the counts for a date may be 0
  - depending on the requested type:
    - for `keywords`: `count`
    - for `topics`: `count_topic_1`, `count_topic_2`, ...
    - for `sentiments`: tbd!
  - if the `type` is `topics`: for each topic, a description of what it means
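To illustrate the `keywords` return value, here is a minimal sketch of a series with one entry per day, including days with a count of 0. The helper name `daily_counts` and the dict-based entry format are assumptions, not part of the proposal.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(hit_dates: list, start_date: date, end_date: date) -> list:
    """Return one entry per day in [start_date, end_date]; days without hits get count 0."""
    # Ignore hits outside the requested range.
    counts = Counter(d for d in hit_dates if start_date <= d <= end_date)
    n_days = (end_date - start_date).days + 1
    entries = []
    for i in range(n_days):
        day = start_date + timedelta(days=i)
        # Counter returns 0 for missing keys, which gives the zero-filled days.
        entries.append({"date": day.isoformat(), "count": counts[day]})
    return entries
```

Returning a gap-free series means the frontend can plot it directly without having to fill in missing dates itself.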

fulltexts

- parameters
  - `source` (req.): `news_online/mediacloud` / `news_print/dereko` / `social_media/twitter` / `speech/bundestag` = a similar list as above, but not all sources have fulltexts available
  - and, similar as above:
    - `start_date`
    - `end_date`
    - `topic`: this would automatically set relevant keywords
    - `keywords`
  - ~~`organizations`~~: see above
- return value: list of fulltexts with properties:
  - `date`
  - `title`
  - `fulltext`
  - ...

impact

- parameters
  - `cause` (req.)
    - `protest_ids` (opt.)
    - alternatively, filter by the same parameters as for the events endpoint
  - `effect` (req.)
    - filter by the same parameters as for the `trends` endpoint
    - but `start_date`, `end_date`, `topic` should be consistent with the parameters that are set for `cause`; so maybe those should be specified outside of the `cause` / `effect` parameters
  - `method` (req.): `synthetic_control` / `interrupted_time_series` / `regression` / `doubly_robust`; or maybe we always want to use all methods?
- return value
  - `applicability`: `no` / `maybe`
  - `applicability_reason`: text with explanation: e.g. regression is not applicable if there are only very few events, synthetic control may not be applicable if there are no control regions
  - `impact_average`: time-series of average impacts (with avg, ci.95, ci.05)
  - `impact_single`: for each event, time-series of its impact; maybe this field would not be present for regression and doubly robust estimation?
  - `placebo_tests`: ...
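One way to keep `start_date`, `end_date`, and `topic` consistent between `cause` and `effect` is to specify them once, outside both. The sketch below shows that shape with plain dataclasses; all class names and the exact field selection are hypothetical, and only the `method` values come from the list above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

# Method names as listed in the proposal above.
METHODS = {"synthetic_control", "interrupted_time_series", "regression", "doubly_robust"}


@dataclass
class Cause:
    # Either explicit ids, or filters as in the events endpoint (only one filter shown).
    protest_ids: List[str] = field(default_factory=list)
    organizations: List[str] = field(default_factory=list)


@dataclass
class Effect:
    # Same parameters as the trends endpoint, minus the shared ones.
    type: str
    source: str
    keywords: List[str] = field(default_factory=list)


@dataclass
class ImpactRequest:
    # Shared parameters live on the first level, so cause and effect cannot disagree.
    start_date: date
    end_date: date
    cause: Cause
    effect: Effect
    method: str
    topic: Optional[str] = None

    def __post_init__(self) -> None:
        if self.method not in METHODS:
            raise ValueError(f"unknown method: {self.method!r}")
```

With this layout there is simply no place to put a second, conflicting date range, so no consistency check between `cause` and `effect` is needed.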

vogelino commented 3 months ago

Great! Thx for starting that! Well structured too.

A few thoughts:

- Events should also return their time range (when they started and ended), their source, and their keywords. In general, whatever can be used to filter should also be returned.
- I like the idea of a trends route. It is meant as a way to easily get stats for visualisations, right? Maybe a time range should be defined. Or is it always 7 days?
- For the impact route, I agree that having the `start_date` and `end_date` parameters on the first level might be a good idea.

Overall I think this is a very good start of a structure and we should cover most needs with it! 💪

davidpomerenke commented 3 months ago

When querying events, it would be great to get the average impact directly, as well as some metadata. I'm thinking of the event timeline here. It would be costly to first request all events and then request the impact for each individual event. For the metadata, another option would be an `/event/event_id` endpoint or something similar, so we can request and show the info on hover.

@vogelino

davidpomerenke commented 2 months ago

Probably won't implement individual impacts per event, because the data is too noisy.

vogelino commented 2 months ago

In my new designs, I visualise the "reach" of an article. Is there a good way to implement something like this? Or should we avoid any weighting of the event at all? @davidpomerenke

davidpomerenke commented 2 months ago

Good point!

We can calculate this. We want to get all (accessible online) articles about a protest for the table on our protest detail page. And "reach" could just be the number of articles there.
But I'm not sure if we want to put too much focus on "reach", since one of the core ideas is that we can measure the impact beyond reach (but not for individual events).

Maybe we could weight them by number of participants by default (not yet in the API, but PR is ready #75), and allow switching to reach? @vogelino

vogelino commented 2 months ago

We could, if the number of participants is a factor of impact in your opinion. I would avoid adding yet another option for switching between reach and number of participants. The complexity of the UI and the implementation grows exponentially every time a decision isn't taken and is left to the users instead. I think we should decide on what makes the most sense and start with that, and add the option later only if there is a very good justification for it. Talking from experience here... ':D

kleinlennart commented 1 month ago

The way date parameters are currently implemented is not very consistent or logical. Therefore, we should decide on:

- consistent defaults (e.g., set an `end_date` default to the current date on the highest possible level)
  - generally, setting defaults in the Types seems arguably preferable compared to global scope vars in module scripts
- consistent usage for all endpoints (where it makes sense)
- ...

davidpomerenke commented 1 month ago

We should determine the current date within the FastAPI routes in `api.py`; then it will always be the date of the request. Otherwise (if it is defined in global scope) it will be the date when the app was started. Or we just set it in the frontend.

Having request_date as a parameter internally totally makes sense from a functional programming perspective, and especially with respect to caching. I wonder whether we also want to have it in the API. On the one hand, the REST philosophy implies (as I understand it) that a request always gives the same response. On the other hand, users may expect that the API always delivers data that is up to date.

I tend toward making it explicit in the API as well, and not setting a default in the backend (due to the mentioned complication).

Maybe we can also call it end_date again, rather than request_date.
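The difference between a date fixed at app start and a date determined per request can be sketched in plain Python. The names `STARTUP_DATE` and `resolve_end_date` are hypothetical, and the actual routes in `api.py` may be structured differently.

```python
from datetime import date
from typing import Callable, Optional

# Pitfall: this is evaluated once at import time, so a long-running app
# would keep using the date on which it was started.
STARTUP_DATE = date.today()


def resolve_end_date(
    end_date: Optional[date] = None,
    today: Callable[[], date] = date.today,
) -> date:
    """Return the explicit end_date, or today's date evaluated at call time."""
    # `today` is a function, so it is called on every request rather than
    # frozen at start-up; it is injectable to keep the logic testable.
    return end_date if end_date is not None else today()
```

Calling something like this inside the route handler gives the date of the request. Everything below the route can then take `end_date` as an explicit argument, so results stay cacheable per date.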

vogelino commented 1 month ago

I tend to agree that sending `start_date` and `end_date` in the request body should only return data for this range. I would be fine with the API requiring both `start_date` and `end_date`, and returning an error if `start_date` is provided and `end_date` is not. If neither parameter is provided, it should return everything or a paginated result with sensible defaults. These are just some thoughts, and I trust you guys with API design more than myself. :) Just let me know if anything changes. :)
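The rule could be sketched as follows. The function name `validate_date_range` is hypothetical. Rejecting a lone `end_date` and rejecting ranges where the start is after the end are assumptions that go slightly beyond what is stated above.

```python
from datetime import date
from typing import Optional, Tuple


def validate_date_range(
    start_date: Optional[date], end_date: Optional[date]
) -> Optional[Tuple[date, date]]:
    """Return the validated range, or None if no range was requested."""
    if start_date is None and end_date is None:
        # Caller falls back to "everything" or a paginated default.
        return None
    if start_date is None or end_date is None:
        # Assumption: a lone end_date is rejected just like a lone start_date.
        raise ValueError("start_date and end_date must be provided together")
    if start_date > end_date:
        # Assumption: an inverted range is treated as a client error.
        raise ValueError("start_date must not be after end_date")
    return start_date, end_date
```

In the API, the `ValueError` would presumably be translated into a 4xx response, so that clients get an explicit error rather than a silently defaulted range.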

SocialChangeLab / media-impact-monitor

[ORG] API endpoint design #36