PhilanthropyDataCommons / service

A project for collecting and serving public information associated with grant applications
GNU Affero General Public License v3.0

Document that "Source" is "last hop" and not "original" #1251

Open bickelj opened 1 month ago

bickelj commented 1 month ago

If I understand correctly, the primary use of the "system" source (provenance) is to fill in some source for old data that predates users declaring (or PDC automatically detecting) sources of data. While it was convenient to be able to associate some new proposal data with the "system" source in local testing, I think it should only be allowed for data that predates the introduction of the /sources endpoint. In other words, anybody who can post proposal data can also add a new source, so we should require that they do. Otherwise, I'm confident we'll see data where the source chosen is the "system" source, and that will be detrimental to provenance.

slifty commented 1 month ago

Two thoughts: There are some potential use cases for a system source -- for instance if we develop a PDC daemon that adds data programmatically somehow based on other data (e.g. "we want to generate lat / lngs from addresses"). We don't have that in place right now of course, but I guess I just mean that PDC could be a source of data.

It seems to me the ultimate issue here is that users should not be able to specify a source that they are not associated with. I think the best solution to this concern is to add permissions to sources (which is part of the roadmap). Right now any user can claim that their data is from any source, which is similarly detrimental to provenance.
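As a rough illustration of what a source permission check could look like, here is a minimal sketch. The `User` shape, the `permittedSourceIds` field, and the `canUseSource` function are all hypothetical names made up for this example; they are not part of the current PDC API.

```typescript
// Hypothetical shape: the real PDC user model may look quite different.
interface User {
  id: string;
  // IDs of the sources this user has been associated with (assumed field).
  permittedSourceIds: ReadonlySet<number>;
}

// Returns true only when the user is associated with the source they name.
function canUseSource(user: User, sourceId: number): boolean {
  return user.permittedSourceIds.has(sourceId);
}

const foundationUser: User = {
  id: "user-1",
  permittedSourceIds: new Set([10, 11]),
};

console.log(canUseSource(foundationUser, 10)); // associated source
console.log(canUseSource(foundationUser, 99)); // unrelated source
```

With a check like this in front of proposal creation, a user could no longer claim an arbitrary source, the "system" source included, unless they had been granted access to it.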

Thoughts?

bickelj commented 1 month ago

The PDC daemon has a lot of options in that case:

  1. repeat the same source that was ultimately used,
  2. create a new source as a compound or related source, for example named "PDC generated from original source X using software Y", or
  3. create a simpler new source, for example named "Generated addresses."
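Options (2) and (3) could be sketched roughly as follows. The `SourceDraft` shape and both helper functions are hypothetical and only illustrate how such sources might be labeled; they do not reflect the actual source schema.

```typescript
// Hypothetical shape: only a label is modeled here.
interface SourceDraft {
  label: string;
}

// Option 2: a compound source that names both the original source
// and the software that derived the new data from it.
function compoundSource(originalLabel: string, software: string): SourceDraft {
  return {
    label: `PDC generated from original source ${originalLabel} using software ${software}`,
  };
}

// Option 3: a simpler source that only describes the generated data.
function simpleSource(description: string): SourceDraft {
  return { label: description };
}

console.log(compoundSource("X", "Y").label);
console.log(simpleSource("Generated addresses").label);
```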

It is doubtful that (1) is going to be the system source once real data are posted because right now all those system data are generated. Otherwise, yeah, I can see it being needed. But I'd prefer (2) or (3) anyway.

I think it's OK for users to be able to say "these data came from over there." For example, Foundation X got the data from Changemaker Y and marks it so.

Perhaps this raises a question as to what the meaning of "Source" is: is it "Ultimate source" or "last hop source"?

In any case, your note that it is intentional to keep the system source open to use is enough to defer/delay/de-prioritize this issue.

slifty commented 1 month ago

> I think it's OK for users to be able to say "these data came from over there." For example, Foundation X got the data from Changemaker Y and marks it so.

I do disagree here -- the foundation is ALWAYS getting the data from Changemaker Y (they have a GMS, and the changemaker entered their data into the GMS). However, when the foundation sends data to us from their GMS, the foundation is the source, not the changemaker.

Source is generally intended to mean the "last hop source" before the data entered our system. The question it answers is "what was the most recent data set that we used to enter the data into the PDC?" as opposed to "where did the data originally come from?"

For instance, a data provider that scraped data from the IRS would not be saying the data source is the IRS -- it would be (e.g.) Charity Navigator.
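To make the distinction concrete, here is a minimal sketch that models a provenance chain and picks out both readings of "source". The `Hop` and `ProvenanceChain` types and the two functions are hypothetical; the PDC only stores a single source today, not a chain.

```typescript
// Hypothetical types: the PDC does not currently model a chain of hops.
interface Hop {
  holder: string; // who held the data at this step
}

// Ordered from the original source to the party that sent the data to the PDC.
type ProvenanceChain = readonly Hop[];

// "Last hop" reading: the most recent holder before the data entered the PDC.
function lastHopSource(chain: ProvenanceChain): string {
  const last = chain[chain.length - 1];
  if (last === undefined) {
    throw new Error("a provenance chain needs at least one hop");
  }
  return last.holder;
}

// "Original" reading: where the data first came from.
function originalSource(chain: ProvenanceChain): string {
  const first = chain[0];
  if (first === undefined) {
    throw new Error("a provenance chain needs at least one hop");
  }
  return first.holder;
}

const scraped: ProvenanceChain = [
  { holder: "IRS" },
  { holder: "Charity Navigator" },
];

console.log(lastHopSource(scraped)); // "Charity Navigator"
console.log(originalSource(scraped)); // "IRS"
```

Under the "last hop" reading, the source recorded for this data would be Charity Navigator, even though the IRS is where it originated.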

All that said, our current implementation does not prevent individual users from making their own decisions about how to interpret source. This was intentional from an MVP perspective when we implemented the feature, but it leaves some of these questions open to interpretation.

bickelj commented 1 month ago

I see. I agree that keeping the "system" source is coherent with the "last hop" view, and I can see the utility of the last hop. I wish our provenance could go deeper somehow, though. Let's not leave it up to interpretation; I'd rather document the expectations in the API.