filecoin-project / notary-governance


Modification: Store LDN application in JSON files #839

Closed: panges2 closed this issue 4 months ago

panges2 commented 1 year ago

TLDR

Update how Fil+ LDN application data is stored on GitHub to make the system more scalable and reliable. If you are a notary, this should not change how you interact with the system.

Context

The current system of storing all Fil+ LDN application data on GitHub as comments in issues is not scalable, for the reasons listed below. In this proposal, we aim to address these issues by updating the way core Fil+ LDN data is stored on GitHub.

Issues with the Current System:

  1. Constant Querying: Repeatedly polling the GitHub API for every comment on every application, for each notary's application table, is not a scalable solution.
  2. Regex Updates: Updating the format of LDN applications requires updating the parsing regex, which is difficult to test and unreliable. Small changes in the format can cause the system to miss crucial information (see the sketch after this list).
  3. Inconsistent Data: Application comments can be edited or deleted without leaving any history, causing inconsistent data.
  4. Non-Atomic: A GitHub request may take a while to complete, and a comment may be erased or changed during that time, so reads and writes are not atomic.
  5. Non-Standardization: The current way of storing LDN application data on GitHub is not standardized, so a third party that wants to store the data locally has to re-invent the process of retrieving and parsing it. This can lead to inconsistent data and makes it difficult for third parties to integrate the data into their systems.
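
To make the regex point concrete, here is a minimal TypeScript sketch. The comment format and field names are invented for illustration and are not the bot's actual template:

// Regex approach: a one-character change in the heading wording makes the
// match silently return null. (Hypothetical comment format.)
const comment = "#### Datacap Allocated\n> 50TiB";
const allocated = comment.match(/#### Datacap Allocated\s*\n> (.+)/)?.[1] ?? null;

// JSON approach: the field is addressed by name, so reordering or whitespace
// changes in the file do not break parsing.
const application = JSON.parse('{"allocationAmount": "50TiB"}');
console.log(allocated, application.allocationAmount);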

Role of Github

Why still use Github? GitHub provides a simple way to store large amounts of data with the following advantages:

  1. Openness: GitHub allows anyone to view, edit and contribute to the data, making it a collaborative platform.
  2. Cost-effective: GitHub is a free platform that eliminates the need for expensive database infrastructure and maintenance.
  3. Customer Relationship Management (CRM): GitHub provides a convenient way to organize and manage customer data for each application, making it a useful CRM tool.

Proposed new application flow

Benefits of this new flow

  1. Standardization
    1. Easier to scrape (see the fetch sketch after this list)
    2. Easier to parse and do analysis on
    3. Easier to build input sources like filplus.storage
  2. T&T (Trust & Transparency)
    1. We don’t lose applications / application data just because an applicant decided to delete it.
    2. Commit history in addition to JSON history of each allocation
  3. Reliability
    1. Label management and comment automation on long-lived issues is very complicated, and parsing a long history written by unreliable writers is difficult. This system reduces the lifecycle of each “open topic” on GitHub to a single allocation rather than an entire application
    2. More “atomic”, since opening a PR is a more deliberate and harder-to-mutate action than posting a comment
  4. Maintaining core requirements of a Fil+ application DB
    1. DB is open and accessible
    2. DB is “free”
    3. DB can act as a CRM of sorts per application
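
As a sketch of the “easier to scrape” point: with one JSON file per application, a third party could read an application straight from the default branch in a single request, with no comment pagination or regex involved. The repo and file path below are hypothetical, since the actual layout is still to be decided:

// Hypothetical repo layout; one JSON file per application.
const url =
  "https://raw.githubusercontent.com/filecoin-project/notary-governance/main/applications/0.json";

const res = await fetch(url);
if (!res.ok) throw new Error(`fetch failed: ${res.status}`);
const application = await res.json();
console.log(application.applicationLifecycle?.isActive);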

Application Schema

Note: Still subject to change. (Shown here with // comments for annotation; the stored files would be plain JSON.)

{
    "dataCapApplicationType": ["da", "ldn-v3", "e-fil"],
    "projectID": 0,
    "datacapApplicant": "",
    "applicationInfo": {
        "Core Information": {
            "Data Owner Name": "",
            "Data Owner Country/Region": "",
            "Data Owner Industry": "",
            "Website": "",
            "Social Media": ""
        },
        "Project Details": {
            "Share a brief history of your project and organization": "",
            "Is this project associated with other projects/ecosystem stakeholders?": true,
            "If answered yes, what are the other projects/ecosystem stakeholders": ""
        },
        // multi-choice fields below list their allowed options
        "Use-case Details": {
            "Describe the data being stored onto Filecoin": "",
            "Where was the data currently stored in this dataset sourced from": ["AWS Cloud", "Google Cloud", "Azure Cloud", "My Own Storage Infra", "other"],
            "If you answered 'Other' in the previous question, enter the details here": "",
            "How do you plan to prepare the dataset": ["IPFS", "Lotus", "Singularity", "Graphsplit", "others/custom tool"],
            "If you answered 'other/custom tool' in the previous question, enter the details here": "",
            "Please share a sample of the data (A link to a file, an image, a table, etc., are good ways to do this.)": "",
            "Confirm that this is a public dataset that can be retrieved by anyone on the Network (i.e., no specific permissions or access rights are required to view the data)": true,
            "If you chose not to confirm, what was the reason": "",
            "What is the expected retrieval frequency for this data": ["Daily", "Weekly", "Monthly", "Yearly", "Sporadic", "Never"],
            "For how long do you plan to keep this dataset stored on Filecoin": ["Less than a year", "1 to 1.5 years", "1.5 to 2 years", "2 to 3 years", "More than 3 years", "Permanently"]
        },
        "Datacap Allocation Plan": {
            "In which geographies do you plan on making storage deals": [],
            "How will you be distributing your data to storage providers": [],
            "How do you plan to choose storage providers": [],
            "If you answered 'Other' in the previous question, what is the tool or platform you plan to use": "",
            "If you already have a list of storage providers to work with, fill out their names and provider IDs below": "",
            "How do you plan to make deals to your storage providers": [],
            "If you answered 'Others/custom tool' in the previous question, enter the details here": "",
            "Can you confirm that you will follow the Fil+ guideline (Data owner should engage at least 4 SPs and no single SP ID should receive >30% of a client's allocated DataCap)": ""
        }
    },
    "applicationLifecycle": {
        "validatedTime": 0,        // datetime
        "firstAllocationTime": 0,  // datetime
        "isActive": true,          // more DataCap is expected
        "timeOfNewState": 0        // datetime for when it was last updated
    },
    "dataCapAllocations": [
        // each entry is one DataCap tranche
        {
            "uuid": 0,
            "clientAddress": "f1...",
            "timeOfRequest": 0,     // datetime
            "timeOfAllocation": 0,  // datetime
            "notaryAddress": "",    // could be a multisig
            "allocationAmount": 0,
            "signers": [
                {
                    "signingAddress": "",
                    "timeOfSignature": 0,  // datetime
                    "messageCID": ""
                },
                {
                    "signingAddress": "",
                    "timeOfSignature": 0,  // datetime
                    "messageCID": ""
                }
            ],
            "pr": 0,
            "pr-cid": "bafy..."
        }
    ]
}
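
A rough TypeScript mirror of the draft schema above, to make the field types explicit. Names follow the JSON draft; everything here is illustrative and will change if the schema changes:

interface Signer {
  signingAddress: string;
  timeOfSignature: number; // datetime
  messageCID: string;
}

interface DataCapTranche {
  uuid: number;
  clientAddress: string; // f1...
  timeOfRequest: number;    // datetime
  timeOfAllocation: number; // datetime
  notaryAddress: string;    // could be a multisig
  allocationAmount: number;
  signers: Signer[];
  pr: number;
  "pr-cid": string; // bafy...
}

interface ApplicationLifecycle {
  validatedTime: number;
  firstAllocationTime: number;
  isActive: boolean; // more DataCap is expected
  timeOfNewState: number;
}

interface LdnApplication {
  dataCapApplicationType: ("da" | "ldn-v3" | "e-fil")[];
  projectID: number;
  datacapApplicant: string;
  applicationInfo: Record<string, Record<string, unknown>>;
  applicationLifecycle: ApplicationLifecycle;
  dataCapAllocations: DataCapTranche[];
}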

Timeline

week 1-2: finalizing discussion
week 3-4: finalizing design internally
week 5-9: implementation
week 9-14: testing and fixes

Technical dependencies

Tooling for the registry and the SSA bot will have to change and be redeployed

End of POC checkpoint (if applicable)

week 14

Risks and mitigations

cryptowhizzard commented 1 year ago

Looks great!

Will user data from GitHub and comments also be available to scrape?

How will this be connected to KYC?

AlexxNica commented 1 year ago

Great proposal! This also helps mitigate some risks mentioned in #793 (similar to @dkkapur's suggestion https://github.com/filecoin-project/notary-governance/discussions/793#discussioncomment-4269944). With this proposal we could use git natively for backups without the overhead of parsing every issue and comment, and it would let us use some other nice GitHub features like workflows/actions more effectively.

Some quick thoughts:

fabriziogianni7 commented 1 year ago

Technical Considerations

In order to implement this new flow we need a system of new branches, commits, and PRs (a sketch using the GitHub API follows the list below).

When a new application is created, or there is a new DataCap request, we need to:

  1. create a new branch
    1a. save the branch name somewhere (how is the front-end going to know which branch to put the commit in?)
  2. when a notary proposes or approves a DC request, create a new commit in the branch
  3. when we have 2 signatures, merge the PR
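
A minimal sketch of this branch → commit → merge flow using Octokit and the GitHub REST API. The repo name, branch naming, and file path are assumptions for illustration, not a defined standard:

import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = "filecoin-project";
const repo = "notary-governance"; // hypothetical target repo
const branch = "application-0";   // hypothetical branch naming convention

// 1. Create a new branch off the default branch.
const main = await octokit.rest.git.getRef({ owner, repo, ref: "heads/main" });
await octokit.rest.git.createRef({
  owner, repo,
  ref: `refs/heads/${branch}`,
  sha: main.data.object.sha,
});

// 2. Commit the application JSON to that branch (each proposal/approval
//    adds a new commit updating the same file).
await octokit.rest.repos.createOrUpdateFileContents({
  owner, repo, branch,
  path: "applications/0.json", // hypothetical path
  message: "Notary proposal for application 0",
  content: Buffer.from(JSON.stringify({ projectID: 0 }, null, 2)).toString("base64"),
});

// 3. Open the PR; once two signatures are collected, merge it.
const pr = await octokit.rest.pulls.create({
  owner, repo, base: "main", head: branch,
  title: "DataCap allocation: application 0",
});
await octokit.rest.pulls.merge({ owner, repo, pull_number: pr.data.number });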

We need to decide between having a single big JSON file with all the applications, or one file per application.

One big JSON file. Pros: it will be easy to scrape, and all the information will be in one place. Cons: if we merge 2 or more branches at the same time we will get a conflict and the merge won't take place, requiring human intervention to unblock the situation; and if there is any mistake, reverting the PR can be very hard and insecure.

Many files. Pros: we shouldn't have conflict problems. Cons: we will have more than 1k files in the repo, and we will hit the same fetching limits we have right now (one possible mitigation is sketched below).
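
On the fetching-limits concern: one possible mitigation (my assumption, not part of the proposal) is to list every application file in a single call via the git trees API, then fetch raw file contents separately rather than one API call per file. A sketch:

import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// One request returns the entire file listing of the branch.
const tree = await octokit.rest.git.getTree({
  owner: "filecoin-project",
  repo: "notary-governance", // hypothetical target repo
  tree_sha: "main",
  recursive: "true",
});

const applicationPaths = tree.data.tree
  .filter((e) => e.type === "blob" && e.path?.startsWith("applications/"))
  .map((e) => e.path);
console.log(`${applicationPaths.length} application files`);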

huseyincansoylu commented 1 year ago

I agree with fabrizio about both the technical considerations and the pros/cons of having one or many JSON files. I think we should discuss the above 3 items in more detail.

Maybe we can create a new JSON file periodically, such as every week or month. Maybe this can help with both the merge and API limit problems.

panges2 commented 1 year ago

Hi @fabriziogianni7 @huseyincansoylu

panges2 commented 1 year ago

@cryptowhizzard

Aaron01230 commented 1 year ago

I support this proposal. The JSON format is API-friendly, but errors are likely when LDN applicants write it by hand, so an easy-to-use front-end tool is needed to help them generate this JSON

orvn commented 1 year ago

I've started a related, broader discussion in #891, but I'd like to make some more specific points here:

This is a path forward that gives us the best of both worlds: (1) structure and automation, and (2) transparency and community usability.

jbesraa commented 1 year ago

hey @orvn

  1. I think it will still be possible to start an application through GitHub; I don't see a reason why not. As soon as we define how the pull request needs to look (branch name, PR title, PR files, etc.), it's just a matter of following the standard. I am not sure that will be easier than creating an application from filplus.storage or through an issue, but it will definitely be possible.

  2. I would rely solely on the data inside the JSON files as the source of truth (at least from the code perspective), unless we define otherwise in the standard.

orvn commented 1 year ago

  1. I think it will still be possible to start an application through GitHub; I don't see a reason why not. As soon as we define how the pull request needs to look (branch name, PR title, PR files, etc.), it's just a matter of following the standard. I am not sure that will be easier than creating an application from filplus.storage or through an issue, but it will definitely be possible.

Yes, that's the goal. While most applications will come through filplus.storage or other similar future channels, the user should still be able to make a PR directly. A PR template could enforce some body format at least. I don't know if the forked branch naming convention matters too much (but if you have a reason I'm not thinking of, lmk @jbesraa!)

  2. I would rely solely on the data inside the JSON files as the source of truth (at least from the code perspective), unless we define otherwise in the standard.

Agree, the JSON files are the source of truth now, so basically we want them merged into the codebase as efficiently as possible, so that the main branch is always as current as possible. The repo itself will only represent active DataCap allocations, and not unapproved applications the way the issues do right now, correct?

jbesraa commented 1 year ago

We can't use forks if we want to keep using issues as an entry point (i.e., creating an application through an issue that is then transferred to a PR), because we can't create a fork on the user's behalf without an OAuth grant from them.

Active pull requests will represent one of the following (a small modeling sketch follows):

  1. new application
  2. removal request
  3. refill request
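
A tiny TypeScript sketch of how tooling might tag these PR kinds, e.g. from a title convention. The prefixes here are illustrative only, not a defined standard:

// Hypothetical PR kinds and title prefixes.
type PullRequestKind = "new-application" | "removal-request" | "refill-request";

function classifyPr(title: string): PullRequestKind | null {
  if (title.startsWith("Application:")) return "new-application";
  if (title.startsWith("Removal:")) return "removal-request";
  if (title.startsWith("Refill:")) return "refill-request";
  return null; // unknown titles fall back to manual triage
}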