filecoin-project / notary-governance


Modification: Store LDN application in JSON files #839

Closed: panges2 closed this issue 4 months ago

panges2 commented 1 year ago

TLDR

Update how Fil+ LDN application data is stored on GitHub to make the system more scalable and reliable. If you are a notary, this should not change how you interact with the system.

Context

The current system of storing all Fil+ LDN application data on GitHub as comments in issues is not scalable, for the reasons listed below. In this proposal, we aim to address these issues by updating the way core Fil+ LDN data is stored on GitHub.

Issues with the Current System:

  1. Constant Querying: Repeatedly polling the GitHub API for every comment on every application, for each notary's application table, is not a scalable solution.
  2. Regex Updates: Updating the format of LDN applications requires updating the parsing regex, which is difficult to test and unreliable. Small changes in the format can cause the system to miss crucial information (see the sketch after this list).
  3. Inconsistent Data: Application comments can be edited or deleted without leaving any history, causing inconsistent data.
  4. Non-Atomic: A GitHub request may take a while to complete, and a comment may be erased or changed during that time, so reads and writes are not atomic.
  5. Non-Standardization: The current way of storing LDN application data on GitHub is not standardized, so a third party that wants to store the data locally has to re-invent the process of retrieving and parsing it. This can lead to inconsistent data and makes it difficult for third parties to integrate the data into their systems.
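
To make the regex point concrete, here is a minimal TypeScript sketch. The comment format and field names are invented for illustration and are not the bot's actual template:

// Regex approach: a one-character change in the heading wording makes the
// match silently return null. (Hypothetical comment format.)
const comment = "#### Datacap Allocated\n> 50TiB";
const allocated = comment.match(/#### Datacap Allocated\s*\n> (.+)/)?.[1] ?? null;

// JSON approach: the field is addressed by name, so reordering or whitespace
// changes in the file do not break parsing.
const application = JSON.parse('{"allocationAmount": "50TiB"}');
console.log(allocated, application.allocationAmount);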

Role of Github

Why still use Github? GitHub provides a simple way to store large amounts of data with the following advantages:

  1. Openness: GitHub allows anyone to view, edit and contribute to the data, making it a collaborative platform.
  2. Cost-effective: GitHub is a free platform that eliminates the need for expensive database infrastructure and maintenance.
  3. Customer Relationship Management (CRM): GitHub provides a convenient way to organize and manage customer data for each application, making it a useful CRM tool.

Proposed new application flow

Benefits of this new flow

  1. Standardization
    1. Easier to scrape (see the fetch sketch after this list)
    2. Easier to parse and do analysis on
    3. Easier to build input sources like filplus.storage
  2. T&T (Trust & Transparency)
    1. We don’t lose applications / application data just because an applicant decided to delete it.
    2. Commit history in addition to JSON history of each allocation
  3. Reliability
    1. Label management and comment automation on long-lived issues is very complicated, and parsing a long history written by unreliable writers is difficult. This system reduces the lifecycle of each “open topic” on GitHub to a single allocation rather than an entire application
    2. More “atomic”, since opening a PR is a more deliberate and harder-to-mutate action than posting a comment
  4. Maintaining core requirements of a Fil+ application DB
    1. DB is open and accessible
    2. DB is “free”
    3. DB can act as a CRM of sorts per application
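
As a sketch of the “easier to scrape” point: with one JSON file per application, a third party could read an application straight from the default branch in a single request, with no comment pagination or regex involved. The repo and file path below are hypothetical, since the actual layout is still to be decided:

// Hypothetical repo layout; one JSON file per application.
const url =
  "https://raw.githubusercontent.com/filecoin-project/notary-governance/main/applications/0.json";

const res = await fetch(url);
if (!res.ok) throw new Error(`fetch failed: ${res.status}`);
const application = await res.json();
console.log(application.applicationLifecycle?.isActive);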

Application Schema

Note: Still subject to change. (Shown here with // comments for annotation; the stored files would be plain JSON.)

{
    "dataCapApplicationType": ["da", "ldn-v3", "e-fil"],
    "projectID": 0,
    "datacapApplicant": "",
    "applicationInfo": {
        "Core Information": {
            "Data Owner Name": "",
            "Data Owner Country/Region": "",
            "Data Owner Industry": "",
            "Website": "",
            "Social Media": ""
        },
        "Project Details": {
            "Share a brief history of your project and organization": "",
            "Is this project associated with other projects/ecosystem stakeholders?": true,
            "If answered yes, what are the other projects/ecosystem stakeholders": ""
        },
        // multi-choice fields below list their allowed options
        "Use-case Details": {
            "Describe the data being stored onto Filecoin": "",
            "Where was the data currently stored in this dataset sourced from": ["AWS Cloud", "Google Cloud", "Azure Cloud", "My Own Storage Infra", "other"],
            "If you answered 'Other' in the previous question, enter the details here": "",
            "How do you plan to prepare the dataset": ["IPFS", "Lotus", "Singularity", "Graphsplit", "others/custom tool"],
            "If you answered 'other/custom tool' in the previous question, enter the details here": "",
            "Please share a sample of the data (A link to a file, an image, a table, etc., are good ways to do this.)": "",
            "Confirm that this is a public dataset that can be retrieved by anyone on the Network (i.e., no specific permissions or access rights are required to view the data)": true,
            "If you chose not to confirm, what was the reason": "",
            "What is the expected retrieval frequency for this data": ["Daily", "Weekly", "Monthly", "Yearly", "Sporadic", "Never"],
            "For how long do you plan to keep this dataset stored on Filecoin": ["Less than a year", "1 to 1.5 years", "1.5 to 2 years", "2 to 3 years", "More than 3 years", "Permanently"]
        },
        "Datacap Allocation Plan": {
            "In which geographies do you plan on making storage deals": [],
            "How will you be distributing your data to storage providers": [],
            "How do you plan to choose storage providers": [],
            "If you answered 'Other' in the previous question, what is the tool or platform you plan to use": "",
            "If you already have a list of storage providers to work with, fill out their names and provider IDs below": "",
            "How do you plan to make deals to your storage providers": [],
            "If you answered 'Others/custom tool' in the previous question, enter the details here": "",
            "Can you confirm that you will follow the Fil+ guideline (Data owner should engage at least 4 SPs and no single SP ID should receive >30% of a client's allocated DataCap)": ""
        }
    },
    "applicationLifecycle": {
        "validatedTime": 0,        // datetime
        "firstAllocationTime": 0,  // datetime
        "isActive": true,          // more DataCap is expected
        "timeOfNewState": 0        // datetime for when it was last updated
    },
    "dataCapAllocations": [
        // each entry is one DataCap tranche
        {
            "uuid": 0,
            "clientAddress": "f1...",
            "timeOfRequest": 0,     // datetime
            "timeOfAllocation": 0,  // datetime
            "notaryAddress": "",    // could be a multisig
            "allocationAmount": 0,
            "signers": [
                {
                    "signingAddress": "",
                    "timeOfSignature": 0,  // datetime
                    "messageCID": ""
                },
                {
                    "signingAddress": "",
                    "timeOfSignature": 0,  // datetime
                    "messageCID": ""
                }
            ],
            "pr": 0,
            "pr-cid": "bafy..."
        }
    ]
}
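
A rough TypeScript mirror of the draft schema above, to make the field types explicit. Names follow the JSON draft; everything here is illustrative and will change if the schema changes:

interface Signer {
  signingAddress: string;
  timeOfSignature: number; // datetime
  messageCID: string;
}

interface DataCapTranche {
  uuid: number;
  clientAddress: string; // f1...
  timeOfRequest: number;    // datetime
  timeOfAllocation: number; // datetime
  notaryAddress: string;    // could be a multisig
  allocationAmount: number;
  signers: Signer[];
  pr: number;
  "pr-cid": string; // bafy...
}

interface ApplicationLifecycle {
  validatedTime: number;
  firstAllocationTime: number;
  isActive: boolean; // more DataCap is expected
  timeOfNewState: number;
}

interface LdnApplication {
  dataCapApplicationType: ("da" | "ldn-v3" | "e-fil")[];
  projectID: number;
  datacapApplicant: string;
  applicationInfo: Record<string, Record<string, unknown>>;
  applicationLifecycle: ApplicationLifecycle;
  dataCapAllocations: DataCapTranche[];
}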

Timeline

week 1-2: finalizing discussion
week 3-4: finalizing design internally
week 5-9: implementation
week 9-14: testing and fixes

Technical dependencies

Tooling for the registry and the SSA bot will have to change and be redeployed

End of POC checkpoint (if applicable)

week 14

Risks and mitigations

cryptowhizzard commented 1 year ago

Looks great!

Will user data from GitHub and comments also be available to scrape?

How will this be connected to KYC?

AlexxNica commented 1 year ago

Great proposal! This also helps mitigate some risks mentioned in #793 (similar to @dkkapur's suggestion https://github.com/filecoin-project/notary-governance/discussions/793#discussioncomment-4269944). With this proposal we could use git natively for backups without the overhead of parsing every issue and comment, and it would let us use some other nice GitHub features like workflows/actions more effectively.

Some quick thoughts:

fabriziogianni7 commented 1 year ago

Technical Considerations

In order to implement this new flow we need a system of new branches, commits, and PRs (a sketch using the GitHub API follows the list below).

When a new application is created, or there is a new DataCap request, we need to:

  1. create a new branch
    1a. save the branch name somewhere (how is the front-end going to know which branch to put the commit in?)
  2. when a notary proposes or approves a DC request, create a new commit in the branch
  3. when we have 2 signatures, merge the PR
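
A minimal sketch of this branch → commit → merge flow using Octokit and the GitHub REST API. The repo name, branch naming, and file path are assumptions for illustration, not a defined standard:

import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = "filecoin-project";
const repo = "notary-governance"; // hypothetical target repo
const branch = "application-0";   // hypothetical branch naming convention

// 1. Create a new branch off the default branch.
const main = await octokit.rest.git.getRef({ owner, repo, ref: "heads/main" });
await octokit.rest.git.createRef({
  owner, repo,
  ref: `refs/heads/${branch}`,
  sha: main.data.object.sha,
});

// 2. Commit the application JSON to that branch (each proposal/approval
//    adds a new commit updating the same file).
await octokit.rest.repos.createOrUpdateFileContents({
  owner, repo, branch,
  path: "applications/0.json", // hypothetical path
  message: "Notary proposal for application 0",
  content: Buffer.from(JSON.stringify({ projectID: 0 }, null, 2)).toString("base64"),
});

// 3. Open the PR; once two signatures are collected, merge it.
const pr = await octokit.rest.pulls.create({
  owner, repo, base: "main", head: branch,
  title: "DataCap allocation: application 0",
});
await octokit.rest.pulls.merge({ owner, repo, pull_number: pr.data.number });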

We need to decide between having a single big JSON file with all the applications, or one file per application.

One big JSON file. Pros: it will be easy to scrape, and all the information will be in one place. Cons: if we merge 2 or more branches at the same time we will get a conflict and the merge won't take place, requiring human intervention to unblock the situation; and if there is any mistake, reverting the PR can be very hard and insecure.

Many files. Pros: we shouldn't have conflict problems. Cons: we will have more than 1k files in the repo, and we will hit the same fetching limits we have right now (one possible mitigation is sketched below).
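
On the fetching-limits concern: one possible mitigation (my assumption, not part of the proposal) is to list every application file in a single call via the git trees API, then fetch raw file contents separately rather than one API call per file. A sketch:

import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// One request returns the entire file listing of the branch.
const tree = await octokit.rest.git.getTree({
  owner: "filecoin-project",
  repo: "notary-governance", // hypothetical target repo
  tree_sha: "main",
  recursive: "true",
});

const applicationPaths = tree.data.tree
  .filter((e) => e.type === "blob" && e.path?.startsWith("applications/"))
  .map((e) => e.path);
console.log(`${applicationPaths.length} application files`);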

huseyincansoylu commented 1 year ago

I agree with fabrizio about both the technical considerations and the pros/cons of having one or many JSON files. I think we should discuss the above 3 items in more detail.

Maybe we can create a new JSON file periodically, such as every week or month. Maybe this can help with both the merge and API limit problems.

panges2 commented 1 year ago

Hi @fabriziogianni7 @huseyincansoylu

panges2 commented 1 year ago

@cryptowhizzard

Aaron01230 commented 1 year ago

I support this proposal. The JSON format is API-friendly, but errors are likely when LDN applicants write it by hand, so an easy-to-use front-end tool is needed to help them generate this JSON

orvn commented 1 year ago

I've started a related, broader discussion in #891, but I'd like to make some more specific points here:

This is a path forward that gives us the best of both worlds: (1) structure and automation, and (2) transparency and community usability.

jbesraa commented 1 year ago

hey @orvn

  1. I think it will still be possible to start an application through GitHub; I don't see a reason why not. As soon as we define how the pull request needs to look (branch name, PR title, PR files, etc.), it's just a matter of following the standard. I am not sure that will be easier than creating an application from filplus.storage or through an issue, but it will definitely be possible.

  2. I would rely solely on the data inside the JSON files as the source of truth (at least from the code perspective), unless we define otherwise in the standard.

orvn commented 1 year ago

  1. I think it will still be possible to start an application through GitHub; I don't see a reason why not. As soon as we define how the pull request needs to look (branch name, PR title, PR files, etc.), it's just a matter of following the standard. I am not sure that will be easier than creating an application from filplus.storage or through an issue, but it will definitely be possible.

Yes, that's the goal. While most applications will come through filplus.storage or other similar future channels, the user should still be able to make a PR directly. A PR template could enforce some body format at least. I don't know if the forked branch naming convention matters too much (but if you have a reason I'm not thinking of, lmk @jbesraa!)

  2. I would rely solely on the data inside the JSON files as the source of truth (at least from the code perspective), unless we define otherwise in the standard.

Agree, the JSON files are the source of truth now, so basically we want them merged into the codebase as efficiently as possible, so that the main branch is always as current as possible. The repo itself will only represent active DataCap allocations, and not unapproved applications the way the issues do right now, correct?

jbesraa commented 1 year ago

We can't use forks if we want to keep using issues as an entry point (i.e., creating an application through an issue that is then transferred to a PR), because we can't create a fork on the user's behalf without an OAuth grant from them.

Active pull requests will represent one of the following (a small modeling sketch follows):

  1. new application
  2. removal request
  3. refill request
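
A tiny TypeScript sketch of how tooling might tag these PR kinds, e.g. from a title convention. The prefixes here are illustrative only, not a defined standard:

// Hypothetical PR kinds and title prefixes.
type PullRequestKind = "new-application" | "removal-request" | "refill-request";

function classifyPr(title: string): PullRequestKind | null {
  if (title.startsWith("Application:")) return "new-application";
  if (title.startsWith("Removal:")) return "removal-request";
  if (title.startsWith("Refill:")) return "refill-request";
  return null; // unknown titles fall back to manual triage
}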