filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] Kernelogic - Various open datasets onboarding #457

Closed: kernelogic closed this issue 1 year ago

kernelogic commented 2 years ago

Large Dataset Notary Application

To apply for DataCap to onboard your dataset to Filecoin, please fill out the following.

Core Information

Please respond to the questions below by replacing the text saying "Please answer here". Include as much detail as you can in your answer.

Project details

Share a brief history of your project and organization.

This will probably be my last individual Slingshot 2.x LDN. The Genomic Data Commons dataset has not yet been stored by many teams.

I have successfully completed several LDNs for other datasets, and my record shows that I have followed the decentralization rules with zero self-dealing.

https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/60
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/59
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/46
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/297
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/298
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/304

What is the primary source of funding for this project?

I hope to use this DataCap at no cost with active community providers on Slack, especially SPX / enterprise-sp-wg members.

What other projects/ecosystem stakeholders is this project associated with?

Slingshot, enterprise-sp-wg.

Use-case details

Describe the data being stored onto Filecoin

Archival data from the Genomic Data Commons.

Because of download restrictions on the original dataset, this LDN is now repurposed to store various AWS open datasets that qualify for the existing Slingshot 2 and the upcoming Slingshot 3.

Where was the data in this dataset sourced from?

https://portal.gdc.cancer.gov/

1. (New) Ford Multi-AV Seasonal Dataset
2. (New) Cancer Cell Line Encyclopedia (CCLE)
3. (New) Allen Brain Observatory - Visual Coding AWS Public Data Set
4. (Existing) Fly Brain Anatomy
5. (Existing) Foldingathome COVID-19
6. (Existing) NASANEX

Can you share a sample of the data? A link to a file, an image, a table, etc., are good ways to do this.

Raw sequence data, stored as BAM files, make up the bulk of data stored at the NCI Genomic Data Commons (GDC). https://portal.gdc.cancer.gov/files/54ac0975-cce0-40a9-a557-9a1c938ce167

https://registry.opendata.aws/ford-multi-av-seasonal/
https://registry.opendata.aws/ccle/
https://registry.opendata.aws/allen-brain-observatory/
https://registry.opendata.aws/janelia-flylight/
https://registry.opendata.aws/foldingathome-covid19/
https://registry.opendata.aws/nasanex/

Confirm that this is a public dataset that can be retrieved by anyone on the Network (i.e., no specific permissions or access rights are required to view the data).

These are curated Slingshot datasets: https://github.com/filecoin-project/slingshot/blob/master/datasets.md

What is the expected retrieval frequency for this data?

I expect some retrievals during the prize judging period, as well as retrievals by anyone interested in downloading this dataset.

For how long do you plan to keep this dataset stored on Filecoin?

Per the Slingshot rules, a minimum of 1 year. Most likely 520 days.

DataCap allocation plan

In which geographies (countries, regions) do you plan on making storage deals?

All regions.

How will you be distributing your data to storage providers? Is there an offline data transfer process?

I will upload my prepared CAR files to a web server and coordinate with providers to download and propose offline deals.

How do you plan on choosing the storage providers with whom you will be making deals? This should include a plan to ensure the data is retrievable in the future both by you and others.

I plan to deal with SPX, approved slingshot restore SPs and enterprise-sp-wg members, as well as any real community providers who are interested.

To name a few from the community that I deal with regularly: PIKNIK, Holon, CabrinaHuang, HarryM, BigBear, j1v, XinAn Xu, WillTechMusing.

I am also exploring auctions on https://www.bigd.exchange/

How will you be distributing deals across storage providers?

Evenly across all providers I propose deals to, as long as they can handle the volume. If a storage provider is itself a notary, it will receive no more than 10% of the total granted DataCap.
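
The 10% cap described in the answer above can be checked with a few lines of Python. This is only a minimal sketch: the provider IDs, deal sizes, and the function name `notary_cap_violations` are hypothetical and do not come from this application; the 10% figure is the only value taken from the answer.

```python
from typing import Dict, List, Set

NOTARY_CAP = 0.10  # from the answer above: a notary-run provider gets at most 10%


def notary_cap_violations(
    deals_tib: Dict[str, float],
    notary_providers: Set[str],
    total_granted_tib: float,
) -> List[str]:
    """Return notary-operated providers whose share of the granted DataCap exceeds the cap."""
    return [
        provider
        for provider, tib in sorted(deals_tib.items())
        if provider in notary_providers and tib / total_granted_tib > NOTARY_CAP
    ]


# Hypothetical example data (not real providers or deal volumes).
example_deals = {"f0aaa": 600.0, "f0bbb": 400.0, "f0ccc": 450.0}
example_notaries = {"f0aaa", "f0ccc"}
print(notary_cap_violations(example_deals, example_notaries, 5120.0))
```

With a 5 PiB (5120 TiB) grant, the hypothetical provider `f0aaa` holds about 11.7% and is flagged, while `f0ccc` at about 8.8% is not.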

Do you have the resources/funding to start making deals as soon as you receive DataCap? What support from the community would help you onboard onto Filecoin?

I have all I need to start making deals.

large-datacap-requests[bot] commented 2 years ago

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!

large-datacap-requests[bot] commented 2 years ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

dkkapur commented 2 years ago

Datacap Request Trigger

Total DataCap requested

5 PiB

Expected weekly DataCap usage rate

500 TiB

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

large-datacap-requests[bot] commented 2 years ago

DataCap Allocation requested

Multisig Notary address

f01858410

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

DataCap allocation requested

250TiB

newwebgroup commented 2 years ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacecyhkqndhk7m46k5b7r7rlhfywj5biqjfolbpuboepvxcen7f3ss6

Address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Datacap Allocated

250.00TiB

Signer Address

f1e77zuityhvvw6u2t6tb5qlnsegy2s67qs4lbbbq

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecyhkqndhk7m46k5b7r7rlhfywj5biqjfolbpuboepvxcen7f3ss6

liyunzhi-666 commented 2 years ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacec6l2kdw5ulfmeygtc56kn53dbn4bxu3chdmnd4bvxfip4y7bk34i

Address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Datacap Allocated

250.00TiB

Signer Address

f1pszcrsciyixyuxxukkvtazcokexbn54amf7gvoq

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacec6l2kdw5ulfmeygtc56kn53dbn4bxu3chdmnd4bvxfip4y7bk34i

kernelogic commented 2 years ago

I want to make some changes to this LDN. The "Genomic Data Commons" dataset in the original application is actually mostly private and requires the downloader to be a medical professional in the US. I did not realize this until I started downloading.

So that this LDN does not go to waste, and because v2.8 is ending soon, I propose changing the dataset for this LDN to the V3 datasets I have already prepared, namely the following for now:

  1. Ford Multi-AV Seasonal Dataset
  2. Cancer Cell Line Encyclopedia (CCLE)
  3. Allen Brain Observatory - Visual Coding AWS Public Data Set

large-datacap-requests[bot] commented 2 years ago

DataCap Allocation requested

Request number 2

Multisig Notary address

f01858410

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

DataCap allocation requested

500TiB

large-datacap-requests[bot] commented 2 years ago

Stats & Info for DataCap Allocation

Multisig Notary address

f01858410

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Last two approvers

liyunzhi-666 & newwebgroup

Rule to calculate the allocation request amount

100% of weekly dc amount requested

DataCap allocation requested

500TiB

Total DataCap granted for client so far

250TiB

Datacap to be granted to reach the total amount requested by the client (5 PiB)

4.75PiB

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
| --- | --- | --- | --- | --- |
| 5906 | 5 | 250TiB | 30.64 | 57.39TiB |

NiwanDao commented 2 years ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacecqt6ls7nno7qllip2ud3wte4rtdwy7xuu6nce6gg6xdbxgcs3zio

Address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Datacap Allocated

500.00TiB

Signer Address

f1a2lia2cwwekeubwo4nppt4v4vebxs2frozarz3q

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecqt6ls7nno7qllip2ud3wte4rtdwy7xuu6nce6gg6xdbxgcs3zio

Destore2023 commented 2 years ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacecouzitfwn745iamsihmjm3ecj6hwpblmbq2xcqe4trjis6kmm3is

Address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Datacap Allocated

500.00TiB

Signer Address

f1yh6q3nmsg7i2sys7f7dexcuajgoweudcqj2chfi

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecouzitfwn745iamsihmjm3ecj6hwpblmbq2xcqe4trjis6kmm3is

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Request number 3

Multisig Notary address

f01858410

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

DataCap allocation requested

1000.0TiB

large-datacap-requests[bot] commented 1 year ago

Stats & Info for DataCap Allocation

Multisig Notary address

f01858410

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Last two approvers

swatchliu & xingjitansuo

Rule to calculate the allocation request amount

200% of weekly dc amount requested

DataCap allocation requested

1000.0TiB

Total DataCap granted for client so far

750TiB

Datacap to be granted to reach the total amount requested by the client (5 PiB)

4.26PiB

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
| --- | --- | --- | --- | --- |
| 22301 | 11 | 500TiB | 14.88 | 106.65TiB |

psh0691 commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzaceadt77pd3gveekzyjeukb45dh5prv3ugt3dfqpek6vc5xvpc2hiqg

Address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Datacap Allocated

1000.00TiB

Signer Address

f1qdko4jg25vo35qmyvcrw4ak4fmuu3f5rif2kc7i

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceadt77pd3gveekzyjeukb45dh5prv3ugt3dfqpek6vc5xvpc2hiqg

xinaxu commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacec46hfq7b5gohzdiomw2mtlmpjp2nzrxqbb4kmej3lomeropkrs3e

Address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Datacap Allocated

1000.00TiB

Signer Address

f1k3ysofkrrmqcot6fkx4wnezpczlltpirmrpsgui

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacec46hfq7b5gohzdiomw2mtlmpjp2nzrxqbb4kmej3lomeropkrs3e

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Request number 4

Multisig Notary address

f01858410

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

DataCap allocation requested

1.95PiB

large-datacap-requests[bot] commented 1 year ago

Stats & Info for DataCap Allocation

Multisig Notary address

f01858410

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Last two approvers

xinaxu & psh0691

Rule to calculate the allocation request amount

400% of weekly dc amount requested

DataCap allocation requested

1.95PiB

Total DataCap granted for client so far

1.70PiB

Datacap to be granted to reach the total amount requested by the client (5 PiB)

3.29PiB

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
| --- | --- | --- | --- | --- |
| 50058 | 18 | 1000.0TiB | 10.75 | 219.46TiB |

xiaoyuaiheshui commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacea6tv2naij2kqwrw7sgbilpkpm3b7jdosbcnpy7fn5pqagbej3fuw

Address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Datacap Allocated

1.95PiB

Signer Address

f122qmy25wdtt5mxd77kndiq7z5x2n3iwiuz2wdsa

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacea6tv2naij2kqwrw7sgbilpkpm3b7jdosbcnpy7fn5pqagbej3fuw

xinaxu commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaced2jsorpgcioal3r3yg3cxjfgjtrrftlmbxbptc446wv7nhfulvzm

Address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Datacap Allocated

1.95PiB

Signer Address

f1k3ysofkrrmqcot6fkx4wnezpczlltpirmrpsgui

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaced2jsorpgcioal3r3yg3cxjfgjtrrftlmbxbptc446wv7nhfulvzm

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Request number 5

Multisig Notary address

f01858410

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

DataCap allocation requested

1.34PiB

Id

9c80b58d-7271-4ccf-b4e2-78343f275e12

large-datacap-requests[bot] commented 1 year ago

Stats & Info for DataCap Allocation

Multisig Notary address

f01858410

Client address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Last two approvers

xinaxu & jggapp

Rule to calculate the allocation request amount

800% of weekly dc amount requested

DataCap allocation requested

1.34PiB

Total DataCap granted for client so far

3.65PiB

Datacap to be granted to reach the total amount requested by the client (5 PiB)

1.34PiB

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
| --- | --- | --- | --- | --- |
| 103074 | 26 | 1.95PiB | 9.14 | 492.60TiB |

newwebgroup commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacednhp4rvpf6vkrqqofsffrcvmtqdgcdbvjzsv3wqheixh5ohrbgbs

Address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Datacap Allocated

1.34PiB

Signer Address

f1e77zuityhvvw6u2t6tb5qlnsegy2s67qs4lbbbq

Id

9c80b58d-7271-4ccf-b4e2-78343f275e12

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacednhp4rvpf6vkrqqofsffrcvmtqdgcdbvjzsv3wqheixh5ohrbgbs

liyunzhi-666 commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacebpu7xb7s6il7o45xhbxvm4ypr45mctyb4whlzbf3yto23ckwuvg2

Address

f1euejrtpg5vphqzydzleld2vgxfkhbrueiomz54y

Datacap Allocated

1.34PiB

Signer Address

f1pszcrsciyixyuxxukkvtazcokexbn54amf7gvoq

Id

9c80b58d-7271-4ccf-b4e2-78343f275e12

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebpu7xb7s6il7o45xhbxvm4ypr45mctyb4whlzbf3yto23ckwuvg2
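
The allocation sizes for requests 2 through 5 in this thread can be reproduced with a short calculation. The sketch below assumes that each tranche is the stated percentage of the weekly rate (500 TiB), capped at whatever is still needed to reach the 5 PiB total. That capping rule is inferred from the numbers in this thread, not taken from the bot's source code, and the function name `next_allocation_tib` is hypothetical.

```python
TIB_PER_PIB = 1024


def next_allocation_tib(weekly_tib, percent_of_weekly, total_requested_tib, granted_so_far_tib):
    """Tranche size: the stated percentage of the weekly rate, capped at the amount still ungranted."""
    by_rule = weekly_tib * percent_of_weekly / 100
    remaining = total_requested_tib - granted_so_far_tib
    return min(by_rule, remaining)


weekly = 500.0                 # expected weekly DataCap usage rate (TiB)
total = 5 * TIB_PER_PIB        # total DataCap requested: 5 PiB
granted = 250.0                # first allocation in this thread (TiB)

# (request number, "Rule to calculate the allocation request amount" in percent)
for request_no, percent in [(2, 100), (3, 200), (4, 400), (5, 800)]:
    tranche = next_allocation_tib(weekly, percent, total, granted)
    print(f"request {request_no}: {tranche:.0f} TiB = {tranche / TIB_PER_PIB:.2f} PiB")
    granted += tranche
```

Under these assumptions the tranches come out as 500 TiB, 1000 TiB, 2000 TiB (about 1.95 PiB), and 1370 TiB (about 1.34 PiB), which matches the amounts the bot requested; the last tranche is limited by the remaining amount rather than by the 800% rule.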

filplus-checker commented 1 year ago

DataCap and CID Checker Report[^1]

If this is the first time a provider has taken a verified deal, it is marked as new.

For most DataCap applications, the restrictions below should apply.

✔️ Storage provider distribution looks healthy.

| Provider | Location | Total Deals Sealed | Percentage | Unique Data | Duplicate Deals |
| --- | --- | --- | --- | --- | --- |
| f0240185 | Clifton, New Jersey, US | 285.73 TiB | 7.40% | 285.73 TiB | 0.00% |
| f02301 | San Jose, California, US | 275.41 TiB | 7.13% | 275.41 TiB | 0.00% |
| f03223 | San Jose, California, US | 275.03 TiB | 7.12% | 275.03 TiB | 0.00% |
| f0143858 | Clifton, New Jersey, US | 271.51 TiB | 7.03% | 271.51 TiB | 0.00% |
| f01877571 | Singapore, Singapore, SG | 224.89 TiB | 5.83% | 224.89 TiB | 0.00% |
| f01882177 | Singapore, Singapore, SG | 224.89 TiB | 5.83% | 224.89 TiB | 0.00% |
| f01880047 | Singapore, Singapore, SG | 224.86 TiB | 5.82% | 224.86 TiB | 0.00% |
| f01882184 | Singapore, Singapore, SG | 224.83 TiB | 5.82% | 224.83 TiB | 0.00% |
| f01878005 | Singapore, Singapore, SG | 224.68 TiB | 5.82% | 224.68 TiB | 0.00% |
| f01859603 | Shenzhen, Guangdong, CN | 206.04 TiB | 5.34% | 205.07 TiB | 0.47% |
| f01933917 | Hong Kong, Central and Western, HK | 175.66 TiB | 4.55% | 175.66 TiB | 0.00% |
| f01926635 | Hong Kong, Central and Western, HK | 175.66 TiB | 4.55% | 175.66 TiB | 0.00% |
| f01885088 | Hong Kong, Central and Western, HK | 175.19 TiB | 4.54% | 175.19 TiB | 0.00% |
| f01900823 | Jinan, Shandong, CN | 132.09 TiB | 3.42% | 131.28 TiB | 0.62% |
| f01128206 | Hong Kong, Central and Western, HK | 131.78 TiB | 3.41% | 131.25 TiB | 0.40% |
| f0216849 (new) | Xiamen, Fujian, CN | 99.16 TiB | 2.57% | 99.16 TiB | 0.00% |
| f01923786 | Hong Kong, Central and Western, HK | 61.32 TiB | 1.59% | 60.70 TiB | 1.02% |
| f01938674 | Shenzhen, Guangdong, CN | 61.01 TiB | 1.58% | 60.73 TiB | 0.46% |
| f01923787 | Shenzhen, Guangdong, CN | 60.95 TiB | 1.58% | 60.76 TiB | 0.31% |
| f01660795 | Shenzhen, Guangdong, CN | 60.81 TiB | 1.58% | 60.81 TiB | 0.00% |
| f01938671 | Hong Kong, Central and Western, HK | 57.17 TiB | 1.48% | 56.82 TiB | 0.60% |
| f01901739 (new) | Sydney, New South Wales, AU | 37.51 TiB | 0.97% | 37.26 TiB | 0.67% |
| f01927554 (new) | Shenzhen, Guangdong, CN | 37.38 TiB | 0.97% | 37.38 TiB | 0.00% |
| f01119939 | Dallas, Texas, US | 37.29 TiB | 0.97% | 37.29 TiB | 0.00% |
| f01896030 (new) | Singapore, Singapore, SG | 37.29 TiB | 0.97% | 37.29 TiB | 0.00% |
| f01222595 | Moscow, Moscow, RU | 37.20 TiB | 0.96% | 37.20 TiB | 0.00% |
| f01402814 | Singapore, Singapore, SG | 35.98 TiB | 0.93% | 35.98 TiB | 0.00% |
| f01821041 | Vancouver, British Columbia, CA | 6.38 TiB | 0.17% | 6.34 TiB | 0.49% |
| f047419 | North Prairie, Wisconsin, US | 2.69 TiB | 0.07% | 2.69 TiB | 0.00% |
| f0164291 | Hong Kong, Central and Western, HK | 96.00 GiB | 0.00% | 96.00 GiB | 0.00% |
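
The "Duplicate Deals" column in the provider table appears to be the share of a provider's sealed data that is not unique. The sketch below works under that assumption, which is inferred from the table's values and is not the checker's actual implementation; the helper names `to_tib` and `duplicate_percent` are hypothetical.

```python
# Conversion factors to TiB for the size strings used in the report.
UNITS_IN_TIB = {"GiB": 1 / 1024, "TiB": 1.0, "PiB": 1024.0}


def to_tib(size: str) -> float:
    """Parse a size such as '206.04 TiB' or '96.00 GiB' into TiB."""
    value, unit = size.split()
    return float(value) * UNITS_IN_TIB[unit]


def duplicate_percent(total_sealed: str, unique_data: str) -> float:
    """Assumed relation: (total sealed - unique data) / total sealed, as a percentage."""
    total = to_tib(total_sealed)
    return (total - to_tib(unique_data)) / total * 100


# Row for provider f01859603 from the report.
print(round(duplicate_percent("206.04 TiB", "205.07 TiB"), 2))
```

For f01859603 this gives 0.47, matching the report. Because the displayed sizes are already rounded to two decimals, other rows can differ from the reported percentage by about 0.01.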

Provider Distribution

Deal Data Replication

The table below shows how much unique data is replicated across storage providers.

✔️ Data replication looks healthy.

| Unique Data Size | Total Deals Made | Number of Providers | Deal Percentage |
| --- | --- | --- | --- |
| 288.00 GiB | 288.00 GiB | 1 | 0.01% |
| 480.00 GiB | 960.00 GiB | 2 | 0.02% |
| 175.19 TiB | 525.56 TiB | 3 | 13.61% |
| 1.00 TiB | 5.00 TiB | 5 | 0.13% |
| 18.75 TiB | 112.84 TiB | 6 | 2.92% |
| 4.66 TiB | 32.66 TiB | 7 | 0.85% |
| 30.55 TiB | 244.69 TiB | 8 | 6.34% |
| 89.86 TiB | 809.08 TiB | 9 | 20.96% |
| 26.94 TiB | 270.00 TiB | 10 | 6.99% |
| 26.34 TiB | 290.25 TiB | 11 | 7.52% |
| 57.00 TiB | 685.00 TiB | 12 | 17.74% |
| 67.91 TiB | 883.72 TiB | 13 | 22.89% |
| 32.00 GiB | 448.00 GiB | 14 | 0.01% |
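
The "Deal Percentage" column in the replication table is consistent with each row's "Total Deals Made" divided by the sum of that column. The following sketch recomputes it from the table; the GiB rows are converted to TiB by dividing by 1024, and the variable and function names are illustrative only.

```python
# (number of providers, total deals made in TiB), taken from the replication table.
REPLICATION_ROWS = [
    (1, 288 / 1024),
    (2, 960 / 1024),
    (3, 525.56),
    (5, 5.00),
    (6, 112.84),
    (7, 32.66),
    (8, 244.69),
    (9, 809.08),
    (10, 270.00),
    (11, 290.25),
    (12, 685.00),
    (13, 883.72),
    (14, 448 / 1024),
]


def deal_percentages(rows):
    """Map replica count -> share of all deal volume, as a percentage rounded to 2 decimals."""
    total = sum(tib for _, tib in rows)
    return {providers: round(tib / total * 100, 2) for providers, tib in rows}


percentages = deal_percentages(REPLICATION_ROWS)
for providers, pct in percentages.items():
    print(f"{providers:>2} providers: {pct:.2f}%")
```

Read this way, most of the deal volume sits with data replicated across 9 to 13 providers, which is what the checker summarizes as healthy replication.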

Replication Distribution

Deal Data Shared with other Clients

The table below shows how much unique data is shared with other clients. Usually, different applications own different data, so their deals should not resolve to the same CID.

⚠️ CID sharing has been observed.

| Other Client | Application | Total Deals Affected | Unique CIDs | Verifier |
| --- | --- | --- | --- | --- |
| f3wkoaov5p4ghrjtfk4lcwtloc3mr3vunxk37mkfmh4lwiqj5thj7wddjgvzaqz5eqdzgvbfrhgqhrvxzrmxva | Fei Yan - Kernelogic | 616.48 TiB | 3,405 | LDN # 60 |
| f3xbtxeptbmiuevsbjnnsntx6p37a7q6lbw4x7gdkdkiih2ha4qlymoayd2xwk2mpgbwdvgievlje7eivh3e3q | Fei Yan - Kernelogic | 534.18 TiB | 2,189 | LDN # 59 |
| f1pkrmygbvweykpjcut36lf7ewgqdfhjklbhvepda | Protocol Labs ( project: Slingshot Evergreen ) | 284.02 TiB | 2,890 | LDN # 293 |
| f1eknn564qwkfjwwzscbef2yc4itr5r6om3bq7yvi | Fei Yan - Kernelogic | 3.41 TiB | 11 | LDN # 297 |
| f15djc5avdxihguvpmtej24rzbvnnqvzurxe55kja | Kernelogic - Fei Yan - Slingshot Restore | 1.69 TiB | 54 | LDN # 136 |
| f3xgxfjtantx2qckdic7rqfmbcjjhygtbplxthcko3s3xyzhqa2l5bqtnkiwwettate653mvyghns4zdhmbdhq | Glif auto verified | 480.00 GiB | 3 | Jonathan Schwartz |
| f3waskb5svh6ywfzhm2khvsomjlgb5rif6flf2bcyn3ift7xswlffifn32jmvttf7o43zchku2manh7y34w4pq | Unknown | 352.00 GiB | 1 | Unknown |
| f3qcjrlgww75cagd5scykoxusipbd4gfpbmfjwbt3cawkvxn5dnt4h7wsadzeazqrzxg5mdlytshq3dcbhkspa | Unknown | 64.00 GiB | 1 | Unknown |
| f3qimp33d5mdov4dzd6oc7nzkmnepeyivg7yshdp7mmxwml2nkes4wnuzbsu6yfbyjfiflpeqlc4x4lzr76kaa | Glif auto verified | 32.00 GiB | 1 | Jonathan Schwartz |
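
Conceptually, CID sharing means the same CID shows up in deals made by more than one client address. The sketch below illustrates that idea with a simple grouping step. It is not the checker's actual implementation, and the client names and CIDs are hypothetical placeholders rather than values from this report.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def shared_cids(deals: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Given (client, cid) pairs, return each CID used by more than one client with its clients."""
    clients_by_cid = defaultdict(set)
    for client, cid in deals:
        clients_by_cid[cid].add(client)
    return {
        cid: sorted(clients)
        for cid, clients in clients_by_cid.items()
        if len(clients) > 1
    }


# Hypothetical placeholder deals (not real client addresses or CIDs).
example_deals = [
    ("client-A", "cid-1"),
    ("client-A", "cid-2"),
    ("client-B", "cid-2"),
    ("client-C", "cid-3"),
]
print(shared_cids(example_deals))
```

Here only `cid-2` is reported, because it appears under both `client-A` and `client-B`.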

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

kernelogic commented 1 year ago

  • Ford Multi-AV Seasonal Dataset
  • Cancer Cell Line Encyclopedia (CCLE)
  • Allen Brain Observatory - Visual Coding AWS Public Data Set

Disclosure again: in addition to the 3 datasets above, data from Slingshot Restore and #1104 is also shared here for additional replicas. Since these are all open datasets, I don't think this will raise any issues with T&T.

large-datacap-requests[bot] commented 1 year ago

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!