data-preservation-programs / slingshot

Official public repository for feedback and data collection in Filecoin Slingshot
https://slingshot.filecoin.io

We need a solution for small file datasets - min size deals are 32 GiB #440

Closed peterVG closed 3 years ago

peterVG commented 3 years ago

Problem

Some Slingshot datasets, like the Open Images dataset, contain many small files of just a few megabytes each. These can be bundled into larger packages, but anything bigger than 1 GiB creates a retrievability/usability issue. Future end-user apps will not want to download 32 GiB+ packages to get to a few 1 MB files. This will be true beyond Slingshot as well, for developers wanting to build apps that use small sets of user data rather than large public datasets. So IMHO this is a fundamental problem that Filecoin needs to address to be marketable going forward.

Deal batching has been proposed as a solution: https://github.com/filecoin-project/lotus/pull/5309/files/fb6c4d5c76e26f542e347570844be0b6ba345bb6

However, early reports from miners are that the burn fee on batched deals is much higher than on other deals, so batching is not actually solving this problem.

Proposed Solution

Can anything be done in the Filecoin market rules to reduce miners' costs so that it is economical for them to accept deals of 1 GiB or smaller?

Deal batching

Firstly, we need more feedback from miners on whether deal batching does in fact solve this problem. I.e., can I send a miner 1 GiB packages, have them wait until they have 32 of them, and then seal them as one batch? For one, this could create long delays in confirming sealed deals: the time a miner has to wait to fill a batch could be hours or days. I'm also curious how this will affect retrieval expenses from the app developer's perspective. Say it's 6 months later and I need to get a few 1 MiB files out of that batched deal. Presumably I'm paying a lot more as an end user to download them from a 32 GiB batch than I would from a 1 GiB file?
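A minimal sketch of the batching question above, assuming a miner simply queues incoming deals and seals once the queued bytes fill a 32 GiB sector. The names `DealBatcher` and `add_deal` are hypothetical and are not part of Lotus; real batching also involves padding, piece alignment, and deal deadlines, all of which this toy model ignores.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

GIB = 1024 ** 3
SECTOR_SIZE = 32 * GIB  # sector size discussed in this issue


@dataclass
class DealBatcher:
    """Toy model (hypothetical): queue deals until they fill one sector."""
    sector_size: int = SECTOR_SIZE
    pending: List[Tuple[str, int]] = field(default_factory=list)  # (deal_id, bytes)
    sealed: List[List[Tuple[str, int]]] = field(default_factory=list)

    def pending_bytes(self) -> int:
        return sum(size for _, size in self.pending)

    def add_deal(self, deal_id: str, size_bytes: int) -> bool:
        """Queue a deal; return True if any batch was sealed during this call."""
        if size_bytes > self.sector_size:
            raise ValueError("deal is larger than one sector")
        sealed_before = len(self.sealed)
        # If the new deal would overflow the sector, seal what is queued first.
        if self.pending_bytes() + size_bytes > self.sector_size:
            self._seal()
        self.pending.append((deal_id, size_bytes))
        # Seal immediately once the sector is exactly full.
        if self.pending_bytes() == self.sector_size:
            self._seal()
        return len(self.sealed) > sealed_before

    def _seal(self) -> None:
        if self.pending:
            self.sealed.append(self.pending)
            self.pending = []


batcher = DealBatcher()
results = [batcher.add_deal(f"deal-{i}", 1 * GIB) for i in range(32)]
```

In this model the first 31 one-GiB deals stay unsealed until the 32nd arrives, which is the confirmation delay the question is about.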

Fixed price deal-making costs

TBH I don't know my gas fee from my burn fee from my sealing fee. They may even all be the same thing AFAIK. To a Filecoin newcomer, the only expenses that make sense are the cost of procuring hardware, the electricity to run it, the price of bandwidth up and down, and hard disk space over time. Why are those not the main variable factors in how miners set their prices? Why is there separate market pricing for sealing and similar costs? It would be useful to see an ELI5 table that lists all the deal-making and retrieval state changes that incur costs and how they are priced. It would help non-miner app developers like me understand why there is more involved in pricing (and why I therefore can't get a deal smaller than 32 GiB) without doing a deep dive into miner documentation.

Now what if:

Would that get us 256MiB, 1GiB, etc. deals?

IPLD selectors

Lastly, "IPLD-selectors" have been mentioned in one Slack conversation as a possible solution. I don't understand how these work, but the suggestion is that they would enable retrieval of a subset of stored data, i.e., you'd be able to pack multiple files into one larger deal to reduce storage overhead, and retrieval clients would be able to request only the subset of data they're interested in instead of having to download the full deal. That sounds promising. Can anything more be said about this potential solution?
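To illustrate the general idea of partial retrieval (this is not IPLD selectors themselves, whose API is not described in this thread), here is a toy sketch. It assumes the packer records each file's offset and length inside the packed blob, so a client can fetch one file's bytes without reading the rest. The function names `pack` and `retrieve` are hypothetical.

```python
from typing import Dict, Tuple

Index = Dict[str, Tuple[int, int]]  # file name -> (offset, length)


def pack(files: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate files into one blob and record where each one lives."""
    blob = bytearray()
    index: Index = {}
    for name, data in files.items():
        index[name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


def retrieve(blob: bytes, index: Index, name: str) -> bytes:
    """Return only the requested file's bytes from the packed blob."""
    offset, length = index[name]
    return blob[offset:offset + length]


# Stand-in data: three small "files" packed into one larger blob.
files = {"a.jpg": b"aaaa", "b.jpg": b"bb", "c.jpg": b"cccccc"}
blob, index = pack(files)
one_file = retrieve(blob, index, "b.jpg")
```

Here the client touches only the 2 bytes of `b.jpg` out of the 12-byte blob; whether a Filecoin retrieval can be priced and served this way is exactly the open question.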

peterVG commented 3 years ago

This isn't really a "rules-suggestion" but the issue got this label by default. It should be changed to a "discussion" label but I don't seem to have edit authority to do so.

cwhiggins commented 3 years ago

Here is a link for anyone interested in an IPLD primer: https://github.com/ipld/specs/blob/master/schemas/introduction.md#a-quick-ipld-primer (from the specs repo, https://github.com/ipld/specs).
This one might be useful as well: https://github.com/ipld/specs/blob/master/design/history/exploration-reports/2020.09-learning-ipld.md

pooja commented 3 years ago

Thanks for creating this issue @peterVG! Flagging this for @raulk @hannahhoward @rvagg @jnthnvctr to take a look at as well.

jennijuju commented 3 years ago

@pooja can I transfer this ticket to lotus for tracking?

pooja commented 3 years ago

Just saw this, but yep go for it @jennijuju

dkkapur commented 3 years ago

@jennijuju - do you still want to transfer this one out or OK to close this?