Open bajtos opened 1 year ago
@bajtos the only purpose of lassie needing to keep temporary files is to handle the duplicate blocks case in a traversal, which we can't avoid. Essentially we need a place to put blocks we've fetched in case we discover later in a traversal that we need to look at them again.
There's a secondary use-case for a temporary store in parallelising bitswap requests - we fetch ahead (preload) of the blocks we'll need when we discover them, so as we proceed with a linear traversal we discover that the blocks we want are already in our temporary store. It really speeds things up. Currently we put these blocks in the same per-retrieval temporary pool (CAR) which gets cleaned up after a request.
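The preload-into-temp-store flow described above can be sketched in a few lines of Go (all names here are illustrative stand-ins, not Lassie's actual API): blocks fetched ahead of the traversal land in a temporary store, and the linear traversal checks that store before going back to the network.

```go
package main

import (
	"fmt"
	"sync"
)

// tempStore is a stand-in for Lassie's per-retrieval temporary CAR store.
// It simply keeps fetched blocks in memory, keyed by CID string.
type tempStore struct {
	mu     sync.Mutex
	blocks map[string][]byte
}

func newTempStore() *tempStore {
	return &tempStore{blocks: make(map[string][]byte)}
}

func (s *tempStore) put(cid string, data []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.blocks[cid] = data
}

func (s *tempStore) get(cid string) ([]byte, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	b, ok := s.blocks[cid]
	return b, ok
}

// preload fetches blocks concurrently, ahead of the traversal asking for them.
func preload(store *tempStore, cids []string, fetch func(string) []byte) {
	var wg sync.WaitGroup
	for _, c := range cids {
		wg.Add(1)
		go func(c string) {
			defer wg.Done()
			store.put(c, fetch(c))
		}(c)
	}
	wg.Wait()
}

func main() {
	store := newTempStore()
	fetch := func(cid string) []byte { return []byte("data-for-" + cid) }

	// Fetch ahead of the traversal...
	preload(store, []string{"bafyA", "bafyB"}, fetch)

	// ...so the linear traversal finds blocks already in the temp store.
	if b, ok := store.get("bafyA"); ok {
		fmt.Printf("hit: %s\n", b)
	}
}
```

The same shape works whether the store is backed by memory or by a temporary CAR file; the traversal only cares about the get-before-fetch check.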
It's not a straightforward thing to deal with, but here's some thoughts about this topic that maybe we could riff off:
The `dups` parameter makes this interesting, both more difficult and less difficult depending on how you look at it.

If we have `dups=y` then we don't need to care about a blockstore, but we don't have clear guarantees at the moment, although the main http upstream implementation (eipfs) only does `dups=y` and I think we could get fairly easily to a place where we have good guarantees about this. So potentially for http retrievals, if we can force through a `dups=y`, then we wouldn't need a temporary store.

If we had a strictly `dups=n` retrieval, where lassie isn't expected to provide duplicates on the way out, then we do have a path to skipping blocks that we've already seen; we just haven't used this yet. The traversals we use are relatively simple and linear; they don't have the kind of scenarios where you need to revisit the same block and make different decisions (there are branching selectors where this happens, but since we don't support arbitrary selectors we don't have to worry about that).

Having said all of that, I think there's a path to having some kind of `--notemp` flag, but opting in to that would mean a reduction in functionality: no `dups=y` output, potentially turning off one or more retrieval protocols depending on the work needed to make them work properly for this case, and either turning off the bitswap preloader or doing something more creative with those temporary blocks. There's a fair bit of work here, both in doing the plumbing and in making sure we have enough test coverage for the cases where it matters; most synthetic DAG tests end up not needing temporary storage, and it's always the edge cases where you end up with surprising blow-ups.
> I can imagine plugging in our custom block store

Since you're not using Lassie as a library, but as a daemon, what might this look like? We'd presumably need an API to talk to a blockstore across some RPC boundary.
(Posting here in case others Google tempfiles and/or /tmp.) I managed to fill /tmp on my test box and had to hunt around to find the `--tempdir` option:

    lassie fetch --tempdir . -o output.car -p bafy.........

This option is not mentioned in the help output.
    $ lassie --help
    NAME:
       lassie - Utility for retrieving content from the Filecoin network

    USAGE:
       lassie [global options] command [command options] [arguments...]

    COMMANDS:
       daemon   Starts a lassie daemon, accepting http requests
       fetch    Fetches content from the IPFS and Filecoin network
       version  Prints the version and exits
       help, h  Shows a list of commands or help for one command

    GLOBAL OPTIONS:
       --verbose, -v         enable verbose mode for logging (default: false)
       --very-verbose, --vv  enable very verbose mode for debugging (default: false)
       --help, -h            show help
`lassie help fetch` or `lassie help daemon` would give you these advanced options.

@distora-w3 how large are your downloads that you managed to fill up temp? Or do you just have a particularly small temp?
@rvagg Thank you for the detailed explanation!
I am fine with Lassie creating temporary CAR-store files, there is nothing wrong with that!
I am looking for a way to limit the maximum amount of storage used by Lassie at any time.
For example, when a Station module makes 10 retrieval requests in parallel, each request for a 10 GB UnixFS archive, I don't want Station/Lassie to consume 100 GB of available disk space. (Think about users running on low-end laptops with 256 GB of storage; they may not even have 100 GB available.)
I would like to tell Lassie "you can use at most 10 GB". When Lassie reaches this limit, I want it to abort requests in progress with an error, similarly to how it aborts when the `MaxBlocks` limit is reached.
Nice to have: abort the requests one by one while removing their temp CAR files until we have enough free space to finish the remaining requests.
I understand this may be way out of the scope of Lassie, that's why I am thinking about a Station-specific CAR store implementation.
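For illustration, the budget described above could look like a wrapper around whatever `Put` interface the temp store exposes (a sketch under assumed names, not Lassie's real API): a shared running total across retrievals, with `Put` failing once the cap would be exceeded, so a request aborts much like it does on `MaxBlocks`.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrTempSpaceExhausted would abort a retrieval, similar to hitting MaxBlocks.
var ErrTempSpaceExhausted = errors.New("temporary storage budget exhausted")

// cappedStore is a hypothetical wrapper that enforces a shared byte budget
// across all in-flight retrievals writing temporary blocks.
type cappedStore struct {
	limit int64
	used  atomic.Int64
}

func (s *cappedStore) Put(cid string, data []byte) error {
	if s.used.Add(int64(len(data))) > s.limit {
		s.used.Add(-int64(len(data))) // roll back the reservation
		return ErrTempSpaceExhausted
	}
	// ...write the block to the per-retrieval temp CAR file here...
	return nil
}

// Release returns space to the budget when a retrieval's temp file is deleted.
func (s *cappedStore) Release(n int64) { s.used.Add(-n) }

func main() {
	store := &cappedStore{limit: 10} // a tiny 10-byte budget for the demo
	fmt.Println(store.Put("bafyA", make([]byte, 8))) // within budget: <nil>
	fmt.Println(store.Put("bafyB", make([]byte, 8))) // exceeds budget: error
}
```

The "abort requests one by one" nice-to-have would then be a policy layered on top: on `ErrTempSpaceExhausted`, cancel a victim retrieval, call `Release` for its bytes, and retry.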
> Since you're not using Lassie as a library, but as a daemon, what might this look like? We'd presumably need an API to talk to a blockstore across some RPC boundary.

I am using Lassie as a library providing the HTTP daemon. Here is the source code for our Go func `InitDaemon`:

Zinnia, the runtime powering Filecoin Station, calls `InitDaemon` from the Rust side via FFI.
Firstly, Lassie should clean up its temporary files on a per-request basis; if it's not, then something is wrong. The only case where Lassie should leave temporary files around is a hard termination where it doesn't have a chance to clean up. Once a request has ended, or been closed, any temporary files should be cleaned up.
Yes, this seems to work great in my experience!

In `zinniad`, which runs in Filecoin Station, we delete any remaining Lassie temp files during startup to clean up after hard termination. See https://github.com/filecoin-station/zinnia/pull/258
> If we had a strictly `dups=n` retrieval, where lassie isn't expected to provide duplicates on the way out, then we do have a path to skipping blocks that we've already seen, we just haven't used this yet. The traversals we use are relatively simple and linear, they don't have the kind of scenarios where you need to revisit the same block and make different decisions (there are branching selectors where this happens, but since we don't support arbitrary selectors we don't have to worry about that).
Nice, I was not aware of this feature! Is my understanding correct that by sending the request header `Accept: application/vnd.ipld.car;order=dfs;version=1;dups=n;`, we tell Lassie to take a different execution path that requires less storage space (or will use less storage in the future, once you optimise the implementation for `dups=n`)?
Are there any downsides to be aware of? E.g. can we still verify the correctness of the retrieved data when the duplicate blocks are omitted from the CAR stream?
ATM, SPARK (our retrieval checker) streams the CAR file from Lassie and does not interpret it in any way. I think it's safe to enable `dups=n` in our retrieval requests, WDYT?
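For reference, building such a request in Go might look like this (the helper name and the echo server are purely illustrative; Lassie's trustless gateway serves CAR responses under `/ipfs/{cid}`):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// buildFetchRequest prepares a trustless-gateway fetch for cid from base,
// asking for a deduplicated, DFS-ordered CARv1 stream via the Accept header.
func buildFetchRequest(base, cid string) (*http.Request, error) {
	req, err := http.NewRequest("GET", base+"/ipfs/"+cid, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.ipld.car;order=dfs;version=1;dups=n")
	return req, nil
}

func main() {
	// A stand-in for the Lassie daemon that just echoes the Accept header.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, r.Header.Get("Accept"))
	}))
	defer srv.Close()

	req, err := buildFetchRequest(srv.URL, "bafyExampleCid")
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```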
/cc @juliangruber
> Having said all of that, I think there's a path to having some kind of `--notemp` flag, but opting in to that would mean a reduction in functionality. i.e. no `dups=y` output, potentially turning off one or more retrieval protocols depending on the work needed to make them work properly for this case, and either turning off the bitswap preloader, or doing something more creative with those temporary blocks. There's a fair bit of work here in both doing the plumbing but also making sure we have enough test coverage for cases where it matters—most synthetic DAG tests end up not needing temporary storage, it's always edge cases where you end up with surprising blow-ups.
As mentioned in my previous comment, we are okay with Lassie creating temporary files; there is no need to disable this part. We would like to limit the maximum amount of disk space these temporary files use, as a safety measure preventing the host machine from running out of storage space.
Thank you again for your time and energy in this discussion! 🙏🏻
> `lassie help fetch` or `lassie help daemon` would give you these advanced options
>
> @distora-w3 how large are your downloads that you managed to fill up temp? Or do you just have a particularly small temp?
Thanks for the usage and clarification. As for temp size, it was more about not knowing and walking into the issue.
> Is my understanding correct that by sending the request header `Accept: application/vnd.ipld.car;order=dfs;version=1;dups=n;`, we tell Lassie to take a different execution path that requires less storage space
No, but in theory it could! That work hasn't been done, but if prioritised we could take on a project to ensure that (a) lassie has a clear understanding of whether it'll be safe to not persist blocks to produce correct output and run its retrievers, and (b) actually do that, including saying `SkipMe` when a retriever attempts to load a block that it's already seen before. The trick is knowing that a `SkipMe` is a safe thing to do and isn't going to hurt either during retrieval or during output assembly.
`dups=y` is actually a very recent addition to the spec and to lassie; until recently it was never spitting out duplicates. Arguably `dups=n` is the superior way to produce a DAG like this because you don't run the risk of massive bloat: imagine a file that's mostly all `0` bytes; the chunker will produce the same block for most of that, but a `dups=y` response is going to give you that same block over and over.
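To put rough numbers on that bloat, here's a small sketch (chunk and file sizes are arbitrary; CAR framing and CID overhead are ignored) comparing the block payload a `dups=y` response carries for an all-zeros file with the single unique block a `dups=n` response needs:

```go
package main

import "fmt"

// payloadBytes estimates just the block payload carried in a CAR response
// for an all-zeros file: with dups=y every chunk is emitted, with dups=n
// the one identical chunk appears once. CAR framing/CID overhead is ignored.
func payloadBytes(fileSize, chunkSize int64, dups bool) int64 {
	chunks := (fileSize + chunkSize - 1) / chunkSize
	if dups {
		return chunks * chunkSize
	}
	return chunkSize // a single unique block of zeros
}

func main() {
	const GiB = int64(1) << 30
	const MiB = int64(1) << 20
	// A 1 GiB zero-filled file chunked into 1 MiB blocks:
	fmt.Println("dups=y:", payloadBytes(GiB, MiB, true)/MiB, "MiB")  // 1024 MiB
	fmt.Println("dups=n:", payloadBytes(GiB, MiB, false)/MiB, "MiB") // 1 MiB
}
```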
There's two reasons for `dups=y`:

If I added a `--duplicates=y` or `--no-cache` to `car extract`, then I could do away with that and error if a duplicate is requested. Then for a `dups=y` response from lassie, a `lassie fetch -o - bafy... | car extract --no-cache | ffplay -` could actually stream with no intermediate buffering. Right now that flow has buffering both in lassie and go-car because of this.

So that's for you to consider how it fits in your workflows. If you are storing blocks and you want your responses to be as efficient as possible, give it a `dups=n` and you'll get back perfectly valid trustless CARs that don't have duplicate blocks.
As for temporary file restrictions: I think we could accommodate you on this one, and it might be a generally useful feature, including for Saturn I think. We'd have to know the requirements though. We could track bytes written to our temporary stores and keep a running total, but deciding when to take action, and what action to take, would be the interesting problem that would need defining. If it's per-retrieval then it's easier; if it's across retrievals then you have to decide which retrievals to cancel if you go beyond the maximum.
We are running Lassie in Filecoin Station - an app that runs on desktop computers of (possibly non-technical) users. We want Stations to be unobtrusive to the user and leave plenty of resources (e.g. free disk space) for user workloads.
To achieve that, we would like to limit how much space Lassie can use for temporary files.
A new configuration option for Lassie (the library) or Lassie Daemon (the HTTP server) would be ideal.
How easy or difficult would the implementation be? Are there any other options that would allow us to limit the maximum amount of disk space used? For example, I can imagine plugging in our custom block store if the Lassie library/daemon supports that.
/cc @juliangruber