Closed ddierkes closed 6 years ago
Seems reasonable to me. Unfortunately, while adding this functionality to the API would be pretty easy, in the CLI it is going to be a little bit trickier.
Some issues I can think of off the top of my head:
fetch.txt? As a file? Either?
--resolve-fetch argument, so that either argument would have to be refactored (breaking backward compatibility), or we'd have to add a new arg, e.g., --select-fetch or somesuch.
I can see the utility here, it just needs to be spec'ed out some more.
If we're following IPFS practices, the only way to get the thing you want is to call it by its hash. That way you are sure you are getting the thing you want. But with a CLI, that would require some cutting and pasting, which is a chore in some shells. Calling by line number is easiest but also problematic. For my specific use case, it would be most convenient to call by file extension, or by the same wildcards you can use with the mv and cp commands.
I'm looking at this tool for ingest into an internal preservation system and not for public sharing (so minids with a public lookup are a conundrum). It would be great for my use case for a somewhat empty bag to be thrown into the system as a .tgz and for the system to then fetch all the non-massive files. So .json yes, but mp4 no.
Being able to specify wildcard patterns sounds like a good compromise between utility and scope creep.
One relatively simple way to implement this would be using Python's fnmatch library, which would give you the same syntax as Unix shell commands like mv and cp. Supporting a disjunctive array of filters would also be nice so that you could include multiple file types via extension, e.g., [*.txt, *.json], etc. Unfortunately, expressing anything more complex (like negation) is a bit of a pain.
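A minimal sketch of the fnmatch idea with a disjunctive list of patterns. The helper name select_by_patterns and the sample filenames are made up for illustration and are not part of bdbag. Note that fnmatch's * also matches path separators, so *.json matches files in subdirectories too.

```python
from fnmatch import fnmatchcase


def select_by_patterns(filenames, patterns):
    """Return the filenames that match at least one shell-style pattern."""
    return [
        name for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical fetch.txt filenames, for illustration only.
names = ["data/readme.txt", "data/meta/info.json", "data/video.mp4"]
selected = select_by_patterns(names, ["*.txt", "*.json"])
print(selected)
```

fnmatchcase is used here instead of fnmatch so that matching is case-sensitive on every platform.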
Alternatively, it could be implemented using regular expression matching. That would ultimately be more flexible/powerful but introduces more complexity. However, in this case an input array of filters is not really necessary since any disjunctive conditions could be expressed in the regex. This approach is probably more future-proof; i.e., I could see things getting to a point where fnmatch wasn't good enough anymore and needed to be switched out for re, while the reverse is less likely.
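For comparison, a rough sketch of the regular expression variant, where a single pattern covers the disjunction and negation can be written with a lookahead. The helper select_by_regex and the filenames are hypothetical, not bdbag code.

```python
import re


def select_by_regex(filenames, pattern):
    """Return the filenames in which the regular expression is found."""
    regex = re.compile(pattern)
    return [name for name in filenames if regex.search(name)]


# Hypothetical fetch.txt filenames, for illustration only.
names = ["data/readme.txt", "data/meta/info.json", "data/video.mp4"]

# One pattern expresses "ends with .txt OR .json".
wanted = select_by_regex(names, r"\.(txt|json)$")

# Negation via a negative lookahead: everything that does not end with .mp4.
not_mp4 = select_by_regex(names, r"^(?!.*\.mp4$)")
```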
re is certainly more powerful than fnmatch or glob, but the syntax is not quite as easy. I haven't quite figured out your code enough to contribute cleanly, so it is your choice. I could go back and add docstrings to your classes and functions though if you'd like.
I just noticed your other question.
As a file? Either?
Throwing a file together can be a lot more powerful than a regular expression in some cases. For instance, wget takes a url or a text file full of urls separated by line breaks. If you set it up right, you can wget an awful lot of disparate stuff from one text file.
We have a release pending this week, so it is not likely I will be able to do anything with this for the upcoming release. However, after we get the release cut, I'll create a branch for this and prototype something and we can take it from there.
Regarding docstrings, point noted. That should definitely be done at some point. Feel free to file an issue about that if you like, but a PR is probably not necessary. In the meantime, there is API documentation here which can be used as a reference.
So, for another part of the code (bdbag-utils), I recently needed to implement a simple filter expression mechanism for generating remote-file-manifests from various sources. It occurred to me that the same mechanism could be used to implement partial-fetch, so I have done so in an experimental branch: https://github.com/fair-research/bdbag/tree/partial-fetch.
It's pretty flexible while at the same time trying to keep things simple. There is a new argument --fetch-filter that takes a string of the form: <column><operator><value> where:
column is one of the following literal values corresponding to the field names in fetch.txt: url, length, or filename
<operator> is one of the following predefined tokens:
Operator | Description |
---|---|
== | equal |
!= | not equal |
=* | wildcard substring equal |
!* | wildcard substring not equal |
^* | wildcard starts with |
$* | wildcard ends with |
> | greater than |
>= | greater than or equal to |
< | less than |
<= | less than or equal to |
value is a string or integer
With this mechanism you can do various string-based pattern matching on filename and url. Using missing as the mode for --resolve-fetch, you can invoke the command multiple times with a different filter to perform an effective disjunction. For example:
bdbag --resolve-fetch missing --fetch-filter filename$*.txt ./my-bag
bdbag --resolve-fetch missing --fetch-filter filename^*README ./my-bag
bdbag --resolve-fetch missing --fetch-filter filename==data/change.log ./my-bag
bdbag --resolve-fetch missing --fetch-filter url=*/requirements/ ./my-bag
The above commands will get all files ending with ".txt", all files beginning with "README", the exact file "data/change.log", and all urls containing "/requirements/" in the url path.
You can also use length and the integer relation operators to easily limit the size of the files retrieved, for example:
bdbag --resolve-fetch all --fetch-filter length<=1000000
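To make the <column><operator><value> form concrete, here is a rough Python sketch of how such an expression could be parsed and applied to one fetch.txt entry. It is not the code from the branch: parse_filter, entry_matches, and the sample entry are hypothetical, and the operator semantics are assumed from the table above.

```python
COLUMNS = ("url", "length", "filename")

# Two-character tokens are listed before ">" and "<" so that ">=" is not
# mistaken for ">" while scanning.
OPERATORS = {
    "==": lambda field, value: field == value,
    "!=": lambda field, value: field != value,
    "=*": lambda field, value: value in field,
    "!*": lambda field, value: value not in field,
    "^*": lambda field, value: field.startswith(value),
    "$*": lambda field, value: field.endswith(value),
    ">=": lambda field, value: field >= value,
    "<=": lambda field, value: field <= value,
    ">": lambda field, value: field > value,
    "<": lambda field, value: field < value,
}
NUMERIC_TOKENS = {"==", "!=", ">=", "<=", ">", "<"}


def parse_filter(expression):
    """Split '<column><operator><value>' into its three parts."""
    for column in COLUMNS:
        if expression.startswith(column):
            rest = expression[len(column):]
            for token in OPERATORS:  # dicts keep insertion order
                if rest.startswith(token):
                    return column, token, rest[len(token):]
    raise ValueError("invalid filter expression: %r" % expression)


def entry_matches(entry, expression):
    """Return True if a fetch entry (a dict) satisfies the filter."""
    column, token, value = parse_filter(expression)
    field = entry[column]
    if column == "length" and token in NUMERIC_TOKENS:
        field, value = int(field), int(value)  # compare sizes as integers
    else:
        field = str(field)  # everything else is compared as text
    return OPERATORS[token](field, value)


# Hypothetical entry, for illustration only.
entry = {
    "url": "https://example.org/requirements/spec.txt",
    "length": "2048",
    "filename": "data/spec.txt",
}
print(entry_matches(entry, "length<=1000000"))
```

A disjunction over several filters can then be written as any(entry_matches(entry, f) for f in filters). When typing filters in a POSIX shell, keep in mind that $*, < and > are special to the shell, so the filter argument may need to be quoted.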
Would this general filter mechanism satisfy your use case? Your feedback is welcome.
Also, I am not opposed to providing a file-based solution to this as well, but I think something like that should be much simpler by design, e.g., a simple newline-delimited list of URLs to cross-reference against URLs in fetch.txt and only download on intersections.
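A small sketch of what that cross-reference could look like, assuming the standard fetch.txt layout of one "url length filename" entry per line. The function names and sample URLs are hypothetical and only illustrate the intersection idea.

```python
def parse_fetch_lines(fetch_text):
    """Parse fetch.txt content into a list of entry dicts."""
    entries = []
    for line in fetch_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The filename is the remainder of the line after url and length.
        url, length, filename = line.split(None, 2)
        entries.append({"url": url, "length": length, "filename": filename})
    return entries


def select_by_url_list(fetch_text, url_list_text):
    """Keep only the fetch entries whose URL appears in the URL list."""
    wanted = {u.strip() for u in url_list_text.splitlines() if u.strip()}
    return [e for e in parse_fetch_lines(fetch_text) if e["url"] in wanted]


# Hypothetical inputs, for illustration only.
fetch_text = (
    "https://example.org/a.json 120 data/a.json\n"
    "https://example.org/big.mp4 9000000000 data/big.mp4\n"
)
url_list_text = "https://example.org/a.json\nhttps://example.org/other.txt\n"

selected = select_by_url_list(fetch_text, url_list_text)
```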
I am not quite to the part of my project where I would be implementing a partial fetch. Your solution sounds good. I'm not sure exactly what "missing" means though.
I'm pretty unclear though on where you store 'remote-file-manifests' in a ro-bag. Is remote-files.json a working file unique to bdbags? JSON is certainly cleaner than the somewhat ugly text files LoC invented, but I feel weird adding arbitrary metadata files that aren't clearly specified somewhere.
For --resolve-fetch, the all and missing keywords just specify how to handle files that may or may not have already been fetched. With all, everything is downloaded again, regardless of whether it already exists in the bag payload or not, whereas missing just gets files that are not already in the payload. In the context of a multi-command fetch with filters, using missing just helps so that files are not re-downloaded in the case of filters that overlap and include the same content more than once.
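The difference between the two modes can be sketched roughly as follows. This is an illustration of the behavior described above, not bdbag's implementation; entries_to_fetch and the sample entries are hypothetical, and filenames are assumed to be relative to the bag directory.

```python
import os
import tempfile


def entries_to_fetch(entries, bag_dir, mode):
    """Pick the fetch entries to download for the given mode."""
    if mode == "all":
        # Download everything again, whether or not it already exists.
        return list(entries)
    if mode == "missing":
        # Only download files that are not yet present in the bag.
        return [
            e for e in entries
            if not os.path.exists(os.path.join(bag_dir, e["filename"]))
        ]
    raise ValueError("unknown mode: %r" % mode)


# Demo with a throwaway bag directory in which one file is already present.
with tempfile.TemporaryDirectory() as bag_dir:
    os.makedirs(os.path.join(bag_dir, "data"))
    open(os.path.join(bag_dir, "data", "a.txt"), "w").close()
    entries = [{"filename": "data/a.txt"}, {"filename": "data/b.txt"}]
    all_names = [e["filename"] for e in entries_to_fetch(entries, bag_dir, "all")]
    missing_names = [e["filename"] for e in entries_to_fetch(entries, bag_dir, "missing")]
```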
The remote-file-manifest is just a bdbag-specific format (or working file, as you write) for generating bags with remote and/or local payloads. It does not get added to the bag in any way, it is simply a metadata "driver" file for creation of bags.
The bagit-ro support is completely optional (and doesn't have anything to do with partial-fetch), and you do not need to use it if you feel it is unnecessary for your use case. If you are looking at the entire diff of this branch against master you are seeing other changes unrelated to partial-fetch, but intended to be included in the next software release.
I've included this functionality in the latest release. I'd like to close the issue if there are no more comments.
If the bdbag is multiple TBs in size and I just want to pull in select files, shouldn't I be able to --resolve-fetch selectively? 'missing' and 'all' can both be fire hoses.