fair-research / bdbag

Big Data Bag Utilities
https://fair-research.org
Apache License 2.0

partial resolve-fetch #20

Closed ddierkes closed 6 years ago

ddierkes commented 6 years ago

If the bdbag is multiple TBs in size and I just want to pull in select files, shouldn't I be able to --resolve-fetch selectively? 'missing' and 'all' can both be fire hoses.

mikedarcy commented 6 years ago

Seems reasonable to me. Unfortunately, while adding this functionality to the API would be pretty easy, in the CLI it is going to be a little bit trickier.

Some issues I can think of off the top of my head:

I can see the utility here, it just needs to be spec'ed out some more.

ddierkes commented 6 years ago

If we're following IPFS practices, the only way to get the thing you want is to call it by its hash. That way you are sure you are getting the thing you want. But with a CLI, that would require some cutting and pasting, which is a chore in some shells. Calling by line number is easiest but also problematic. For my specific use case, it would be most convenient to call by file extension, or by the same wildcards you can use with the mv and cp commands.

I'm looking at this tool for ingest into an internal preservation system and not for public sharing (so minids with a public lookup are a conundrum). It would be great for my use case for a somewhat empty bag to be thrown into the system as a .tgz and for the system to then fetch all the non-massive files. So .json yes, but .mp4 no.

mikedarcy commented 6 years ago

Being able to specify wildcard patterns sounds like a good compromise between utility and scope creep.

One relatively simple way to implement this would be using Python's fnmatch library, which would give you the same syntax as Unix shell commands like mv and cp. Supporting a disjunctive array of filters would also be nice so that you could include multiple file types via extension, e.g., [*.txt, *.json], etc. Unfortunately, expressing anything more complex (like negation) is a bit of a pain.
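As a rough illustration of the fnmatch idea, here is a minimal sketch of matching a filename against a disjunctive list of shell-style patterns. The helper name `matches_any` and the sample paths are made up for this example and are not part of bdbag. It uses `fnmatchcase` so that matching is case-sensitive on every platform.

```python
from fnmatch import fnmatchcase


def matches_any(filename, patterns):
    """Return True if filename matches at least one shell-style pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


# Hypothetical payload paths, for illustration only.
entries = ["data/notes.txt", "data/meta.json", "data/video.mp4"]

# Keep entries that match either pattern (a disjunction of filters).
selected = [e for e in entries if matches_any(e, ["*.txt", "*.json"])]
print(selected)
```

Note that in fnmatch a `*` also matches path separators, so `*.txt` selects `.txt` files at any depth.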

Alternatively, it could be implemented using regular expression matching. That would ultimately be more flexible/powerful but introduces more complexity. However, in this case an input array of filters is not really necessary since any disjunctive conditions could be expressed in the regex. This approach is probably more future-proof; i.e., I could see things getting to a point where fnmatch wasn't good enough anymore and needed to be switched out for re, while the reverse is less likely.
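To make the comparison concrete, here is a small sketch of the regex approach, in which a single expression carries both the disjunction and a negation. The `data/raw/` directory and the sample paths are assumptions for illustration only.

```python
import re

# One expression replaces a list of glob filters: the alternation selects
# .txt or .json files, and the negative lookahead excludes anything under
# a hypothetical "data/raw/" directory.
pattern = re.compile(r"^(?!data/raw/).*\.(txt|json)$")

# Hypothetical payload paths, for illustration only.
entries = [
    "data/notes.txt",
    "data/meta.json",
    "data/video.mp4",
    "data/raw/dump.json",
]

selected = [e for e in entries if pattern.search(e)]
print(selected)
```

The exclusion is the part that would be awkward to express with fnmatch patterns alone.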

ddierkes commented 6 years ago

re is certainly more powerful than fnmatch or glob, but the syntax is not quite as easy. I haven't quite figured out your code enough to contribute cleanly, so it is your choice. I could go back and add docstrings to your classes and functions though, if you'd like.

I just noticed your other question.

As a file? Either?

Throwing a file together can be a lot more powerful than a regular expression in some cases. For instance, wget takes a url or a text file full of urls separated by line breaks. If you set it up right, you can wget an awful lot of disparate stuff from one text file.

mikedarcy commented 6 years ago

We have a release pending this week, so it is not likely I will be able to do anything with this for the upcoming release. However, after we get the release cut, I'll create a branch for this and prototype something and we can take it from there.

Regarding docstrings, point noted. That should definitely be done at some point. Feel free to file an issue about that if you like, but a PR is probably not necessary. In the meantime, there is API documentation here which can be used as a reference.

mikedarcy commented 6 years ago

So, for another part of the code (bdbag-utils), I recently needed to implement a simple filter expression mechanism for generating remote-file-manifests from various sources. It occurred to me that the same mechanism could be used to implement partial-fetch, so I have done so in an experimental branch: https://github.com/fair-research/bdbag/tree/partial-fetch.

It's pretty flexible while at the same time tries to keep things simple. There is a new argument --fetch-filter that takes a string of the form: <column><operator><value> where:

With this mechanism you can do various string-based pattern matching on filename and url. Using missing as the mode for --resolve-fetch, you can invoke the command multiple times with a different filter to perform an effective disjunction. For example:

The above commands will get all files ending with ".txt", all files beginning with "README", the exact file "data/change.log", and all urls containing "/requirements/" in the url path.

You can also use length and the integer relation operators to easily limit the size of the files retrieved, for example:

Would this general filter mechanism satisfy your use case? Your feedback is welcome.
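For readers who want a feel for how a `<column><operator><value>` expression might be evaluated, here is a standalone sketch. It is not the code from the branch. The column names url, length, and filename come from this thread, but the operator spellings in `OPERATORS` are assumptions made for illustration; consult the bdbag documentation for the actual set.

```python
import operator

# Assumed operator tokens (illustrative only). Two-character tokens come
# first so that ">=" is tried before ">".
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
    "^*": lambda a, b: a.startswith(b),  # begins with
    "$*": lambda a, b: a.endswith(b),    # ends with
    "=*": lambda a, b: b in a,           # contains
    ">": operator.gt,
    "<": operator.lt,
}
NUMERIC = {">", ">=", "<", "<="}
COLUMNS = ("url", "length", "filename")


def parse_filter(expr):
    """Split '<column><operator><value>' into its three parts."""
    for column in COLUMNS:
        if expr.startswith(column):
            rest = expr[len(column):]
            for token in OPERATORS:  # dicts preserve insertion order
                if rest.startswith(token):
                    return column, token, rest[len(token):]
    raise ValueError(f"invalid filter expression: {expr!r}")


def entry_matches(entry, expr):
    """Return True if a fetch entry (a dict of strings) satisfies expr."""
    column, token, value = parse_filter(expr)
    actual = entry[column]
    if token in NUMERIC:
        # Relational operators compare sizes as integers.
        return OPERATORS[token](int(actual), int(value))
    return OPERATORS[token](str(actual), value)


# Hypothetical fetch entry, for illustration only.
entry = {
    "url": "https://example.org/requirements/a.txt",
    "length": "2048",
    "filename": "data/a.txt",
}
print(entry_matches(entry, "filename$*.txt"))
print(entry_matches(entry, "length<4096"))
```

Running one filter per invocation, as described above, corresponds to calling `entry_matches` once per expression and fetching an entry if any call returns True.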

Also, I am not opposed to providing a file-based solution to this as well, but I think something like that should be much simpler by design, e.g., a simple newline-delimited list of URLs to cross-reference against URLs in fetch.txt and only download on intersections.
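A file-based variant along those lines might look like the sketch below. It assumes the standard BagIt fetch.txt layout of `URL LENGTH FILENAME` per line; the function names and sample data are hypothetical and do not reflect any existing bdbag API.

```python
def parse_fetch_line(line):
    """Split a fetch.txt line into (url, length, filename)."""
    # maxsplit=2 keeps any spaces inside the filename intact.
    url, length, filename = line.strip().split(None, 2)
    return url, length, filename


def select_entries(fetch_lines, wanted_urls):
    """Keep only fetch entries whose URL appears in the wanted list."""
    wanted = {u.strip() for u in wanted_urls if u.strip()}
    selected = []
    for line in fetch_lines:
        if not line.strip():
            continue  # skip blank lines
        entry = parse_fetch_line(line)
        if entry[0] in wanted:
            selected.append(entry)
    return selected


# Hypothetical file contents, for illustration only.
fetch_lines = [
    "https://example.org/a.json 120 data/a.json\n",
    "https://example.org/b.mp4 - data/b.mp4\n",
]
wanted_urls = ["https://example.org/a.json\n", "\n"]

print(select_entries(fetch_lines, wanted_urls))
```

In practice the two inputs would be read from fetch.txt and the user-supplied URL list with ordinary file iteration.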

ddierkes commented 6 years ago

I am not quite to the part of my project where I would be implementing a partial fetch. Your solution sounds good. I'm not sure exactly what "missing" means though.

I'm pretty unclear though on where you store 'remote-file-manifests' in a ro-bag. Is remote-files.json a working file unique to bdbags? JSON is certainly cleaner than the somewhat ugly text files LoC invented, but I feel weird adding arbitrary metadata files that aren't clearly specified somewhere.

mikedarcy commented 6 years ago

For --resolve-fetch, the all and missing keywords just specify how to handle files that may or may not have already been fetched. With all, everything is downloaded again, regardless of whether it already exists in the bag payload, whereas missing just gets files that are not already in the payload. In the context of a multi-command fetch with filters, using missing ensures that files are not re-downloaded when filters overlap and include the same content more than once.
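The difference between the two modes can be sketched as follows. This is a simplified illustration that only checks whether a file exists on disk; the function name is hypothetical, and the real implementation may apply additional checks.

```python
import os
import tempfile


def entries_to_fetch(entries, bag_root, mode):
    """Pick (url, filename) pairs to download for mode 'all' or 'missing'."""
    if mode == "all":
        # Re-download everything, whether or not it is already present.
        return list(entries)
    if mode == "missing":
        # Only download files that are not yet in the bag payload.
        return [
            (url, filename)
            for url, filename in entries
            if not os.path.exists(os.path.join(bag_root, filename))
        ]
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical entries, for illustration only.
entries = [
    ("https://example.org/a.txt", "data/a.txt"),
    ("https://example.org/b.txt", "data/b.txt"),
]

with tempfile.TemporaryDirectory() as bag_root:
    # Simulate a bag in which data/a.txt has already been fetched.
    os.makedirs(os.path.join(bag_root, "data"))
    open(os.path.join(bag_root, "data", "a.txt"), "w").close()

    fetch_all = entries_to_fetch(entries, bag_root, "all")
    fetch_missing = entries_to_fetch(entries, bag_root, "missing")

print(len(fetch_all), fetch_missing)
```

With overlapping filters run one after another, the second run in missing mode skips anything the first run already retrieved.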

The remote-file-manifest is just a bdbag-specific format (or working file, as you write) for generating bags with remote and/or local payloads. It does not get added to the bag in any way; it is simply a metadata "driver" file for the creation of bags.

The bagit-ro support is completely optional (and doesn't have anything to do with partial-fetch), and you do not need to use it if you feel it is unnecessary for your use case. If you are looking at the entire diff of this branch against master you are seeing other changes unrelated to partial-fetch, but intended to be included in the next software release.

mikedarcy commented 6 years ago

I've included this functionality in the latest release. I'd like to close the issue if there are no more comments.