DHARPA-Project / kiara-website

Creative Commons Zero v1.0 Universal

How do I import a bundle of files into Kiara via the python API? #13

Open caro401 opened 7 months ago

caro401 commented 7 months ago

Marked as low-priority, since I don't think we have a pressing use case for this right now, but capturing the discussion from #11, quoting @makkus:

What is a file bundle

A data type that contains one or several files, each identified by an internal (relative) sub-path within the bundle. The contained files are usually related in some way that is relevant to the computations that will be done on them (for example, multiple text files belonging to the same corpus).
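As a rough mental model only (plain Python, not kiara's actual `KiaraFileBundle` implementation), a bundle can be pictured as a mapping from relative sub-path to file. The helper name `collect_bundle` is hypothetical and exists purely for illustration.

```python
from pathlib import Path


def collect_bundle(root: Path) -> dict[str, Path]:
    """Map every file under `root` to its sub-path relative to `root`.

    Illustrative stand-in for a bundle, not a kiara API.
    """
    return {
        p.relative_to(root).as_posix(): p
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

For a corpus folder containing `a.txt` and `sub/b.txt`, the keys of the resulting mapping would be `a.txt` and `sub/b.txt`.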

When would you use that rather than just importing lots of files individually?

Whenever you have files that have that shared context and would be fed into a downstream operation at the same time. Otherwise the downstream operation would need an input field for every individual file, which would be inefficient and only possible if you know exactly how many (sub-)files you will be dealing with.

Is there anything you can do with a file bundle you can't do with a file or vice versa?

Technically not, I guess, but the question really is what operation would make sense for a single file that also makes sense for a file bundle. The only thing I can think of is doing the same operation on every sub-file of a bundle, which would be very inefficient and painful to do manually, so it'd be nice to have a module that can take a file bundle and do that operation for all included files. But we haven't had a use case like that so far, if I remember right.
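A minimal sketch of that "same operation on every sub-file" idea, in plain Python rather than as a kiara module. The bundle is stood in for by a dict of relative sub-path to file text, and `map_bundle` is a hypothetical helper name, not something kiara provides.

```python
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


def map_bundle(files: Mapping[str, str], operation: Callable[[str], T]) -> dict[str, T]:
    """Apply `operation` to the text of every file, keyed by relative sub-path."""
    return {rel_path: operation(text) for rel_path, text in files.items()}


# Stand-in for a bundle: relative sub-path -> file content.
corpus = {"doc1.txt": "one two three", "letters/doc2.txt": "four five"}

# Example operation: count whitespace-separated words in each file.
word_counts = map_bundle(corpus, lambda text: len(text.split()))
print(word_counts)  # {'doc1.txt': 3, 'letters/doc2.txt': 2}
```

The caller supplies the per-file operation once, and the result keeps the bundle's sub-paths as keys, so the outputs stay associated with their source files.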

For kiara's purposes, a file and a file_bundle are two different data types, and a module that takes one as input can't be used with the other. You'd have to use a 'pick.file' operation on a file bundle first, for example, if the operation you want to use has a single file input. Or you'd have to 'augment' a single file with an internal relative path (which basically means adding information to the data) if you wanted to convert a single file to a file_bundle (but that's not something we have had to do so far, I think).
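The two conversions can be pictured with stand-in types. This is a conceptual sketch in plain Python: the `File` and `FileBundle` classes, the helper names, and the placeholder file content are all hypothetical and are not kiara's models or operations.

```python
from dataclasses import dataclass


@dataclass
class File:
    """Stand-in for a single file: a name and its content."""
    name: str
    content: str


@dataclass
class FileBundle:
    """Stand-in for a bundle: files keyed by relative sub-path."""
    included_files: dict[str, File]


def pick_file(bundle: FileBundle, path: str) -> File:
    """Bundle -> file: select one member by its relative sub-path."""
    return bundle.included_files[path]


def to_bundle(file: File, rel_path: str) -> FileBundle:
    """File -> bundle: 'augment' the file with a relative sub-path."""
    return FileBundle(included_files={rel_path: file})


# Placeholder content, for illustration only.
version = File(name="version.txt", content="placeholder")
bundle = to_bundle(version, "kiara/version.txt")
picked = pick_file(bundle, "kiara/version.txt")
```

Going from file to bundle requires the extra `rel_path` argument, which is the "adding information to the data" step described above; going the other way only needs the sub-path to select.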

makkus commented 7 months ago
from kiara.api import KiaraAPI
from kiara.models.filesystem import KiaraFile, KiaraFileBundle

api = KiaraAPI.instance()

# Import a local folder as a file bundle.
inputs = {
    "path": "/home/markus/projects/kiara/kiara/src/"
}
results = api.run_job("import.local.file_bundle", inputs=inputs)
bundle = results["file_bundle"]

# Access the actual Python object and list the relative sub-paths it contains.
bundle_data: KiaraFileBundle = bundle.data
print(bundle_data.included_files.keys())

# Pick a single file out of the bundle by its relative sub-path.
inputs = {
    "file_bundle": bundle,
    "path": "kiara/version.txt"
}
results = api.run_job("file_bundle.pick.file", inputs=inputs)
file = results["file"]

# Access the file object and print its text content.
data: KiaraFile = file.data
print(data.read_text())

Rewrite according to whatever example code standards you choose. I included some extra type hints in case you want to explain how to access the actual data in Python, and the modules they live in. Just remove those lines/hints if appropriate.

makkus commented 7 months ago

Discuss when you should use a file bundle instead of multiple files, and what the tradeoffs are. E.g. why the network analysis examples have nodes and edges CSVs but usually import them separately (is this wrong???)

It really is usually fairly obvious which one you need when you design a module; I haven't really had to think about it. In some cases it might be beneficial to offer both input types, but that would mean module proliferation, so I'm not sure. It just seems to make sense to import nodes and edges separately, but maybe my design is wrong here. Happy to change that module if necessary.
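One way to picture the tradeoff for the nodes/edges case, as a plain-Python sketch under stated assumptions: this is not the kiara network analysis module, and every function name, file name, and CSV column below is hypothetical. With exactly two inputs that play distinct roles, separate inputs make the roles explicit; a bundle input needs extra sub-path parameters to recover them.

```python
import csv
import io


def read_rows(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def network_from_files(nodes_csv: str, edges_csv: str) -> dict:
    """Variant A: one named input per file; roles are explicit in the signature."""
    return {"nodes": read_rows(nodes_csv), "edges": read_rows(edges_csv)}


def network_from_bundle(
    bundle: dict[str, str],
    nodes_path: str = "nodes.csv",
    edges_path: str = "edges.csv",
) -> dict:
    """Variant B: one bundle input; roles must be resolved via sub-paths."""
    return network_from_files(bundle[nodes_path], bundle[edges_path])


nodes = "id,label\n1,Ada\n2,Bob\n"
edges = "source,target\n1,2\n"

from_files = network_from_files(nodes, edges)
from_bundle = network_from_bundle({"nodes.csv": nodes, "edges.csv": edges})
```

Both variants produce the same result here. The bundle variant starts to pay off only when the number of files is not known in advance, which matches the earlier point about needing one input field per file.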

caro401 commented 7 months ago

https://github.com/DHARPA-Project/kiara-website/pull/11#issuecomment-1814447518 - so can you not import a zip file via import.local.file? Do you have to use file bundle? Is the same true for other archive formats (tar etc)?