DHARPA-Project / kiara-website

Creative Commons Zero v1.0 Universal

How to import a file #11

Closed caro401 closed 7 months ago

caro401 commented 7 months ago

Don't merge yet! This is work in progress until #10 is resolved, and the code updated to reflect those decisions.

This is an attempt to write up the discussion from #9 in a how-to style format, hopefully accessible to a Python-curious historian, but meaningful to a Python expert wanting to just achieve the result.

I'm looking for feedback from someone on the team about how well this reads as the target audience, and from @makkus as to whether it's technically correct code, comments and prose.

I put the code in plain markdown code blocks, as it won't actually run correctly unless the example relative file path exists. Is that OK for now, or is a fully functional example in Jupyter or another format preferable (see also #4)?

makkus commented 7 months ago

Ah, we might also want to have a section about 'file_bundles' (basically folders, but they can also be archives or anything else where we have more than one file and the files belong together in an important way). Not sure if that needs to be referenced here, but personally I'd probably be curious what to do if I have that instead of a single file.

caro401 commented 7 months ago

> two important concepts that need to be explained (possibly somewhere else)

Yes, I think we need a lot more clarity on what the store is, how it works, why it exists, and what onboarding/import means on a technical level. I don't think that content belongs here (although a link would probably be useful), as I imagine these bits of docs as short things you come to when you have a specific problem and need a specific answer to get your work done. I'll open a discussion issue and write up what I understand about the store and aliases, but my knowledge is very incomplete.

> we might also want to have a section about 'file_bundles'

Sure, I can add that, but I don't know the answer. What is a file bundle, and when would you use one rather than just importing lots of files individually? Is there anything you can do with a file bundle that you can't do with a file, or vice versa?

makkus commented 7 months ago

> What is a file bundle

A data type that contains one or several files, each identified by an internal (relative) sub-path within the bundle. The contained files are usually related in some way that is relevant to the computations that will be done on them (for example, multiple text files belonging to the same corpus).
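To make the definition concrete, here is a minimal sketch in plain Python. It is not kiara's API: `collect_bundle` is a hypothetical helper, and the bundle is modelled simply as a dict that maps each file's relative sub-path to its contents.

```python
import tempfile
from pathlib import Path


def collect_bundle(root: Path) -> dict[str, bytes]:
    """Hypothetical helper (not part of kiara): map each file's
    relative sub-path under `root` to that file's contents."""
    return {
        path.relative_to(root).as_posix(): path.read_bytes()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


# Build a tiny throwaway corpus so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp)
    (corpus / "letters").mkdir()
    (corpus / "letters" / "1890.txt").write_text("Dear Sir", encoding="utf-8")
    (corpus / "diary.txt").write_text("Monday", encoding="utf-8")

    bundle = collect_bundle(corpus)

# Each file is addressed by its sub-path inside the bundle,
# not by where the folder happened to live on disk.
for rel_path in sorted(bundle):
    print(rel_path, len(bundle[rel_path]), "bytes")
```

The point of the sub-path keys is that the bundle stays meaningful after it has been moved or imported: the files are still identified relative to each other.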

> when would you use that rather than just importing lots of files individually

Whenever you have files that share that context and would be fed into a downstream operation at the same time. Otherwise the downstream operation would need an input field for every individual file, which would be inefficient, and only possible if you know exactly how many (sub-)files you will be dealing with.

> Is there anything you can do with a file bundle you can't do with a file or vice versa?

Technically not, I guess, but the real question is which operations that make sense for a single file also make sense for a file bundle. The only thing I can think of is doing the same operation on every sub-file of a bundle, which would be very inefficient and painful to do manually, so it'd be nice to have a module that takes a file bundle and runs that operation on all included files. But we haven't had a use case like that so far, if I remember right.
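As a rough illustration of that "same operation on every sub-file" idea, here is a sketch in plain Python. It does not use kiara: `map_bundle` is a hypothetical name, and the bundle is assumed to be a dict from relative sub-path to text.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def map_bundle(bundle: Dict[str, str], operation: Callable[[str], T]) -> Dict[str, T]:
    """Hypothetical helper: apply a single-file operation to every
    sub-file of a bundle, keeping the relative sub-paths as keys."""
    return {rel_path: operation(text) for rel_path, text in bundle.items()}


# A stand-in bundle: relative sub-path -> file text.
corpus = {
    "letters/1890.txt": "Dear Sir or Madam",
    "diary.txt": "Monday rain",
}

# The single-file operation here is a simple word count.
word_counts = map_bundle(corpus, lambda text: len(text.split()))
print(word_counts)
```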

For kiara's purposes, a file and a file_bundle are two different data types, and a module that takes one as input can't be used with the other. If the operation you want to use has a single file input, you'd first have to run something like a 'pick.file' operation on the file bundle. Conversely, to convert a single file into a file_bundle you'd have to 'augment' it with an internal relative path (which basically means adding information to the data), but that's not something we've had to do so far, I think.
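The two conversions can be sketched like this. This is plain Python for illustration only, not kiara code: the `File` and `FileBundle` classes and the `pick_file` and `as_bundle` functions are hypothetical stand-ins for the data types and the 'pick.file' / 'augment' steps described above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class File:
    """Stand-in for a single-file data type."""
    name: str
    content: bytes


@dataclass
class FileBundle:
    """Stand-in for a bundle: relative sub-path -> File."""
    files: Dict[str, File]


def pick_file(bundle: FileBundle, rel_path: str) -> File:
    """Bundle -> file: select one sub-file by its relative path."""
    if rel_path not in bundle.files:
        raise KeyError(f"no file at {rel_path!r}; available: {sorted(bundle.files)}")
    return bundle.files[rel_path]


def as_bundle(file: File, rel_path: str) -> FileBundle:
    """File -> bundle: 'augment' the file with a relative path."""
    return FileBundle(files={rel_path: file})


def count_words(file: File) -> int:
    """An operation whose input is a single file, not a bundle."""
    return len(file.content.decode("utf-8").split())


bundle = FileBundle(
    files={
        "a.txt": File("a.txt", b"one two"),
        "b.txt": File("b.txt", b"three four five"),
    }
)

picked = pick_file(bundle, "a.txt")      # bundle -> single file
words = count_words(picked)              # now usable as a file input
rewrapped = as_bundle(picked, "corpus/a.txt")  # single file -> bundle
```

Note that `as_bundle` needs the extra `rel_path` argument: that is the information a lone file does not carry on its own.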

caro401 commented 7 months ago

Are there currently any examples or user stories of using a file bundle? Are there any operations that currently use them? If not, I'll pull this file-bundle discussion into a low-priority issue for now, clean up the single-file docs and move on.

makkus commented 7 months ago

> Are there currently any examples or user stories of using a file bundle

If I remember right, it's an important topic in language analysis, since there you usually have loads of separate text files.

makkus commented 7 months ago

Ah, also: when you import from an external archive service, they often deliver a zip file, which kiara would treat as a bundle, and then you pick the single file in it. So bundles are part of pipelines in that way.
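Roughly, that pattern looks like the sketch below. It uses only Python's standard `zipfile` module rather than kiara, so `bundle_from_zip` and `pick_only_file` are hypothetical helpers, and the archive contents are made up for the example.

```python
import io
import zipfile
from typing import Dict, Tuple


def bundle_from_zip(data: bytes) -> Dict[str, bytes]:
    """Hypothetical helper: treat a zip archive as a bundle,
    mapping each member's path inside the archive to its contents."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()
        }


def pick_only_file(bundle: Dict[str, bytes]) -> Tuple[str, bytes]:
    """Hypothetical helper: return the single file of a one-file bundle."""
    if len(bundle) != 1:
        raise ValueError(f"expected exactly one file, found {len(bundle)}")
    ((rel_path, content),) = bundle.items()
    return rel_path, content


# Stand-in for a download from an archive service: a zip built in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("export/records.csv", "id,title\n1,Example\n")
archive_bytes = buffer.getvalue()

bundle = bundle_from_zip(archive_bytes)
path, content = pick_only_file(bundle)
print(path)
```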

caro401 commented 7 months ago

@makkus is this acceptable enough to merge, or what specific changes would you like? I've moved the discussion of file_bundles and the store into separate issues, #12 and #13.