Leaky abstraction in `OpenPechFS`'s `path` arg

10zinten commented 1 year ago

How it's leaky abstraction?

output_path is the path to store all the pechas created by the formatters. It defaults to ~/.openpecha/pechas/

But in OCRFormatter's create_opf, the output_path is passed as path of OpenPecha which is opf_path

now, the caller code, eg OCR-pipelines, needs to create opf_path for pecha, which in turn requires to create pecha_id. But the pecha_id generation is handled by Metadata, which is only created in the Formatters. So, with currently implementation, pecha will be saved at opf_path created by caller code. Since, it's doesn't have access to Metadata creation, metadata will generate new pecha_id. Now, we ended up with different, pecha_id in opf_path and `meta.yml.

Therefore, I think this is a leaky abstraction. The caller code only needs to provide where to store the all the pechas to the Formatters.

Why this problem exists?

I think, there is two scenarios, creating and loading pecha and we don't have clean way to handle these two.

Solution

1. When Creating new pechas

We should initialise the pecha object with the actual data like, base, layers, metadata, etc. When saving, we should provide output_path which is the parent path of the pecha path like {output_path}/{pecha_id}/{pecha_id}.opf. Now the output_path is configurable, which is desired behaviour.

class OpenPechaFS(OpenPecha):
    ...
    @property
    def opf_path(self):
        return self.base_path / self.pecha_id / f"{self.pecha_id}.opf"

    def save(output_path: Path) -> Path:
        self.base_path = output_path
        ...
        return self.opf_path

2. When Loading existing pechas

I think we should go with classmethods, for eg:

pecha = OpenPechaFS.from_path(<path_to_pecha>)  # loads local pecha

pecha = OpenPechaFS.from_id(<pecha_id>)  # downloads pecha from github

pecha = OpenPechaGitRepo.from_commit_sha(<commit_sha>)  # loads pecha from specific commit sha

eroux commented 1 year ago

I'm not sure about the first idea, I think you're wanting to make OpenPechaFS a very high level class, but I think it's not a good idea. I think we need a class that interacts with a local .opf folder and that's it, I think the current state of OpenPechaFS is just that so I wouldn't change it.

Now, I also agree we need a high-level class that reserves an ID, create the repo on github, downloads it, etc. and that's OpenPechaGitRepo. So I would instead do:

pecha = OpenPechaFS(<path_to_pecha>) # no change

pecha = OpenPechaGitRepo(base_path, <pecha_id>=None, <rev>="main")  # downloads pecha from github if it exists, otherwise create one with the id. If pecha_id is None, reserves a new id

Note that if we add the rev argument, the path on disk should be changed to self.base_path / self.pecha_id + "_" + self.rev / self.pecha_id + ".opf". Otherwise you're going to get bugs when opening one pecha at two different revisions (since they'll share the same folder). I think it's a good thing overall, good idea!

Also, I think the creation of a new openpecha id should not be hidden deep in the Metadata generation, it should be a static method of OpenPechaGitRepo that anyone can use:

pecha_id =  OpenPechaGitRepo.reserve_new_id() # reserves a new id that is not currently in use on github

Note that in ocr_formatter you can then use OpenPechaGitRepo.reserve_new_id() to pass the pecha_id argument to create_opf, I think that's a small thing you can do that should solve your current issue

I think having output_path in the __init__ of OCRFormatter is not ideal, it would be better to have opf_path as an argument of create_opf. That way the OCRFormatter doesn't need to deal with creating a new ID, cloning github repos, etc. which I think shouldn't be its job, it's just there to write some files in an opf.

And finally, I think it would be much cleaner if we had a method create_in_op method in OCRFormatter. That way everything is abstracted correctly, we already have our OpenPecha object in the function, it can be of any subclass of OpenPecha, that's the cleanest solution.

What do you think?

10zinten commented 1 year ago

it would be better to have opf_path as an argument of create_opf. That way the OCRFormatter doesn't need to deal with creating a new ID.

so, that means, the caller/client code using the Formatter has to make prepare opf_path?

OpenPecha / Toolkit