jbarrow closed this 2 years ago
Oh, also, I should say that your dataset proposal looks nice. I imagine each dataset might simply have a different top-level directory, say:
skiff_files/datasets/my-dataset
skiff_files/datasets/my-other-dataset
Then we could use a symlink to change the dataset that the application is pointed at, or wire something into the UI for doing so (which simply changes the path that's being read from). The latter route has one little bit of complexity, in that we'd need to protect against attacks that use path modifiers like ../ to read files that the application shouldn't have access to (like /etc/passwd! 😅). But this probably wouldn't be too hard to figure out.
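For what it's worth, here's a rough sketch of how that path check might look. It's only an illustration: the function name resolve_dataset_dir and the DATASETS_ROOT constant are hypothetical and not part of Pawls, and the root path is just the layout suggested above.

```python
from pathlib import Path

# Assumed root, taken from the directory layout suggested above.
DATASETS_ROOT = Path("skiff_files/datasets")


def resolve_dataset_dir(name: str, root: Path = DATASETS_ROOT) -> Path:
    """Return the directory for dataset `name`, refusing paths that escape `root`.

    Hypothetical helper for illustration only.
    """
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below is made against the real location on disk.
    candidate = (root / name).resolve()
    # Reject the root itself and anything that is not located beneath it.
    if candidate == root or root not in candidate.parents:
        raise ValueError(f"invalid dataset name: {name!r}")
    return candidate
```

With this, a name like my-dataset resolves normally, while ../../etc/passwd or an absolute path like /etc/passwd raises ValueError. One caveat: because symlinks are resolved first, a symlink inside the datasets directory that points somewhere outside it would be rejected too.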
I'd say feel free to go about adding this, as I can't think of any reason not to!
Alright, added a --no-hash argument and updated the README.
I'll create a separate pull request for the full dataset implementation/discussion once this one's merged.
One thing I would appreciate in Pawls (and am happy to flesh out, if there's any interest) is an extended CLI that can manage datasets. As a quick first pass, I added a command that takes in a PDF or directory of PDFs, hashes them, and copies them into the skiff_files folder.
The rest is outside the scope of this pull request, but in general, I was thinking of a set of commands, to create a dataset:
Add PDFs to the dataset:
And offer per-dataset configuration for the label-set.
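To make the hash-and-copy step mentioned above a bit more concrete, here is a minimal sketch of the idea. It is not the code from this pull request: the function names add_pdf and add_pdfs, the use of SHA-1, and the one-directory-per-PDF layout are all assumptions made for illustration, and use_hash=False stands in for what a --no-hash argument might do.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def add_pdf(pdf: Path, dest_root: Path, use_hash: bool = True) -> Path:
    """Copy `pdf` into its own directory under `dest_root`; return the new path."""
    if use_hash:
        # Assumption: a SHA-1 digest of the file contents is the identifier.
        name = hashlib.sha1(pdf.read_bytes()).hexdigest()
    else:
        # Without hashing, fall back to the original file name (sans extension).
        name = pdf.stem
    # Assumed layout: <dest_root>/<name>/<name>.pdf
    target_dir = dest_root / name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{name}.pdf"
    shutil.copy(pdf, target)
    return target


def add_pdfs(source: Path, dest_root: Path, use_hash: bool = True) -> List[Path]:
    """Accept either a single PDF or a directory of PDFs."""
    pdfs = sorted(source.glob("*.pdf")) if source.is_dir() else [source]
    return [add_pdf(p, dest_root, use_hash) for p in pdfs]
```

For example, add_pdfs(Path("papers"), Path("skiff_files")) would copy every PDF found directly inside a papers directory. Naming by content hash means adding the same PDF twice lands on the same target path rather than creating a duplicate.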