allenai / pawls

Software that makes labeling PDFs easy.
https://pawls.apps.allenai.org
Apache License 2.0

Adding a CLI tool for adding PDFs #144

Closed jbarrow closed 2 years ago

jbarrow commented 2 years ago

One thing I would appreciate in Pawls (and am happy to flesh out, if there's any interest) is an extended CLI that can manage datasets. As a quick first pass, I added a command that takes in a PDF or directory of PDFs and copies them into the skiff_files folder.

pawls add [PDF OR FOLDER OF PDFS]

It hashes the PDFs and copies them into skiff_files.
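
As a rough illustration of the hash-and-copy step described above, here is a minimal sketch. It is not the code from this pull request: the function names `hash_pdf` and `add_pdfs`, the choice of SHA-256, and the `<hash>/<hash>.pdf` layout under the target folder are all assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def hash_pdf(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's bytes (hash choice is an assumption)."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_pdfs(source: Path, target_dir: Path) -> List[Path]:
    """Copy one PDF, or every PDF under a directory, to target_dir/<hash>/<hash>.pdf.

    The directory layout is hypothetical; adjust it to whatever the app reads.
    """
    pdfs = [source] if source.is_file() else sorted(source.rglob("*.pdf"))
    copied = []
    for pdf in pdfs:
        name = hash_pdf(pdf)
        dest = target_dir / name / f"{name}.pdf"
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(pdf, dest)
        copied.append(dest)
    return copied
```

Naming files by content hash means that adding the same PDF twice overwrites the earlier copy instead of creating a duplicate.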


The rest is outside the scope of this pull request, but in general I was thinking of a set of commands. First, one to create a dataset:

pawls dataset create [DATASET NAME] [INITIAL PDFS]

Then one to add PDFs to the dataset:

pawls dataset add [PDFS]

Each dataset could also offer its own configuration for the label set.

codeviking commented 2 years ago

Oh, also, I should say that your dataset proposal looks nice. I imagine each dataset might simply have a different top-level directory, say:

skiff_files/datasets/my-dataset
skiff_files/datasets/my-other-dataset

Then we could use a symlink to change the dataset that the application is pointed at, or wire something into the UI for doing so (which simply changes the path that's being read from). The latter route adds a little complexity: we'd need to protect against attacks that use path modifiers like ../ to read files that the application shouldn't have access to (like /etc/passwd! 😅 ). But this probably wouldn't be too hard to figure out.
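
One way to guard against that kind of path traversal is to resolve the requested path and check that it still sits under the datasets root. The sketch below is only an illustration: `resolve_dataset_dir` is a hypothetical helper, not part of PAWLS, and the root directory is whatever the application is configured to read from.

```python
from pathlib import Path


def resolve_dataset_dir(root: Path, name: str) -> Path:
    """Return root/name, refusing names that resolve outside root.

    Hypothetical helper: resolve() collapses ".." segments and follows
    symlinks, so the check applies to the real location on disk.
    """
    root = root.resolve()
    candidate = (root / name).resolve()
    # Reject the root itself and anything that is not a descendant of it.
    if candidate == root or root not in candidate.parents:
        raise ValueError(f"invalid dataset name: {name!r}")
    return candidate
```

With a root of skiff_files/datasets, a name like my-dataset resolves normally, while ../secret or /etc/passwd raises ValueError. Because symlinks are followed before the check, a symlink inside the root that points elsewhere is also rejected, which matters if the symlink approach is used as well.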

I'd say feel free to go about adding this, as I can't think of any reason not to!

jbarrow commented 2 years ago

Alright, added a --no-hash argument and updated the README.

I'll create a separate pull request for the full dataset implementation/discussion once this one's merged.