StatCan / aaw-kubeflow-containers

Containers built to be used with Kubeflow for Data Science

Feature: paperless for managing messy datasets #562

Closed bryanpaget closed 4 months ago

bryanpaget commented 9 months ago

We've all been there: you download a large multimedia dataset, then feel overwhelmed by managing it as a whole and unsure how to find the needles in your haystack. Paperless is an excellent document management app; it does OCR and has lots of metadata editing features. It's great for managing datasets of documents (PDF, Word, etc.) and scans of documents containing text.

An alternative to paperless is:

Since this requires Docker (like Grist), we would want to host paperless with Kubeflow. The use case: someone has a giant, messy, multimedia dataset, and they create a new Kubeflow server running paperless with an attached shared disk. They load their data into paperless, which stores it on the shared disk. They can then access, manage, sort, and modify the data, and it remains accessible from other Kubeflow notebook servers via the same shared disk.
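
To illustrate the last step (reading the managed files from another notebook server via the shared disk), here is a minimal Python sketch. The mount path `/home/jovyan/shared/paperless` and the helper name `summarize_documents` are hypothetical. The sketch assumes nothing about how paperless lays out its storage; it walks whatever directory it is given and counts files by extension.

```python
from collections import Counter
from pathlib import Path


def summarize_documents(root):
    """Count the files under `root` (recursively) by lowercase extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical mount point of the shared disk on a second notebook server.
    shared = Path("/home/jovyan/shared/paperless")
    if shared.is_dir():
        for ext, n in summarize_documents(shared).most_common():
            print(f"{ext}\t{n}")
    else:
        print(f"{shared} is not mounted on this server")
```

A quick inventory like this would be one way to confirm that a second notebook server sees the same files that the paperless server wrote to the disk.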

The problem being solved here is the lack of powerful data management on the AAW for someone who wants to do data analysis or machine learning with a messy dataset.

Souheil-Yazji commented 4 months ago

Push to zone users instead.