Closed distora-w3 closed 1 year ago
The relationship between dataset and datasource is one-to-many: one dataset can have multiple data sources.
The dataset is just a high-level grouping of data sources.
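To illustrate the grouping, here is a minimal sketch of that one-to-many relationship. The class and field names are illustrative only, not Singularity's actual schema:

```python
# Illustrative model of the dataset -> datasource 1:N relationship.
# All names here are hypothetical, not Singularity's real data model.
from dataclasses import dataclass, field

@dataclass
class DataSource:
    name: str
    input_dir: str

@dataclass
class Dataset:
    name: str
    sources: list = field(default_factory=list)  # one dataset, many sources

    def add_source(self, source: DataSource) -> None:
        self.sources.append(source)

# A monthly crawl: each month's dump becomes a new datasource
# attached to the same dataset, so deal making is managed in one place.
crawl = Dataset("common-crawl")
crawl.add_source(DataSource("common-crawl-2023-Jul", "/mnt/cc/2023-07"))
crawl.add_source(DataSource("common-crawl-2023-Aug", "/mnt/cc/2023-08"))
assert len(crawl.sources) == 2
```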
Preparation happens on all data sources, though your first data source is already prepared. If you would like to automatically rescan the first data source, you can:
1. specify --rescan-interval on the datasource and keep the dataset worker running;
2. use singularity datasource rescan to trigger another round of scanning; or
3. use the push API to tell Singularity about the new file.
I get that this 1:N mapping is confusing; the intention was to allow grouping of similar data sources, e.g. every month you have a new datasource common-crawl-2023-Aug
added to the same dataset, and you would like to manage deal making with the same dataset.
I'm also trying to think about a better way to make it less confusing; maybe we can move the datasource menu inside dataset, i.e. singularity dataset add-datasource.
Would love your opinions.
Other items from this discussion would be
Thank you. I will think about your comments and provide feedback.
Ok, I have managed to think about this overnight. We can reduce the burden of understanding how the various components of Singularity fit together if we follow the concepts of other tools that users will already be familiar with.
For example, sysadmins will be familiar with logical volume management tools such as lvm, zfs, and mdadm. Their approach is as follows:
In the case of singularity, it means:
singularity datasource create Name --inputdir dir   # name here is very important
singularity dataset attach Dataset_Name Name_from_above   # attach and detach are the tasks
singularity dataset status
singularity dataset enable   # (also pause or sync)
singularity dataset run Dataset_Name   # worker is a background task, I want my dataset prepared/paused

Example usage and output
1a. singularity datasource create Xinan_db --inputdir /mnt/xinan-server/dbdump
1b. singularity datasource create Xinan_sys --inputdir /mnt/xinan-server/sysdump
2. singularity dataset attach XinanSet Xinan_db Xinan_sys
3. singularity dataset status XinanSet --size
Dataset     Source      Status   LastScan       Size
===============================================================
XinanSet    Xinan_db    paused   20230815....   1000 TiB
XinanSet    Xinan_sys   ready    never          500 TiB
4. singularity dataset enable Xinan_db   # user has granular control
5. singularity dataset run XinanSet
6. singularity dataset status XinanSet --progress
Dataset     Source      Status    LastScan   Size       Progress
======================================================================================
XinanSet    Xinan_db    running   ----       1000 TiB   50% complete
XinanSet    Xinan_sys   running   ----       500 TiB    1% complete
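The proposed create/attach/enable/run lifecycle above can be sketched as a small state model. This is an illustrative mock of the workflow being proposed, not Singularity code; all class and method names are hypothetical:

```python
# Hypothetical state model for the proposed attach/enable/run workflow.
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    status: str = "paused"  # paused | ready | running

class Dataset:
    def __init__(self, name: str):
        self.name = name
        self.sources = {}

    def attach(self, *sources: Source) -> None:
        # attach/detach manage membership, mirroring vgextend-style tools
        for s in sources:
            self.sources[s.name] = s

    def enable(self, name: str) -> None:
        # user has granular, per-source control
        self.sources[name].status = "ready"

    def run(self) -> None:
        # the worker only prepares sources that were enabled
        for s in self.sources.values():
            if s.status == "ready":
                s.status = "running"

ds = Dataset("XinanSet")
db, sysdump = Source("Xinan_db"), Source("Xinan_sys")
ds.attach(db, sysdump)
ds.enable("Xinan_db")
ds.run()
assert db.status == "running" and sysdump.status == "paused"
```

The point of the sketch is the separation of concerns: attach/detach manage membership, enable/pause set intent per source, and run is the background task that acts on that intent.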
Another issue to think about: lotus-miner, lotus-worker, and lotus for admin, e.g. lotus daemon.
The major concern of this issue is addressed by #221
Description
Steps to Reproduce
An incorrect relationship between datasource and dataset can produce an empty DAG file (see folder 01-test-out-a and 9d55c1f3-b8cd-4155-9371-2fad030d059c.car).
Observation: the dataset has two --output dirs (see the filepath column).
Version
singularity v0.2.47-1f0d416-dirty
Operating System
Linux
Database Backend
SQLite
Additional context
As part of testing #90