data-preservation-programs / singularity

Tool for onboarding data to the Filecoin Network

[Bug]: "One to many", "many to one" or "one to one" for dataset relationship to datasource - zero length temp dag file #204

Closed: distora-w3 closed this issue 1 year ago

distora-w3 commented 1 year ago

Description

Problem 1: Temporary zero-length dag file in the output directory

drwxrwxr-x 2 fc fc 4096 Aug 14 09:40 .
drwxrwxr-x 6 fc fc 4096 Aug 14 08:29 ..
-rw-rw-r-- 1 fc fc    0 Aug 14 08:59 9d55c1f3-b8cd-4155-9371-2fad030d059c.car
-rw-rw-r-- 1 fc fc  114 Aug 14 09:40 baga6....5qi4tsazlqbq.car
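
A quick way to spot these leftovers (a hedged sketch; the path is the test directory used in the steps below):

# list any zero-length .car files left in the output directories
find /mnt/blockstorage/testround2 -name '*.car' -size 0 -ls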

Problem 2: What is the agreed relationship between a data source and a dataset?
Which of these is correct?
 - Can a data source have many datasets?
 - Can a data source have only one dataset?
 - Can a dataset have many data sources?

Context:
I made a mistake by adding a new file to the wrong input folder, which created a zero-length dag file.

This might not be a problem in itself, but rather a consequence of my not understanding the data source to dataset relationship. In any case, zero-length temporary files should not remain after the process has terminated.

The example in the "Steps to Reproduce" section below shows a dataset with two data sources.

Related issue [1]:
If a dataset can have multiple data sources, then I expect the preparation to occur on the dataset, which means the dataset worker would scan all folders for the dataset AND its data sources. From the test, it appears only items in the referenced data source are scanned (the new item in data source 1 is missed and requires a manual scan; see the sketch below).
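
As a workaround, a manual rescan sketch (the rescan subcommand is the one the maintainer names in the reply below):

singularity datasource rescan 1
singularity run dataset-worker --exit-on-complete --exit-on-error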

Related issue [2]:
A CLI tool is needed to print out the dataset to datasource relationship, e.g.:

$ singularity dataset schema
Dataset    Datasource
01-test    [name 1]
           [name 2]

Steps to Reproduce

An incorrect datasource to dataset relationship can produce an empty dag file (see folder 01-test-out-a and 9d55c1f3-b8cd-4155-9371-2fad030d059c.car).

01-test-out:
total 16
drwxrwxr-x 2 fc fc 4096 Aug 14 08:59 .
drwxrwxr-x 6 fc fc 4096 Aug 14 08:29 ..
-rw-rw-r-- 1 fc fc  334 Aug 14 08:59 baga6ea4seaqiwnjf2ydmouvqhovhnismpsokdttmv3tiaipdkcafhl4qyndjscy.car
-rw-rw-r-- 1 fc fc  371 Aug 14 08:59 baga6ea4seaqnaqlh4koxgqetwjnzztxnmpdn3msyslcq3l7mfwkullew6gnkenq.car

01-test-out-a:
total 12
drwxrwxr-x 2 fc fc 4096 Aug 14 09:40 .
drwxrwxr-x 6 fc fc 4096 Aug 14 08:29 ..
-rw-rw-r-- 1 fc fc    0 Aug 14 08:59 9d55c1f3-b8cd-4155-9371-2fad030d059c.car
-rw-rw-r-- 1 fc fc  114 Aug 14 09:40 baga6ea4seaqdqljklw2naa5her4j2bt6otjg7zj3ijlwkrbpzs35qi4tsazlqbq.car

Observation: the dataset has two --output-dir locations (see the FilePath column).

$ singularity dataset list-pieces 01-test
2023-08-14T10:03:17.647Z        INFO    database        database/connstring_cgo.go:26   Opening sqlite database (cgo version)
ID  CreatedAt             PieceCID                                                          PieceSize    RootCID                                                      FileSize  FilePath                                                                                                         DatasetID  SourceID  ChunkID
1   2023-08-14 08:59:40Z  baga6ea4seaqiwnjf2ydmouvqhovhnismpsokdttmv3tiaipdkcafhl4qyndjscy  34359738368  bafkreig6kienkbdckf6dmt7shmtq3yocnhjwbl2ki3g3wyzogyq5oprftq  334       /mnt/blockstorage/testround2/01-test-out/baga6ea4seaqiwnjf2ydmouvqhovhnismpsokdttmv3tiaipdkcafhl4qyndjscy.car    1          1         1
2   2023-08-14 08:59:40Z  baga6ea4seaqnaqlh4koxgqetwjnzztxnmpdn3msyslcq3l7mfwkullew6gnkenq  34359738368  bafkreidqt6tqn52ubergv3gjl7kuhh4xw5q4m2ao2zz2ki4j3yhrl4zuxy  371       /mnt/blockstorage/testround2/01-test-out/baga6ea4seaqnaqlh4koxgqetwjnzztxnmpdn3msyslcq3l7mfwkullew6gnkenq.car    1          1
3   2023-08-14 09:40:26Z  baga6ea4seaqdqljklw2naa5her4j2bt6otjg7zj3ijlwkrbpzs35qi4tsazlqbq  34359738368  bafkreih6xy24gufltqg7rewebsjzuf7sx7zpmh4n7xaicha66p3hiavogu  114       /mnt/blockstorage/testround2/01-test-out-a/baga6ea4seaqdqljklw2naa5her4j2bt6otjg7zj3ijlwkrbpzs35qi4tsazlqbq.car  1          1         2
[START] ===== This seems to work =====
# prep
# reset database & delete file
rm /mnt/blockstorage/testround2/01-test*/*
singularity admin reset --really-do-it

# create test files in shell
cd /mnt/blockstorage/testround2/01-test-in
for i in {1..5}; do echo "Hello from file ${i}" > "hello${i}.txt"; done

# check empty
singularity dataset list
singularity datasource list

# create a dataset
singularity dataset create  --output-dir /mnt/blockstorage/testround2/01-test-out 01-test
singularity datasource add local 01-test /mnt/blockstorage/testround2/01-test-in
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource daggen 1
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource inspect dags 1

# Now update -----------------
mkdir /mnt/blockstorage/testround2/01-test-out-a
singularity dataset update  --output-dir /mnt/blockstorage/testround2/01-test-out-a 01-test
singularity run dataset-worker --exit-on-complete --exit-on-error

# >>>   add a new file to datasource 2  <<<<<
mkdir /mnt/blockstorage/testround2/01-test-in-a
cd /mnt/blockstorage/testround2/01-test-in-a        # directory ok
echo "Hello from file 6" > hello6.txt
singularity datasource add local 01-test /mnt/blockstorage/testround2/01-test-in-a
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource daggen 2
singularity run dataset-worker --exit-on-complete --exit-on-error
# check to see what has occurred
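# verification sketch (list-pieces as used elsewhere in this report; the dags
# inspection for datasource 2 is assumed analogous to the inspect call above)
singularity dataset list-pieces 01-test
singularity datasource inspect dags 2
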
[START] ===== This does NOT seem to work =====

# prep
# reset database & delete files

rm /mnt/blockstorage/testround2/01-test*/*
singularity admin reset --really-do-it

# create test files in shell
cd /mnt/blockstorage/testround2/01-test-in
for i in {1..5}; do echo "Hello from file ${i}" > "hello${i}.txt"; done

# check empty
singularity dataset list
singularity datasource list

# create a dataset
singularity dataset create  --output-dir /mnt/blockstorage/testround2/01-test-out 01-test
singularity datasource add local 01-test /mnt/blockstorage/testround2/01-test-in
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource daggen 1
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource inspect dags 1

# Now update -----------------
mkdir /mnt/blockstorage/testround2/01-test-out-a
singularity dataset update  --output-dir /mnt/blockstorage/testround2/01-test-out-a 01-test
singularity run dataset-worker --exit-on-complete --exit-on-error

# >>> add a new file to datasource 1  <<<<<<<<<
mkdir /mnt/blockstorage/testround2/01-test-in-a
cd /mnt/blockstorage/testround2/01-test-in           # directory not ok
echo "Hello from file 6" > hello6.txt
singularity datasource add local 01-test /mnt/blockstorage/testround2/01-test-in-a
singularity run dataset-worker --exit-on-complete --exit-on-error
singularity datasource daggen 2
singularity run dataset-worker --exit-on-complete --exit-on-error
# check to see what has occurred
# the new file is not scanned
# a zero-length temp dag file is created
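
# confirmation sketch: hello6.txt is on disk in 01-test-in, but no new piece
# for it appears, and the temp dag file in 01-test-out-a is zero bytes
ls -la /mnt/blockstorage/testround2/01-test-in
singularity dataset list-pieces 01-test
ls -la /mnt/blockstorage/testround2/01-test-out-a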

Version

singularity v0.2.47-1f0d416-dirty

Operating System

Linux

Database Backend

SQLite

Additional context

As part of testing #90

xinaxu commented 1 year ago

The dataset to datasource relationship is one-to-many: one dataset can have multiple data sources. The dataset is just a high-level grouping of data sources. Preparation happens on all data sources, though your first data source is already prepared. If you would like to automatically rescan the first data source, you can either:

  1. specify --rescan-interval on the datasource and keep the dataset worker running;
  2. use singularity datasource rescan to trigger another round of scanning; or
  3. use the push API to tell singularity about the new file.

I get that this 1:N mapping is confusing. The intention was to allow grouping of similar data sources, i.e. every month you have a new datasource common-crawl-2023-Aug added to the same dataset, and you would like to manage deal making with the same dataset. I'm also trying to think about a better way to make it less confusing; maybe we can move the datasource menu inside dataset, i.e. singularity dataset add-datasource. Would love your opinions.
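
For option 1, a sketch of the intended usage (the interval value and exact flag placement are assumptions based on the comment above; check the CLI help for the precise spelling):

singularity datasource add local --rescan-interval 1h 01-test /mnt/blockstorage/testround2/01-test-in
singularity run dataset-worker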

Other items from this discussion would be:

  1. list the dataset - datasource mapping
  2. investigate the empty dag - I assume it is produced by an empty data source
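
A quick check of that hypothesis against this report (hedged sketch): datasource 2 pointed at 01-test-in-a, which was still empty when it was scanned, since hello6.txt landed in 01-test-in instead.

# the input directory for datasource 2 had no files at scan time
ls -la /mnt/blockstorage/testround2/01-test-in-a
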
distora-w3 commented 1 year ago

Thank you. I will think about your comments and provide feedback.

  1. Yes
  2. Yes - I think this is because the data source is empty in this instance.
distora-w3 commented 1 year ago

OK, I have had a chance to think about this overnight. We can reduce the burden of understanding how the various components of singularity fit together if we follow concepts from tools users are already familiar with.

For example, sysadmins will be familiar with logical volume management tools such as lvm, zfs, and mdadm. The approach these tools take is: create the low-level storage objects first, attach them to a higher-level group, then manage status and operations at the group level.

In the case of singularity, this means:

  1. singularity datasource create Name --inputdir dir #name here is very important
  2. singularity dataset attach Dataset_Name Name_from_above # attach and detach are the tasks
  3. singularity dataset status
  4. singularity dataset enable (also pause or sync)
  5. singularity dataset run Dataset_Name # worker is a background task, I want my dataset prepared/paused

Example usage and output

1a. singularity datasource create Xinan_db --inputdir /mnt/xinan-server/dbdump
1b. singularity datasource create Xinan_sys --inputdir /mnt/xinan-server/sysdump

2. singularity dataset attach XinanSet Xinan_db Xinan_sys

3. singularity dataset status XinanSet --size

Dataset    Source     Status    LastScan       Size
====================================================
XinanSet   Xinan_db   paused    20230815...    1000 TiB
XinanSet   Xinan_sys  ready     never           500 TiB

4. singularity dataset enable Xinan_db #user has granular control

5. singularity dataset run XinanSet


6. singularity dataset status XinanSet --progress

Dataset    Source     Status    LastScan    Size        Progress
=================================================================
XinanSet   Xinan_db   running   ----        1000 TiB    50% complete
XinanSet   Xinan_sys  running   ----         500 TiB     1% complete

Another issue to think about:

xinaxu commented 1 year ago

The major concern of this issue is addressed by #221