NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

ais etl can't fetch data from the bucket #145

Closed by yingca1 1 year ago

yingca1 commented 1 year ago
ais create ais://data_bucket_1

ais bucket props set ais://data_bucket_1 backend_bck=gs://dataset_raw

update:

if I do

then

it successfully processes 00081, but the other files still don't work.

yingca1 commented 1 year ago

It seems that ais etl can only fetch files that have already been cached in the bucket.

gaikwadabhishek commented 1 year ago

Hey @yingca1! Can you try processing directly from the remote bucket? For example: ais etl bucket transformer-etl <etl-name> gs://dataset_raw ais://dst

Also, check whether your previous transformation left any logs or errors: ais etl logs <etl-name>

yingca1 commented 1 year ago
ais ls gs://dataset_raw
NAME             SIZE            CACHED  
00001.tar        139.97MiB       yes     
00002.tar        150.64MiB       no      
00003.tar        155.62MiB       no      
00004.tar        164.40MiB       no      
00005.tar        148.05MiB       no      
00006.tar        167.46MiB       no      
00007.tar        155.25MiB       no      
00008.tar        148.16MiB       no      
00009.tar        148.16MiB       no      
00010.tar        174.16MiB       no      
00011.tar        157.26MiB       no      
00012.tar        126.54MiB       no      
00013.tar        140.54MiB       no      
00014.tar        166.49MiB       no      
00015.tar        155.99MiB       no      
00016.tar        139.45MiB       no      
00017.tar        166.50MiB       no      
00018.tar        141.96MiB       no      
00019.tar        143.59MiB       no      
00020.tar        150.62MiB       no      
00021.tar        152.93MiB       no      
00022.tar        152.81MiB       no      
00023.tar        128.04MiB       no      
00024.tar        146.64MiB       no      
00025.tar        157.93MiB       no      
00026.tar        150.59MiB       no      
00027.tar        136.73MiB       no      
00028.tar        151.62MiB       no      
00029.tar        151.45MiB       no      
00030.tar        156.35MiB       no      
00031.tar        138.45MiB       no      
00032.tar        136.20MiB       no      
00033.tar        143.92MiB       no      
00034.tar        159.80MiB       no      
00035.tar        134.40MiB       no      
00036.tar        177.63MiB       no      
00037.tar        151.78MiB       no      
00038.tar        153.73MiB       no      
00039.tar        160.19MiB       no      
00040.tar        139.48MiB       no      
00041.tar        136.02MiB       no      
00042.tar        150.70MiB       no      
00043.tar        131.01MiB       no      
00044.tar        140.57MiB       no      
00045.tar        151.36MiB       no      
00046.tar        153.03MiB       no      
00047.tar        142.15MiB       no      
00048.tar        149.41MiB       no      
00049.tar        138.68MiB       no      
00050.tar        157.70MiB       no      
00051.tar        135.21MiB       no      
00052.tar        157.94MiB       no      
00053.tar        148.85MiB       no      
00054.tar        165.08MiB       no      
00055.tar        146.65MiB       no      
00056.tar        159.91MiB       no      
00057.tar        123.22MiB       no      
00058.tar        139.02MiB       no      
00059.tar        153.07MiB       no      
00060.tar        150.39MiB       no      
00061.tar        141.47MiB       no      
00062.tar        162.76MiB       no      
00063.tar        137.81MiB       no      
00064.tar        144.43MiB       no      
00065.tar        165.58MiB       no      
00066.tar        148.15MiB       no      
00067.tar        144.23MiB       no      
00068.tar        151.54MiB       no      
00069.tar        151.61MiB       no      
00070.tar        146.01MiB       no      
00071.tar        134.46MiB       no      
00072.tar        145.56MiB       no      
00073.tar        137.06MiB       no      
00074.tar        144.52MiB       no      
00075.tar        151.15MiB       no      
00076.tar        146.14MiB       no      
00077.tar        136.53MiB       no      
00078.tar        145.85MiB       no      
00079.tar        149.72MiB       no      
00080.tar        146.18MiB       no      
00081.tar        150.58MiB       no      
00082.tar        164.97MiB       no      
00083.tar        145.10MiB       no      
00084.tar        145.37MiB       no      
00085.tar        141.30MiB       no      
00086.tar        143.17MiB       no      
00087.tar        143.07MiB       no      
00088.tar        139.75MiB       no      
00089.tar        155.99MiB       no      
00090.tar        151.40MiB       no      
00091.tar        142.84MiB       no      
00092.tar        189.78MiB       no      
00093.tar        190.97MiB       no      
00094.tar        192.47MiB       no      
00095.tar        183.54MiB       no      
00096.tar        207.72MiB       no      
00097.tar        209.75MiB       no      
00098.tar        214.41MiB       no      
00099.tar        208.27MiB       no      
00100.tar        176.06MiB       no  

ais etl bucket transformer-etl gs://dataset_raw ais://out1

ais ls ais://out1
NAME             SIZE            
00001.tar        75.78MiB  

@gaikwadabhishek The result is the same when reading directly from GCS: only the one cached object (00001.tar) ends up in the destination bucket.
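
To see exactly which objects the transformation skipped, the listing above can be filtered on its CACHED column. This is a minimal sketch, not part of the ais CLI: `uncached_names` is a hypothetical helper, and it assumes the three-column layout (NAME, SIZE, CACHED) shown in the `ais ls` output in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read `ais ls` output on stdin and print the names
# of objects whose CACHED column says "no".
# Assumes the column order NAME, SIZE, CACHED, with one header line.
uncached_names() {
  awk 'NR > 1 && $3 == "no" { print $1 }'
}

# Usage (requires a configured ais CLI):
#   ais ls gs://dataset_raw | uncached_names
```

For the listing in this thread, that would print every object except 00001.tar, which matches the single object that showed up in ais://out1.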

yingca1 commented 1 year ago
  1. Is there any way to quickly cache all the files in the bucket?
  2. Can we delete the files processed by ETL in a timely and dynamic manner?
  3. Can't ETL fetch data remotely while processing?

update:

  1. ais start download --sync gs://dataset_raw ais://data_bucket_1 (see https://github.com/NVIDIA/aistore/blob/e13261ec748d80d2fbea8797ce5974e2e6f325e1/docs/downloader.md#example)

gaikwadabhishek commented 1 year ago

This blog might help with your current task (it answers questions 1 and 3). For question 2, I don't think aistore supports that yet, but you could manage the files with scripts.

alex-aizman commented 1 year ago

Is there any way to quickly cache all the files in the bucket?

"Cache" and "files" can mean different things in different contexts. But if you mean what I think you mean, here are some easy CLI pointers:

$ ais etl bucket --help | grep all
   --all      transform all objects from a remote bucket including those that are not present (not "cached") in the cluster

and specifically for files (not objects):

$ ais object promote --help

and also:

$ ais start download --help

and more:

$ ais start prefetch --help

Those are some of the supported ways. But the most popular way is to just start running.
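
As a rough sketch of how these pointers might apply to the bucket in this thread: the `--all` flag comes from the help text quoted above, and the ETL and bucket names are the ones used earlier in the issue. The `run` wrapper is a hypothetical dry-run helper that only prints each command unless DRY_RUN=0 is set. Whether `ais start prefetch` accepts a bare bucket with no selection flags may depend on the CLI version, so check its `--help` first.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: print the command by default,
# execute it only when DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Option A: transform every object in the remote bucket in one pass,
# including those not yet present ("cached") in the cluster.
transform_all() {
  run ais etl bucket --all transformer-etl gs://dataset_raw ais://out1
}

# Option B: warm the cluster first, then transform as before.
# Assumption: prefetch takes the bucket as its only argument here.
prefetch_bucket() {
  run ais start prefetch gs://dataset_raw
}

transform_all
prefetch_bucket
```

With the default dry-run setting this only echoes the two commands, which makes it easy to review them before running anything against a live cluster.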