gukejun1 opened this issue 1 year ago
@gukejun1 can you provide more info about your env? How and where did you install the Merlin libraries? Are you using a docker image? If yes, which docker image? Thanks.
The code is the same as this; I use the Docker image nvcr.io/nvidia/merlin/merlin-tensorflow:22.12.
@rnyak this is my full code
# Imports as in the scaling-criteo example; exact module paths may differ slightly between Merlin versions
import os
import re
import shutil
import warnings

import numba
import numpy as np

import nvtabular as nvt
from nvtabular.ops import AddMetadata, Categorify, Clip, FillMissing, Normalize
from merlin.schema import Tags
from merlin.core.utils import device_mem_size, pynvml_mem_size

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

BASE_DIR = os.environ.get("BASE_DIR", "/raid/data/criteo")
INPUT_DATA_DIR = os.environ.get("INPUT_DATA_DIR", BASE_DIR + "/converted/criteo")
OUTPUT_DATA_DIR = os.environ.get("OUTPUT_DATA_DIR", BASE_DIR + "/test_dask/output")
USE_HUGECTR = bool(os.environ.get("USE_HUGECTR", ""))
print(USE_HUGECTR)
stats_path = os.path.join(OUTPUT_DATA_DIR, "test_dask/stats")
dask_workdir = os.path.join(OUTPUT_DATA_DIR, "test_dask/workdir")
# Make sure we have a clean worker space for Dask
if os.path.isdir(dask_workdir):
    shutil.rmtree(dask_workdir)
os.makedirs(dask_workdir)
# Make sure we have a clean stats space for Dask
if os.path.isdir(stats_path):
    shutil.rmtree(stats_path)
os.mkdir(stats_path)
# Make sure we have a clean output path
if os.path.isdir(OUTPUT_DATA_DIR):
    shutil.rmtree(OUTPUT_DATA_DIR)
os.mkdir(OUTPUT_DATA_DIR)
fname = "day_{}.parquet"
num_days = len(
    [i for i in os.listdir(INPUT_DATA_DIR) if re.match(fname.format("[0-9]{1,2}"), i) is not None]
)
train_paths = [os.path.join(INPUT_DATA_DIR, fname.format(day)) for day in range(num_days - 1)]
valid_paths = [
    os.path.join(INPUT_DATA_DIR, fname.format(day)) for day in range(num_days - 1, num_days)
]
train_paths="/raid/data/criteo/converted/criteo/day_0_40000000.parquet"
valid_paths="/raid/data/criteo/converted/criteo/day_1_4000000.parquet"
print(train_paths)
print(valid_paths)
# Dask dashboard
dashboard_port = "8787"
protocol = "tcp" # "tcp" or "ucx"
if numba.cuda.is_available():
    NUM_GPUS = list(range(len(numba.cuda.gpus)))
else:
    NUM_GPUS = []
visible_devices = ",".join([str(n) for n in NUM_GPUS]) # Select devices to place workers
device_limit_frac = 0.7 # Spill GPU-Worker memory to host at this limit.
device_pool_frac = 0.8
part_mem_frac = 0.15
# Use total device size to calculate args.device_limit_frac
device_size = device_mem_size(kind="total")
device_limit = int(device_limit_frac * device_size)
device_pool_size = int(device_pool_frac * device_size)
part_size = int(part_mem_frac * device_size)
# Check if any device memory is already occupied
for dev in visible_devices.split(","):
    fmem = pynvml_mem_size(kind="free", index=int(dev))
    used = (device_size - fmem) / 1e9
    if used > 1.0:
        warnings.warn(f"BEWARE - {used} GB is already occupied on device {int(dev)}!")
cluster = None # (Optional) Specify existing scheduler port
if cluster is None:
    cluster = LocalCUDACluster(
        protocol=protocol,
        n_workers=len(visible_devices.split(",")),
        CUDA_VISIBLE_DEVICES=visible_devices,
        device_memory_limit=device_limit,
        local_directory=dask_workdir,
        dashboard_address=":" + dashboard_port,
        rmm_pool_size=(device_pool_size // 256) * 256,
    )
# Create the distributed client
client = Client(cluster)
print(client)
# define our dataset schema
CONTINUOUS_COLUMNS = ["I" + str(x) for x in range(1, 14)]
CATEGORICAL_COLUMNS = ["C" + str(x) for x in range(1, 27)]
LABEL_COLUMNS = ["label"]
COLUMNS = CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS + LABEL_COLUMNS
num_buckets = 10000000
categorify_op = Categorify(out_path=stats_path, max_size=num_buckets, dtype='int32')
# categorify_op = Categorify(out_path=stats_path, max_size=num_buckets, dtype=np.zeros(0))
cat_features = CATEGORICAL_COLUMNS >> categorify_op
cont_features = CONTINUOUS_COLUMNS >> FillMissing() >> Clip(min_value=0) >> Normalize(out_dtype='float32')
# cont_features = CONTINUOUS_COLUMNS >> FillMissing() >> Clip(min_value=0) >> Normalize(out_dtype=np.zeros(0))
label_features = LABEL_COLUMNS >> AddMetadata(
    tags=[str(Tags.BINARY_CLASSIFICATION), "target"]
)
features = cat_features + cont_features + label_features
workflow = nvt.Workflow(features)
dict_dtypes = {}
# The environment variable USE_HUGECTR defines, if we want to use the output for HugeCTR or another framework
for col in CATEGORICAL_COLUMNS:
    dict_dtypes[col] = np.int64 if USE_HUGECTR else np.int32
for col in CONTINUOUS_COLUMNS:
    dict_dtypes[col] = np.float32
for col in LABEL_COLUMNS:
    dict_dtypes[col] = np.int32
print(dict_dtypes)
train_dataset = nvt.Dataset(train_paths, engine="parquet", part_size=part_size)
valid_dataset = nvt.Dataset(valid_paths, engine="parquet", part_size=part_size)
output_train_dir = os.path.join(OUTPUT_DATA_DIR, "train/")
output_valid_dir = os.path.join(OUTPUT_DATA_DIR, "valid/")
# ! mkdir -p $output_train_dir
# ! mkdir -p $output_valid_dir
print(workflow)
workflow.fit(train_dataset)
# train_dataset.fillna(0, inplace=True)
workflow.transform(train_dataset).to_parquet(
    output_files=len(NUM_GPUS),
    output_path=output_train_dir,
    shuffle=nvt.io.Shuffle.PER_PARTITION,
    dtypes=dict_dtypes,
    cats=CATEGORICAL_COLUMNS,
    conts=CONTINUOUS_COLUMNS,
    labels=LABEL_COLUMNS,
)
workflow.transform(valid_dataset).to_parquet(
    output_path=output_valid_dir,
    dtypes=dict_dtypes,
    cats=CATEGORICAL_COLUMNS,
    conts=CONTINUOUS_COLUMNS,
    labels=LABEL_COLUMNS,
)
workflow.save(os.path.join(OUTPUT_DATA_DIR, "workflow"))
I installed the Merlin libraries from https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow.
@rnyak The training data is the first 40 million rows of day_0 in the Criteo data set, and the validation data is the first 4 million rows of day_1. The figure below shows a sample of the parquet data.
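For reference, a quick way to inspect those files (a sketch, assuming pandas is available in the container and using the paths from the code above):

import pandas as pd

# Peek at the converted training file before NVTabular touches it
train_df = pd.read_parquet("/raid/data/criteo/converted/criteo/day_0_40000000.parquet")
print(train_df.shape)
print(train_df.dtypes)
print(train_df.head())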
@gukejun1 if you have null values, then normally, when you apply the following lines in the NVT workflow, the missing/null values should be filled (see the short sketch below).
cat_features = CATEGORICAL_COLUMNS >> categorify_op
cont_features = CONTINUOUS_COLUMNS >> FillMissing() >> Clip(min_value=0) >> Normalize(out_dtype='float32')
Can you share a subset of your parquet file, say only a couple of hundred rows, so that we can reproduce the issue? Thanks.
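For illustration, here is a minimal, self-contained sketch of that fill behavior on a toy pandas DataFrame (the toy column names and values are made up; this is not the Criteo data):

import pandas as pd
import nvtabular as nvt
from nvtabular.ops import Categorify, Clip, FillMissing, Normalize

toy = pd.DataFrame({"I1": [1.0, None, 3.0], "C1": ["a", None, "b"]})
# Same two op chains as above: Categorify for the categorical column,
# FillMissing >> Clip >> Normalize for the continuous column
graph = (["C1"] >> Categorify()) + (["I1"] >> FillMissing() >> Clip(min_value=0) >> Normalize())
out = nvt.Workflow(graph).fit_transform(nvt.Dataset(toy)).to_ddf().compute()
print(out)  # I1 nulls are filled (default fill value 0) before Clip/Normalize; C1 nulls are encoded by Categorify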
@rnyak day_1_100.parquet.txt: the data comes from the first 100 rows of Criteo day_1 and was converted to parquet using the official notebook Merlin/examples/scaling-criteo/01_download_convert.ipynb.
@gukejun1 I used your small dataset with this notebook and everything worked fine for me; I cannot reproduce your error. Are you able to reproduce the error with only this small parquet file?
@rnyak Very strange.
@gukejun1 please note that your screenshot shows that you are trying to read in a .parquet.txt file, not a .parquet file, which means your file extension is not correct. It should be .parquet.
@rnyak It's the same file; only the extension differs, and the content is in parquet format. Because GitHub does not allow uploading files with the .parquet extension, I changed the extension to .txt.
@rnyak So, this code didn't work.
@rnyak My graphics card supports up to CUDA 11.3, so I reinstalled cupy-cuda as cupy-cuda113. Is the error related to this? Does cupy-cuda113 support filling missing values?
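As a side note, one way to confirm which CUDA runtime the installed cupy build targets (a sketch; filling missing values is done by NVTabular/cudf, not by cupy directly):

import cupy as cp

print(cp.__version__)                        # version of the installed cupy-cuda113 package
print(cp.cuda.runtime.runtimeGetVersion())   # e.g. 11030 means CUDA 11.3
print(cp.cuda.runtime.driverGetVersion())    # highest CUDA version supported by the installed driver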
@gukejun1 what's your graphics card?
Your sample set does not have any nulls in the label column, so I am skeptical that this line gives you the error. You can remove this line and test it; you can do it like below (and try the result on your small sample, as sketched after the code).
CONTINUOUS_COLUMNS = ["I" + str(x) for x in range(1, 14)]
CATEGORICAL_COLUMNS = ["C" + str(x) for x in range(1, 27)]
LABEL_COLUMNS = ["label"]
COLUMNS = CONTINUOUS_COLUMNS + CATEGORICAL_COLUMNS + LABEL_COLUMNS
num_buckets = 10000000
categorify_op = Categorify(out_path=stats_path, max_size=num_buckets, dtype='int32')
cat_features = CATEGORICAL_COLUMNS >> categorify_op
cont_features = CONTINUOUS_COLUMNS >> FillMissing() >> Clip(min_value=0) >> Normalize(out_dtype='float32')
features = cat_features + cont_features
workflow = nvt.Workflow(features)
...
...
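For completeness, the simplified workflow could then be exercised against the small 100-row sample (a sketch; day_1_100.parquet here stands for the renamed sample file):

small = nvt.Dataset("day_1_100.parquet", engine="parquet")
out = workflow.fit_transform(small).to_ddf().compute()
print(out.isna().sum())  # expect all zeros if Categorify/FillMissing handled the nulls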
@rnyak The error is still reported. My graphics card is an NVIDIA Tesla P4.
@gukejun1 cudf supports Pascal architecture or better (Compute Capability >= 6.0); see this doc. Can you test if you are able to run the notebooks 01 and 02 in this folder? Can you also share your pip list output in a .txt file and your nvidia-smi output, and confirm the image is merlin-tensorflow:22.12? Thanks.
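A minimal way to check the card's compute capability from inside the container (a sketch using numba; the Tesla P4 is Pascal, which reports compute capability (6, 1)):

from numba import cuda

dev = cuda.get_current_device()
print(dev.name, dev.compute_capability)  # cudf requires compute capability >= 6.0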
@gukejun1 the error looks like it comes from pandas, and it looks like you are running on CPU, not on GPU. Please confirm that visible_devices from the code below does not come back empty; if it is empty, that means you are not using the GPU.
protocol = "tcp" # "tcp" or "ucx"
if numba.cuda.is_available():
    NUM_GPUS = list(range(len(numba.cuda.gpus)))
else:
    NUM_GPUS = []
visible_devices = ",".join([str(n) for n in NUM_GPUS]) # Select devices to place workers
@rnyak For the movie_lens case, notebooks 01 and 02 run successfully.
1. requirements.txt
2. (screenshot)
3. I used docker pull nvcr.io/nvidia/merlin/merlin-tensorflow:22.12 to get the Docker image.
@rnyak it used the GPU.
When I run that case, an error is reported. Why is the error that NaN cannot be converted to an int value still reported? The official example is supposed to handle missing values. How can I solve this problem?
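One way to narrow this down (a diagnostic sketch, assuming pandas is available and using the validation path from the code above) is to check which columns still contain nulls before the workflow runs; note that in the workflow above FillMissing is applied only to the continuous columns, so nulls in, for example, the label column would not be filled:

import pandas as pd

df = pd.read_parquet("/raid/data/criteo/converted/criteo/day_1_4000000.parquet")
print(df.isna().sum())  # which columns still contain nulls
print(df.dtypes)        # check whether the label column is a float holding NaNs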