NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0
327 stars 32 forks source link

Broken Tutorial: single_gpu_tutorial.ipynb #105

Open axelmagn opened 2 weeks ago

axelmagn commented 2 weeks ago

tutorials/single_node_tutorial/single_gpu_tutorial.ipynb section 5.3 contains a line:

# Read .jsonl input data
ddf_text = get_text_ddf_from_json_path_with_blocksize(
    input_data_paths=input_data_paths,
    num_files=num_files,
    blocksize=text_ddf_blocksize,
    id_column=input_id_field,
    text_column=input_text_field,
)

which produces the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[48], line 5
      2 num_workers = get_num_workers(client)
      4 # Read .jsonl input data
----> 5 ddf_text = get_text_ddf_from_json_path_with_blocksize(
      6     input_data_paths=input_data_paths,
      7     num_files=num_files,
      8     blocksize=text_ddf_blocksize,
      9     id_column=input_id_field,
     10     text_column=input_text_field,
     11 )
     12 # Read "_buckets.parquet"
     13 ddf_bk = get_bucket_ddf_from_parquet_path(input_bucket_path=input_bucket_path, num_workers=num_workers)

TypeError: get_text_ddf_from_json_path_with_blocksize() missing 1 required positional argument: 'input_meta'

I've checked the corresponding function, and it does indeed require a input_meta argument. However it's missing a docstring and I'm not sure what form this argument should take.

I suspect that the signature of this function has drifted since the tutorial was written. Maybe consider giving input_meta a default value (if possible), or if not, writing some docstrings for these functions?

ayushdg commented 2 weeks ago

Thanks for opening @axelmagn. You're indeed right that the function has drifted since the tutorial was initially written. In this case I'm in favor of having a default for input_meta set to None for this method.

cc: @miguelusque if you have any opinions.

miguelusque commented 2 weeks ago

Hi @ayushdg , I think you are right. Setting it to None should make it work.

I will change that method to have a default value of None to avoid the happening in the future. Apologies for the inconvenience.