Building-ML-Pipelines / building-machine-learning-pipelines

Code repository for the O'Reilly publication "Building Machine Learning Pipelines" by Hannes Hapke & Catherine Nelson
MIT License

Cannot get artifacts during the train-eval-test split #8

Closed BandaruMeghana closed 4 years ago

BandaruMeghana commented 4 years ago

Hi, referring to Chapter 3, I split the data into train, eval, and test sets. In the component outputs I can see the split names, but when I try to print the artifacts, I don't see the three artifacts.

Please refer to the screenshot attached. I am running the notebook on colab.

[Screenshot: artifacts]

BandaruMeghana commented 4 years ago

Hi,

I am following the book on the O'Reilly learning platform. I cannot proceed with the stand-alone chapters because the notebooks depend on data files, and I see no mention of how these data files are generated. I also cannot find them in the GitHub repo. If they have already been committed to Git, please point me to the right place.

Thank you, Meghana

hanneshapke commented 4 years ago

Hi @BandaruMeghana, Thank you for raising the issue. I will try to reproduce it on my end tomorrow.

Best regards, Hannes

hanneshapke commented 4 years ago

Hi @BandaruMeghana,

I am very sorry for my late reply. Thank you again for highlighting the issue. The output of the ingestion components changed in one of the recent TFX updates, and we failed to update this section. I apologize for that.

Previously, the CSVExampleGen component output a list of datasets (one list element per split name). Now it provides only one output, regardless of the number of splits. This simplifies the downstream consumption of the splits.

The ingestion code is still correct! However, example_gen.outputs['examples'].get() now always contains a single element, whose uri is the base path of the split datasets. To print the paths of the individual split datasets, you can use the code snippet below:

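import os
from tfx.types import artifact_utils

# example_gen is the CSVExampleGen component already run in the notebook.
for artifact in example_gen.outputs['examples'].get():
    base_uri = artifact.uri  # base path shared by all splits
    for split in artifact_utils.decode_split_names(artifact.split_names):
        print(os.path.join(base_uri, split))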

Please note that artifact.split_names is of type str. TFX provides the function tfx.types.artifact_utils.decode_split_names to convert the string-encoded list back into a proper Python list.
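To illustrate the decoding step without a running pipeline, here is a minimal sketch. It assumes split_names holds a JSON-encoded list such as '["train", "eval", "test"]', which is my understanding of how TFX stores it. The helper decode_split_names_sketch is a hypothetical stand-in, not the library function, and the base URI is invented for the example.

```python
import json
import os
from typing import List


def decode_split_names_sketch(split_names: str) -> List[str]:
    """Hypothetical stand-in for tfx.types.artifact_utils.decode_split_names.

    Assumes the split names are stored as a JSON-encoded list of strings.
    """
    if not split_names:
        return []
    return json.loads(split_names)


def split_paths(base_uri: str, split_names: str) -> List[str]:
    """Join the artifact's base URI with each decoded split name."""
    return [os.path.join(base_uri, name)
            for name in decode_split_names_sketch(split_names)]


# Illustrative values only; a real artifact supplies its own uri and split_names.
example_base_uri = "/tmp/pipeline/CsvExampleGen/examples/1"
example_split_names = '["train", "eval", "test"]'

paths = split_paths(example_base_uri, example_split_names)
for path in paths:
    print(path)
```

In a real notebook, use the TFX function itself on artifact.split_names, as in the snippet above. The stand-in only shows why a plain str has to be decoded before you can iterate over the splits.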

I hope this explanation helps. If it does, please close the issue and we'll move the solution to O'Reilly's errata page so that it is included in the next book update. Otherwise, please reply here.

BandaruMeghana commented 4 years ago

Hi Hannes,

Yeah, this helped me view my artifacts. Thank you!