danforthcenter / plantcv

Plant phenotyping with image analysis
Mozilla Public License 2.0

How to specify source files and validate data when using process_results and json2csv #1516

Closed · connornelle closed this issue 4 months ago

connornelle commented 6 months ago

Hello,

I have been running a PlantCV workflow on an HPC cluster, and it currently outputs results files the same way local parallelization would (however, for the Slurm array I am using a single-image workflow over the whole array of images). The results files are named like "IMG_2_IGP0001_results.txt" and "IMG_2_IGP0002_results.txt", for example. There are 212 images. I run the following code over the results directory.

```python
import os
import sys, traceback
from plantcv import plantcv as pcv
from plantcv.parallel import WorkflowInputs
from plantcv import parallel
import matplotlib
from matplotlib import pyplot as plt

parallel.process_results(job_dir="/home/m18c364/ondemand/data/sys/myjobs/projects/default/19/results",
                         json_file="combined_output.txt")
```

This does output a very long file, and I think this is where I am losing the individual image markers, because after I run the json2csv code locally in the terminal, the wide-format CSV file only has one row, called "default_1". I think this is probably just a simple thing I have to add to the function, but I haven't found it in the documentation. I have attached the files too. Thank you! combined_output.txt output.csv-multi-value-traits.csv output.csv-single-value-traits.csv
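
For reference, a minimal sketch of the terminal conversion step described above, assuming a `json2csv` converter is exposed in `plantcv.utils` and takes `json_file`/`csv_file` arguments (check the json2csv docs for your PlantCV version):

```python
# Minimal sketch of the json2csv conversion step, under the assumptions above.
from plantcv.utils import json2csv

# With csv_file="output.csv" this writes output.csv-single-value-traits.csv
# and output.csv-multi-value-traits.csv, matching the attached files.
json2csv(json_file="combined_output.txt", csv_file="output.csv")
```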

HaleySchuhl commented 6 months ago

Hi @connornelle , thanks for opening this issue. Since you are parallelizing without using plantcv-run-workflow, it looks like your outputs have no metadata. When json2csv creates the CSV file, there is no metadata (such as the image file name) to use as a unique data frame key. To resolve this, I believe you can use the pcv.outputs.add_metadata method that we recently added to store the image filename under the term "filepath".
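
A minimal sketch of how that could look inside the single-image workflow, assuming the documented `add_metadata(term, datatype, value)` signature; the image path variable below is a placeholder for however your script receives the current image in the Slurm array:

```python
import os
from plantcv import plantcv as pcv

# Placeholder: however the Slurm array job passes the current image path
image_path = "/path/to/IMG_2_IGP0001.JPG"

# Store the image filename as metadata so each results file carries a unique key
# that json2csv can use to separate rows.
pcv.outputs.add_metadata(term="filepath", datatype=str, value=image_path)

# ... analysis steps that populate pcv.outputs ...

# Save the per-image results file as before
pcv.outputs.save_results(filename=os.path.splitext(os.path.basename(image_path))[0] + "_results.txt")
```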

connornelle commented 6 months ago

Thank you! This appears to be working well. I found that implementing the single-image workflow in parallel on our cluster was much easier than trying to use the built-in parallelization. Is there a resource on using the built-in version? I didn't know where to start getting that running.

HaleySchuhl commented 6 months ago

Awesome!

I believe the best documentation page for our parallelization, which details how to set up a configuration file, is here. We'll likely add a Scribe doc page as well, since we are finding their formatting really useful for processes that involve switching between multiple applications.
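
For anyone landing here later, a rough sketch of the built-in approach, assuming the WorkflowConfig class described on that page; the paths, metadata terms, and cluster setting below are illustrative only and should be checked against the parallel workflow docs:

```python
from plantcv import parallel

# Create a configuration object and fill in the main fields
config = parallel.WorkflowConfig()
config.input_dir = "/path/to/images"                   # directory of raw images
config.workflow = "/path/to/single_image_workflow.py"  # the single-image workflow script
config.json = "combined_output.json"                   # aggregated results file
config.filename_metadata = ["camera", "id"]            # hypothetical terms parsed from filenames
config.cluster = "SLURMCluster"                        # dispatch jobs through Slurm

# Write the configuration to disk; it can then be edited by hand and run with
# `plantcv-run-workflow --config my_config.json`
config.save_config(config_file="my_config.json")
```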