danforthcenter / plantcv

Plant phenotyping with image analysis
Mozilla Public License 2.0

Is it possible to manually add metadata to a non-workflow job? #490

Closed gsainsbury86 closed 7 months ago

gsainsbury86 commented 4 years ago

Hi all! I've finally got some time to have a crack at implementing PlantCV at scale at The Plant Accelerator.

I've had a look at plantcv-workflow.py and, while I could probably get it working that way, I'd rather analyse the data in situ than export it from LemnaTec into a particular directory structure first.

Essentially, I'm querying the LemnaTec database to get metadata and image file paths, then reading in the image via an ssh/sftp connection to the image server and analysing it. It's all going well, but I'd like to be able to attach that metadata to the PlantCV output so that when it comes time to do statistical analysis, the data can be sorted through a bit more easily than by matching filenames or some such. (I should maybe be writing these directly to a database, but at the moment I'm just hacking it together.)
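
Roughly, the image-reading side looks something like the sketch below (the hostname, credentials and helper name are just placeholders for our own code):

    # Rough sketch only: hostname, credentials and the decode step are placeholders.
    import cv2
    import numpy as np
    import paramiko

    # Open an SFTP session to the image server
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect("image-server.example.org", username="phenomics")
    sftp = ssh.open_sftp()

    def read_remote_image(path):
        """Fetch an image over SFTP and decode the bytes into a BGR NumPy array."""
        with sftp.open(path, "rb") as fh:
            data = np.frombuffer(fh.read(), dtype=np.uint8)
        return cv2.imdecode(data, cv2.IMREAD_COLOR)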

I wasn't able to find an add_metadata() method anywhere. It looks like there are a few lines that do this within the job_builder, but that output is then written directly to a file.

        # Valid metadata
        for m in list(valid_meta.keys()):
            img_meta["metadata"][m]["value"] = meta[img][m]

When I print the results as JSON, there is an empty metadata dictionary. Essentially, I want to know how to populate it. It's entirely possible that there is a suggested way to do this and I just haven't found it. Basically, I'm missing the starred line below.

images_to_analyse <- query LT database for image path and metadata

for image in images_to_analyse
    pcv.analyse(image)
    **pcv.add_metadata(image, metadata)**
    pcv.print_results(filename)

HaleySchuhl commented 4 years ago

Hi @gsainsbury86 ,

We have definitely talked about the need for an add_metadata method for the Outputs class, especially since metadata isn't automatically collected unless you're running plantcv-workflow.py. @nfahlgren is doing some tweaking to the JSON formatting to more easily allow for multiple entities per image, and I'm sure we can incorporate an add_metadata method (hopefully before the v3.7 release we plan to do soon).

In the short term, one workaround is to use the add_observation method. That way, if you have a list of metadata, you can still get it into a CSV along with everything else recorded by print_results.

pcv.outputs.add_observation(variable='timestamp', trait='image timestamp', method='', scale='', datatype=str, value=metadata_list[0], label='')

pcv.outputs.add_observation(variable='treatment', trait='plant treatment', method='', scale='', datatype=str, value=metadata_list[1], label='')
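
Strung together in a loop, that workaround might look something like this sketch (query_lemnatec, fetch_image and analyze are stand-ins for your own code, and the exact add_observation arguments can vary a bit between PlantCV versions):

    from plantcv import plantcv as pcv

    # query_lemnatec(), fetch_image() and analyze() are placeholders for your own code
    for record in query_lemnatec():
        img = fetch_image(record["path"])   # e.g. read over ssh/sftp
        analyze(img)                        # your PlantCV analysis steps

        # Attach the metadata as ordinary observations so it ends up in the results file
        pcv.outputs.add_observation(variable='timestamp', trait='image timestamp',
                                    method='LemnaTec database', scale='datetime',
                                    datatype=str, value=str(record["timestamp"]), label='none')
        pcv.outputs.add_observation(variable='treatment', trait='plant treatment',
                                    method='LemnaTec database', scale='categorical',
                                    datatype=str, value=record["treatment"], label='none')

        # One results file per image; reset the collector before the next image
        # (pcv.outputs.clear() in recent PlantCV versions)
        pcv.print_results(filename=record["path"] + ".json")
        pcv.outputs.clear()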

nfahlgren commented 4 years ago

Hi @gsainsbury86!

A bit more background on our current approach. You are correct that the only entry point for metadata is currently through running plantcv-workflow.py. The process looks roughly like:

plantcv-workflow => metadata JSON files (1 per image) => user workflow => append observations to metadata JSON files => plantcv-workflow processes/aggregates metadata and observations into a single output JSON file.

It makes sense to me that, as in your use case, we need a way to bypass the first two steps so that people can batch-execute their workflows another way but still get metadata.

As @HaleySchuhl said, we coincidentally were discussing a way to add metadata to the observations collector anyway due to the need for a structure that can collect data for multiple objects in a single image (e.g. multiple plants, seeds, etc.). We have the framework diagrammed out, and I think we can implement it pretty quickly.

Could you provide us an example of the metadata you would be working with? At the moment we also have a pretty limited metadata vocabulary in place, but we would be happy to expand that as well.

gsainsbury86 commented 4 years ago

Hi @nfahlgren and @HaleySchuhl , thanks for the swift responses!

I'm still not exactly sure how our workflow automation/scaling is going to work. I like the idea of containerised workflows orchestrated by something a layer above; I'm attending a workshop/hackathon on https://www.nextflow.io/ next week and think it might be a nice approach because we could scale up to the cloud. If I do get it working, I'll be sure to share it.

In your description of the process, Noah, you said that it aggregates the metadata and observations into a single file: is that one file per plant/workflow, or one for the entire set? That's another thing I need to consider. Reading hundreds of JSON files into R or some such is fine but a bit clunky; however, it is nice to have a single file associated with a single image if we want to examine just one.

As for the metadata we're capturing, it's not particularly sophisticated at the moment; it's largely just treatment groups and LemnaTec spatial data. For instance, the first experiment that I want to test at scale has the following fields:

And we'll often have other treatments like salinity level, soil type, nitrogen level etc.

For testing purposes, @HaleySchuhl 's work around should be more than sufficient.

nfahlgren commented 4 years ago

Nice! Nextflow (or another workflow engine like Parsl, Snakemake, etc.) is part of our roadmap for upgrading plantcv-workflow.py.

[Screenshot: plantcv-workflow.py roadmap diagram, 2019-11-06]

Our main goals in using an existing workflow engine are 1) scalability and 2) interoperability across different infrastructures.

I like Nextflow a lot, but the one drawback for me is that we would potentially have to rewrite parts of plantcv-workflow.py in Groovy, depending on how we integrated with Nextflow. I didn't have much success in a brief go at Snakemake; I don't think it quite works for what we are trying to do (though maybe I was just doing something wrong). Parsl is a younger project, but I like what they are doing, and it's a bit easier to embed in Python. They just released v0.9, which has some features I was waiting for.

But definitely let us know what your experience is with Nextflow or other methods!

The current setup ends up with a single aggregated JSON file; the intermediate JSON files can be discarded after aggregation. We also provide the plantcv-utils.py json2csv utility to convert the big JSON file into two CSV files: one for single-value measurements (e.g. area, height, etc.) and one for multi-value measurements (e.g. hue frequency distribution, etc.). Both tables include the same metadata and can be merged together in R as needed.
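
For example, something along these lines (the flag names below are from memory, so check plantcv-utils.py json2csv --help for the exact options):

    # Split the aggregated JSON into single-value and multi-value trait tables,
    # using "results" as the output file prefix
    plantcv-utils.py json2csv -j output.json -c results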

Thanks for the metadata; we will incorporate those fields into the built-in vocabulary. We have similar information, but it's structured a bit differently, so I think I will need to do a bit of work to make it more inclusive. That being said, I would envision that the add_metadata function would allow the user to add metadata not covered by the predefined vocabulary.

gsainsbury86 commented 4 years ago

Oh great! Glad to hear you're looking into a bunch of options here; I think some combination of these should be able to get us where we want to be. Because it'd been a while since I'd looked into things, I was using just the single-image pipeline code and doing the automation/workflow stuff outside of PlantCV. It was only while looking for how to add metadata that I even realised you'd built plantcv-workflow.py.

plantcv-utils.py sounds super useful also. I'll have a go with that today!

Yeah, our metadata tends to be specific to a single plant, as opposed to the experiment level in general. Though I've just realised that I neglected to mention things like camera labels and timestamps as metadata we also capture. I fall into the trap of thinking of 'metadata' as the stuff we attach to LemnaTec analysis jobs, whereas lots of other information that we collect and analyse is also metadata.