Project-AgML / AgML

AgML is a centralized framework for agricultural machine learning. It provides access to public agricultural datasets for common deep learning tasks, with standard benchmarks and pretrained models, as well as the ability to generate synthetic data and annotations.
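For context, loading one of these public datasets looks roughly like the snippet below. This is a minimal sketch based on the loader interface shown in the project README; the dataset name is just an example, and exact return types depend on the task.

```python
import agml

# Load a public dataset by name; it is downloaded and cached locally on
# first use. 'apple_flower_segmentation' is one example dataset name.
loader = agml.data.AgMLDataLoader('apple_flower_segmentation')

# The loader can be indexed and iterated, yielding image/annotation pairs
# (the exact annotation format depends on the task).
image, annotation = loader[0]
```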

Community `data` contributions API #33

Open masonearles opened 1 year ago

masonearles commented 1 year ago

We're opening this issue to discuss how to enable easy yet high-quality data contributions to AgML. This was raised initially in Issue 15. If you are interested in contributing to this discussion and code development, let's have that conversation below.

geezacoleman commented 1 year ago

What you're doing here with AgML is really awesome, and will make using image datasets for testing/developing so much easier! I think this was briefly mentioned some time ago, but it would be great to form a connection between Weed-AI and AgML for the weeds image side of things.

Weed-AI now supports annotation through CVAT, so unannotated data can be annotated publicly before being uploaded to the platform. We've also worked on establishing agricultural metadata reporting standards for weeds, called AgContext, so each dataset carries information on where and how it was collected. There are also version control and dataset editing functions. One limitation is that it is currently only for weeds, not all the various forms of image data used in agriculture.

Helping build the API is a little beyond my skillset, but if it's something of interest I'd be happy to help in some other way. At the very least it might help make the connection between annotation/upload > standardised metadata > use/editing > download.
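For illustration only, a contribution record that bundles an upload with AgContext-style collection metadata might look something like the dictionary below. The field names are hypothetical placeholders, not the actual AgContext schema, which is defined by Weed-AI.

```python
# Hypothetical example of a contribution record pairing a dataset upload
# with collection metadata. Field names are illustrative placeholders,
# not the real AgContext schema.
contribution = {
    "dataset_name": "annual_ryegrass_weeds_2021",   # hypothetical dataset
    "task": "object_detection",
    "annotation_format": "coco",
    "collection_metadata": {                        # AgContext-style info
        "crop": "wheat",
        "location": "NSW, Australia",
        "camera_height_m": 1.2,
        "growth_stage": "seedling",
    },
    "license": "CC-BY-4.0",
    "version": "1.0.0",
}
```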

KeynesYouDigIt commented 2 months ago

@amogh7joshi / @masonearles where is the S3 bucket located where the data goes now? I can for sure build an API no sweat, but I need a target :)

Also, for you two plus @geezacoleman / other users:

(To be clear, your answers are going to depend on your use case and datasets, so just answer for what you know!)

1) What are the MOST important data quality specs, if any? When should we say about a dataset, "this is good enough to accept and use"? (One possible automated check is sketched after this list.)

2) What formats is data typically uploaded in?

3) What formats would we like download to support?

4) What else should I know about the most minimal version of this API?

Excited to get started!
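As one possible starting point for question 1, a lightweight automated check that runs before any human review might look like the sketch below. The function name, thresholds, and required fields are assumptions open for discussion, not an agreed spec.

```python
import os

# Hypothetical minimal quality gate: checks that would run automatically
# before an admin ever reviews a contribution. Thresholds and required
# metadata fields are placeholders, not a fixed spec.
REQUIRED_METADATA = {"dataset_name", "task", "annotation_format", "license"}
MIN_IMAGES = 50  # assumed floor; open for discussion

def validate_contribution(image_dir: str, metadata: dict) -> list:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []

    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        problems.append(f"missing metadata fields: {sorted(missing)}")

    images = [f for f in os.listdir(image_dir)
              if f.lower().endswith((".jpg", ".jpeg", ".png"))]
    if len(images) < MIN_IMAGES:
        problems.append(f"only {len(images)} images found (minimum {MIN_IMAGES})")

    return problems
```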

KeynesYouDigIt commented 2 months ago

To be clear, are we just looking for an automated API to do all of this? Does the data just live here in the repo? https://github.com/Project-AgML/AgML/blob/main/CONTRIBUTING.md

KeynesYouDigIt commented 2 months ago

Perhaps this should be our target landing spot? https://www.tensorflow.org/datasets
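For reference, the TFDS pattern linked above lets users pull any registered dataset by name through `tfds.load`, which is the real TFDS API; whether and how AgML datasets would be registered there is the open question. The dataset name below is just a stand-in.

```python
import tensorflow_datasets as tfds

# The TFDS pattern: any registered dataset is loadable by name, with
# splits and metadata handled by the catalog. 'mnist' is only a stand-in
# for whatever an AgML dataset would be registered as.
ds, info = tfds.load('mnist', split='train', with_info=True)
print(info.features)
```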

masonearles commented 2 months ago

@KeynesYouDigIt As mentioned offline, the data currently lives in a publicly readable S3 bucket; write access is managed by admins. It would be great to create a pipeline for AgML users to contribute data, but with a gate where an admin runs a QA check before anything is uploaded to the S3 bucket.
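A minimal sketch of that gated flow, assuming boto3 and hypothetical bucket/prefix names (the actual bucket is not named in this thread): contributors write into a staging prefix, and only an admin copies approved datasets into the public prefix after the QA check.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefixes; the real bucket name and layout are
# managed by the AgML admins and not specified in this thread.
BUCKET = "agml-data-example"
STAGING_PREFIX = "staging/"
PUBLIC_PREFIX = "public/"

def stage_contribution(local_path: str, dataset_name: str) -> str:
    """Contributor-facing step: upload an archive into the staging area."""
    key = f"{STAGING_PREFIX}{dataset_name}.tar.gz"
    s3.upload_file(local_path, BUCKET, key)
    return key

def promote_contribution(dataset_name: str) -> str:
    """Admin-only step, run after the QA check passes: copy the archive
    from staging into the publicly readable prefix."""
    src_key = f"{STAGING_PREFIX}{dataset_name}.tar.gz"
    dst_key = f"{PUBLIC_PREFIX}{dataset_name}.tar.gz"
    s3.copy_object(Bucket=BUCKET, Key=dst_key,
                   CopySource={"Bucket": BUCKET, "Key": src_key})
    return dst_key
```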