Need a rough architecture plan for post-upload data pipelines for our initial ML training

dpritchett commented 4 years ago

@YoniSchirris we need your help understanding what we're going to need to do with our datasets once they're uploaded to a cloud database:

Where is your ML system going to be running?
How do we need to feed it training data?
What input formats do we need to be prepared to provide for you?

It looks like we're definitely going to have a JSON payload representing our raw data collected as described in #6.

We may also have raw video uploaded to e.g. S3 for after-the-fact analysis (see #8).

haggy commented 4 years ago

What input formats do we need to be prepared to provide for you?

This is one thing that we really need to focus on as it can define how easily we can make changes in the future. At a minimum, I highly recommend modeled, structured data whenever possible (instead of JSON blobs). This of course excludes video content but any data that we're receiving from the field in the form of JSON should be able to be modeled up front with a basic schema. If something like Parquet is chosen as the data format (big +1 there if possible) then modeled data is pretty much mandatory.

YoniSchirris commented 4 years ago

Where is your ML system going to be running?

If simple model suffices, frontend. If more complex model required, backend. If on backend, we can have an EC flask server for this, or an AWS lambda function that either 1. calculates or 2. pulls the pickled model from an S3 bucket and uses that.

How do we need to feed it training data?

If it's a relatively simple model, it need not be continuously trained. We will have an /inference API (flask EC or lambda) that takes a video file or an ID to a videofile in a DB that we can pull -- either way works for me. If we go hard ML and we get a lot of data, we can, on top of it, train it on a daily basis during the night. For this we can write a cron job that runs a training algo on an EC2 that saved the model and places it in an S3

What input formats do we need to be prepared to provide for you?

I'm expecting to use pytorch's video loading: https://pytorch.org/docs/stable/_modules/torchvision/io/video.html#read_video. I think this can handle the most standard video formats: mp4, avi... it uses http://docs.mikeboers.com/pyav/develop/api/video.html which uses https://ffmpeg.org/. I never worked with video transfers before, let me know if you need more information.

MalcolmMielle commented 4 years ago

I agree with everything Yoni said :).

OpenCV could also be used depending on the type of algorithm we used. If we end up using a knowledge-based method, then pytorch might be overkill. It works in a similar way to the pytorch implementation with ffmpeg and can read mp4 and avi so, in any case, those video format work. Avi seems to be the only format supported well on both Linux and Mac thought, I remember reading this somewhere.

I don't know if that is relevant to the discussion but I think the final dataset should be saved to either Zenodo or archive.org, just so that it's available to others who want to develop their algorithms. Would the webservice be a sort of link between the training/calibration of the method developed? I've never worked with this kind of service before, so I'm trying to understand the architecture of the project.

MalcolmMielle commented 4 years ago

I don't know if it's relevant either but I was thinking of structure the dataset like this: https://www.overleaf.com/read/kwfmchzmmgtm

Would that work for the backend?

gianlucatruda commented 4 years ago

Hi everyone :) This is just a brief note to say that I'm currently creating a preliminary dataset so we can have an idea of structure and start working on the models.

I'm in lockdown, so I'm using my iPhone and a commercial pulse oximeter to take measurements of myself and 2 family members. I'll be posting an update soon about how to access this data and how it's structured—at which point, feedback will be welcome.

haggy commented 4 years ago

@MalcolmMielle @YoniSchirris this discussion has been idle for a while. Can we close it out?

MalcolmMielle commented 4 years ago

@haggy Would that issue be relevant to the conversation we had with dario right now?

haggy commented 4 years ago

Might be a bit of a stretch. I see this issue being more about the ML-specific data pipeline, not really exposing our data for external use. I'd rather have Dario open an issue with all the requirements he has for that.

MalcolmMielle commented 4 years ago

Ok sure let's close it then

CoVital-Project / pulse-ox-data-collection-web-service

Need a rough architecture plan for post-upload data pipelines for our initial ML training #9