CoVital-Project / pulse-ox-data-collection-web-service

HTTPS API for receiving pulse oximetry data from mobile clients
https://covital.org
GNU General Public License v3.0

Need a rough architecture plan for post-upload data pipelines for our initial ML training #9

Closed dpritchett closed 4 years ago

dpritchett commented 4 years ago

@YoniSchirris we need your help understanding what we're going to need to do with our datasets once they're uploaded to a cloud database:

It looks like we're definitely going to have a JSON payload representing our raw data collected as described in #6.

We may also have raw video uploaded to e.g. S3 for after-the-fact analysis (see #8).

haggy commented 4 years ago

What input formats do we need to be prepared to provide for you?

This is one thing that we really need to focus on, as it determines how easily we can make changes in the future. At a minimum, I highly recommend modeled, structured data whenever possible (instead of JSON blobs). This of course excludes video content, but any data that we're receiving from the field in the form of JSON should be modeled up front with a basic schema. If something like Parquet is chosen as the data format (big +1 there if possible), then modeled data is pretty much mandatory.
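To illustrate the "modeled data up front" idea, here's a minimal sketch of what a typed schema for the incoming field data could look like. The field names are hypothetical (the real payload is defined in #6); the point is that a flat, typed record like this maps directly onto Parquet columns, whereas an opaque JSON blob does not.

```python
# Hypothetical schema sketch for a single field reading -- field names
# are illustrative, not the actual payload from issue #6.
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class PulseOxReading:
    device_id: str              # anonymized client identifier
    timestamp: str              # ISO-8601 capture time
    spo2_percent: float         # reading from the reference pulse oximeter
    heart_rate_bpm: float
    video_uri: Optional[str] = None  # optional pointer to raw video in e.g. S3

reading = PulseOxReading(
    device_id="abc123",
    timestamp="2020-04-01T12:00:00Z",
    spo2_percent=97.0,
    heart_rate_bpm=72.0,
)
# Each dataclass field becomes one typed column when writing Parquet
# (e.g. via pyarrow), instead of a schemaless JSON string.
print(asdict(reading))
```

Validation could then happen at ingest time, so malformed field data is rejected before it ever reaches the training pipeline.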

YoniSchirris commented 4 years ago

Where is your ML system going to be running?

How do we need to feed it training data?

What input formats do we need to be prepared to provide for you?

MalcolmMielle commented 4 years ago

I agree with everything Yoni said :).

OpenCV could also be used, depending on the type of algorithm we end up with. If we use a knowledge-based method, then pytorch might be overkill. OpenCV works in a similar way to the pytorch implementation with ffmpeg and can read mp4 and avi, so either video format works in any case. That said, avi seems to be the only format well supported on both Linux and Mac; I remember reading that somewhere.


I don't know if this is relevant to the discussion, but I think the final dataset should be saved to either Zenodo or archive.org, just so that it's available to others who want to develop their own algorithms. Would the web service act as a sort of link between data collection and the training/calibration of the method being developed? I've never worked with this kind of service before, so I'm trying to understand the architecture of the project.

MalcolmMielle commented 4 years ago

I don't know if it's relevant either, but I was thinking of structuring the dataset like this: https://www.overleaf.com/read/kwfmchzmmgtm

Would that work for the backend?

gianlucatruda commented 4 years ago

Hi everyone :) This is just a brief note to say that I'm currently creating a preliminary dataset so we can have an idea of structure and start working on the models.

I'm in lockdown, so I'm using my iPhone and a commercial pulse oximeter to take measurements of myself and two family members. I'll be posting an update soon about how to access this data and how it's structured, at which point feedback will be welcome.

haggy commented 4 years ago

@MalcolmMielle @YoniSchirris this discussion has been idle for a while. Can we close it out?

MalcolmMielle commented 4 years ago

@haggy Would this issue be relevant to the conversation we just had with Dario?

haggy commented 4 years ago

Might be a bit of a stretch. I see this issue being more about the ML-specific data pipeline, not really exposing our data for external use. I'd rather have Dario open an issue with all the requirements he has for that.

MalcolmMielle commented 4 years ago

Ok, sure, let's close it then.