Closed by dpritchett 4 years ago
What input formats do we need to be prepared to provide for you?
This is one thing that we really need to focus on, as it can define how easily we can make changes in the future. At a minimum, I highly recommend modeled, structured data whenever possible (instead of raw JSON blobs). This of course excludes video content, but any data we receive from the field as JSON should be modeled up front with a basic schema. If something like Parquet is chosen as the data format (big +1 there if possible), then modeled data is pretty much mandatory.
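To illustrate the "model it up front" point, here's a minimal sketch of validating a raw JSON blob against a schema at ingest time. The field names (`device_id`, `spo2`, etc.) are hypothetical placeholders, not the agreed-upon data model:

```python
import json
from dataclasses import dataclass

# Hypothetical schema for one field reading -- the real fields
# would come from whatever data model we agree on.
@dataclass
class PulseReading:
    device_id: str
    timestamp: float
    spo2: int        # blood oxygen saturation, percent
    heart_rate: int  # beats per minute

def parse_reading(blob: str) -> PulseReading:
    """Validate a raw JSON blob against the schema up front,
    so malformed records fail at ingest rather than downstream."""
    data = json.loads(blob)
    return PulseReading(
        device_id=str(data["device_id"]),
        timestamp=float(data["timestamp"]),
        spo2=int(data["spo2"]),
        heart_rate=int(data["heart_rate"]),
    )

raw = '{"device_id": "phone-01", "timestamp": 1588000000.0, "spo2": 98, "heart_rate": 72}'
reading = parse_reading(raw)
print(reading.heart_rate)  # 72
```

A typed record like this also maps directly onto a Parquet schema later, whereas a free-form blob doesn't.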
Where is your ML system going to be running?
How do we need to feed it training data?
/inference
An API (Flask on EC2, or Lambda) that takes either a video file or an ID to a video file in a DB that we can pull -- either way works for me. If we go hard on ML and we get a lot of data, we can, on top of that, retrain it on a daily basis during the night. For this we can write a cron job that runs a training algo on an EC2 instance, saves the model, and places it in S3.
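A minimal sketch of the two entry points described above (direct video bytes vs. an ID resolved against storage). The store and function names are hypothetical, and `run_model` is a stub standing in for the real inference:

```python
# Sketch only: FAKE_STORE stands in for the S3/DB lookup,
# and run_model for the actual pytorch/OpenCV inference.
FAKE_STORE = {}

def run_model(video_bytes: bytes) -> dict:
    # placeholder for real inference; just reports payload size here
    return {"bytes_seen": len(video_bytes)}

def infer(video_bytes=None, video_id=None) -> dict:
    """Accept either raw video bytes or an ID we resolve ourselves."""
    if video_bytes is None:
        if video_id not in FAKE_STORE:
            raise KeyError(f"unknown video id: {video_id}")
        video_bytes = FAKE_STORE[video_id]
    return run_model(video_bytes)

FAKE_STORE["vid-001"] = b"\x00" * 1024
print(infer(video_id="vid-001"))   # {'bytes_seen': 1024}
print(infer(video_bytes=b"abc"))   # {'bytes_seen': 3}
```

Either path ends at the same `run_model` call, so the nightly-retrained model can be swapped in behind one interface.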
I agree with everything Yoni said :).
OpenCV could also be used depending on the type of algorithm we use. If we end up using a knowledge-based method, then pytorch might be overkill. It works in a similar way to the pytorch implementation with ffmpeg and can read mp4 and avi, so in any case those video formats work. AVI seems to be the only format supported well on both Linux and Mac though; I remember reading that somewhere.
I don't know if this is relevant to the discussion, but I think the final dataset should be saved to either Zenodo or archive.org, just so that it's available to others who want to develop their own algorithms. Would the web service act as a sort of link to the training/calibration of the method we develop? I've never worked with this kind of service before, so I'm trying to understand the architecture of the project.
I don't know if it's relevant either, but I was thinking of structuring the dataset like this: https://www.overleaf.com/read/kwfmchzmmgtm
Would that work for the backend?
Hi everyone :) This is just a brief note to say that I'm currently creating a preliminary dataset so we can have an idea of structure and start working on the models.
I'm in lockdown, so I'm using my iPhone and a commercial pulse oximeter to take measurements of myself and 2 family members. I'll be posting an update soon about how to access this data and how it's structured—at which point, feedback will be welcome.
@MalcolmMielle @YoniSchirris this discussion has been idle for a while. Can we close it out?
@haggy Would that issue be relevant to the conversation we had with Dario just now?
Might be a bit of a stretch. I see this issue being more about the ML-specific data pipeline, not really exposing our data for external use. I'd rather have Dario open an issue with all the requirements he has for that.
Ok sure let's close it then
@YoniSchirris we need your help understanding what we're going to need to do with our datasets once they're uploaded to a cloud database:
It looks like we're definitely going to have a JSON payload representing our raw data collected as described in #6.
We may also have raw video uploaded to e.g. S3 for after-the-fact analysis (see #8).
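Since the JSON payload and the raw video for the same capture will need to be joined later, one option is deterministic, matching S3 keys. This is only a sketch; the bucket layout and field names are assumptions, not a decided convention:

```python
# Hypothetical key layout pairing a raw JSON record (#6) with
# its raw video (#8) so the two are trivially joinable later.

def s3_key_for_payload(payload: dict) -> str:
    """Derive a deterministic key for a raw JSON record."""
    return f"raw/json/{payload['device_id']}/{int(payload['timestamp'])}.json"

def s3_key_for_video(device_id: str, timestamp: int) -> str:
    """Matching key for the raw video from the same capture."""
    return f"raw/video/{device_id}/{timestamp}.mp4"

payload = {"device_id": "phone-01", "timestamp": 1588000000}
print(s3_key_for_payload(payload))            # raw/json/phone-01/1588000000.json
print(s3_key_for_video("phone-01", 1588000000))  # raw/video/phone-01/1588000000.mp4
```

The actual upload would then be a plain `put_object` per key with boto3 or similar.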