StanfordHCI / bang

💥 Helping people meet for the first time, more than once 💥
MIT License

Create data pipeline for Google Colab #399

Closed tawnzz closed 5 years ago

tawnzz commented 5 years ago

We should create a way to funnel the existing data from the batches (the Mongo database) onto Google Colab (our collaboration notebook). Right now we're downloading JSON data from specific batches, but this method will become really inefficient once we start collecting large batches of data.

This might require permission/keys so that we can access the data on Bang.

Should discuss this further! @markwhiting @deliveryweb

deliveryweb commented 5 years ago

Please tell us more about the format. How do you envision it inside Colab? Can you give us an example?

markwhiting commented 5 years ago

I think that all that needs to happen on the @deliveryweb side is making it possible for us to log into the DB from python in colab. This may already be possible with the existing mongo setup, but it will be accessed remotely from a different machine.

deliveryweb commented 5 years ago

Okay. But to clarify: we have a `loadBatchResult` endpoint that returns all batch data with surveys. Maybe you can use it from Python? Or do you need all of the DB data? (screenshot attached: 2019-07-31 16-33-11)

markwhiting commented 5 years ago

@tonyanguyen why not try that endpoint and see if it does everything you need?

markwhiting commented 5 years ago

Ideally we would have a way to get all data without knowing batch ids, or at least, have a way to see all the batch ids from an endpoint, so we could then call this one with each id.

deliveryweb commented 5 years ago

We have an endpoint that loads the batch list (4–5 fields per batch), including the ID, of course.

tawnzz commented 5 years ago

Hey @deliveryweb,

TL;DR: We can get data through multiple endpoints and are working out a plan to make it clean and efficient. However, we'd like you to make a SINGLE endpoint with all the data, since it may be easier for us if we don't have to iterate.

Thanks for pointing out this endpoint. The GET request works, but we're worried it might take a long time to iterate through all the batch lists.

The current solution we're considering is:

1. Iterate through the batch list and send a GET request for each ID.
2. Store all the batches we've already fetched.
3. Then, on a weekly basis, check the batch list for new batches and repeat.
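The plan above can be sketched roughly as follows. This is a minimal sketch, not the actual pipeline: it assumes each batch-list entry is a dict with an `id` field (hypothetical), and it stands in for the real GET request with a `fetch_batch` callable.

```python
import json
from pathlib import Path

def new_batch_ids(batch_list, seen_ids):
    """Return IDs from the batch list that we haven't stored yet.

    Assumes (hypothetically) each batch-list entry has an 'id' field.
    """
    return [b["id"] for b in batch_list if b["id"] not in seen_ids]

def sync(batch_list, fetch_batch, cache_dir="batches"):
    """Fetch and store only the batches we haven't seen before.

    fetch_batch(batch_id) -> dict wraps whatever GET request we end up using.
    Already-fetched batches are cached as one JSON file per batch ID, so a
    weekly re-run only pulls new batches.
    """
    cache = Path(cache_dir)
    cache.mkdir(exist_ok=True)
    seen = {p.stem for p in cache.glob("*.json")}  # IDs already on disk
    fetched = []
    for batch_id in new_batch_ids(batch_list, seen):
        data = fetch_batch(batch_id)
        (cache / f"{batch_id}.json").write_text(json.dumps(data))
        fetched.append(batch_id)
    return fetched
```

Re-running `sync` with the same batch list fetches nothing, which is the property the weekly step relies on.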

What do you think?

deliveryweb commented 5 years ago

All queries must include an `admintoken` header (with the right value).

- `GET https://bang-prod.deliveryweb.ru:3001/api/admin/batches/?full=true` for all batches
- `GET https://bang-prod.deliveryweb.ru:3001/api/admin/batch-result/:id/` for one batch
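From Colab, those two calls could be wrapped with nothing beyond the standard library, e.g. as below. This is only a sketch: the `admintoken` value has to be supplied out of band, and the endpoints return whatever JSON the Bang server defines.

```python
import json
import urllib.request

BASE = "https://bang-prod.deliveryweb.ru:3001/api/admin"

def admin_request(path, token):
    """Build a GET request carrying the required admintoken header."""
    return urllib.request.Request(f"{BASE}{path}", headers={"admintoken": token})

def fetch_json(path, token):
    """Fetch one endpoint and decode its JSON body."""
    with urllib.request.urlopen(admin_request(path, token)) as resp:
        return json.loads(resp.read().decode())

# all batches:
#   batches = fetch_json("/batches/?full=true", ADMIN_TOKEN)
# one batch:
#   batch = fetch_json(f"/batch-result/{batch_id}/", ADMIN_TOKEN)
```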

tawnzz commented 5 years ago

Thank you @deliveryweb !