Closed (by @Meiqingx, 12 months ago)
Thanks @Meiqingx!
I can see pros and cons to both, and so I'm genuinely interested to hear what @SebastianZimmeck thinks. Some specific questions that would help me in coming to a decision are as follows: (1) method 1 is more code and effort on the part of users, but also seems straightforward in how to use and modify -- do you agree? (2) method 2 is more hands off, but can anyone take the settings of the S3 and modify them if they want? Or are they locked into the pre-configuration? Finally, (3) can you say a little more about utilization of method 2 -- can anyone take the settings and use it such that they are responsible for paying for their own processing?
@SebastianZimmeck @efranklinfowler Your input would be greatly appreciated.
My thoughts at the moment:
Maybe I can ask, @Meiqingx: is method 1 just an implementation task or a research task? If the former, it seems to me that we could optimize the time and resources we spend by using method 2. On the other hand, if method 1 is a research task and we wanted to publish a paper based on our own custom method, we could go with that.
For method 2, would we just say to people "you need AWS CDK for running this code, and here is how you set it up" or would we provide the AWS? If the latter, we may have an issue with costs here.
A relevant issue: My thoughts on notebooks versus Object-Oriented-Programming: @SebastianZimmeck @efranklinfowler
I think the most user-friendly approach would be to build the basic modules as Python classes and, alongside them, provide a notebook for tutorials (example software developed by social scientists: a few object-oriented modules plus a tutorial to import and apply them in notebook format).
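To make the "classes plus tutorial notebook" idea concrete, here is a minimal sketch. The class name, method names, and label format are all hypothetical placeholders (loosely modeled on Rekognition-style label output), not taken from the actual repos:

```python
# Hypothetical sketch: a small class-based module that a tutorial
# notebook could import. All names here are illustrative only.

class VideoLabelExtractor:
    """Wraps one pipeline task so a notebook only needs a few lines."""

    def __init__(self, min_confidence: float = 80.0):
        # Labels below this confidence score are dropped.
        self.min_confidence = min_confidence

    def filter_labels(self, raw_labels):
        """Keep only label names at or above the confidence threshold."""
        return [
            lab["Name"]
            for lab in raw_labels
            if lab["Confidence"] >= self.min_confidence
        ]

# In a tutorial notebook, usage would then be just:
extractor = VideoLabelExtractor(min_confidence=90.0)
labels = extractor.filter_labels(
    [{"Name": "Person", "Confidence": 99.2},
     {"Name": "Car", "Confidence": 71.5}]
)
print(labels)  # ['Person']
```

The point is that the class hides the task's mechanics, while the notebook stays a short, readable walkthrough.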
Let me know if I should open a different issue for this
Unassigning @bella-tassone from this issue as she already has enough on her plate (though, feel free to chime in, @bella-tassone).
@efranklinfowler, on your questions:
> (1) method 1 is more code and effort on the part of users, but also seems straightforward in how to use and modify -- do you agree?
I would think that generally it is more code and effort. It may be that it can be simplified and broken down. It depends on the exact details.
> (2) method 2 is more hands off, but can anyone take the settings of the S3 and modify them if they want? Or are they locked into the pre-configuration?
If people would be using their own AWS as opposed to ours (which I assume), they could make their own configurations. I would think that all configuration options available on AWS would be open to them, minus the ones that our code assumes (if any).
> (3) can anyone take the settings and use it such that they are responsible for paying for their own processing?
Yes, I think that is the case.
> A relevant issue: My thoughts on notebooks versus Object-Oriented-Programming
I do not have any particular preferences. I would say we should pick whatever lends itself best for the task at hand and not shoehorn it into any technologies that are ill-fitting.
> I think the most user friendly approach would be to build the basic modules in python objects (classes), and meanwhile provide a notebook for tutorials
Generally, that may work. What I would want to avoid is to ask people to download x, set up y, and also get z, and so on. We should not overwhelm people with lots of different technologies and tasks to do. And whatever we pick should be easy to use and allow our target audience (academics who know a little bit about handling data and interested general audience members?) to make sense of and run the code.
> Let me know if I should open a different issue for this
We can discuss the plan here. Once we are more clear, please feel free to close this issue and open individual issues on the execution of the different steps of the plan.
> Thanks @Meiqingx!
>
> I can see pros and cons to both, and so I'm genuinely interested to hear what @SebastianZimmeck thinks. Some specific questions that would help me in coming to a decision are as follows: (1) method 1 is more code and effort on the part of users, but also seems straightforward in how to use and modify -- do you agree? (2) method 2 is more hands off, but can anyone take the settings of the S3 and modify them if they want? Or are they locked into the pre-configuration? Finally, (3) can you say a little more about utilization of method 2 -- can anyone take the settings and use it such that they are responsible for paying for their own processing?
1) Yeah, I also feel like method 1 is more intuitive for an average user and perhaps easier to modify. 2) My understanding of method 2 is that users are locked into their pre-configuration, but among the tasks available in the pre-defined pipeline, they can choose which task or combination of tasks to perform. They don't have to perform all of them. But both of these approaches would require users to set up their own AWS credentials and pay for their own processing.
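The "fixed pipeline, user-selected tasks" idea could look something like the sketch below. The task names and registry are hypothetical placeholders; in practice each entry would call into the pre-configured AWS pipeline rather than return a string:

```python
# Hypothetical sketch of Method 2's task selection: the set of
# pipeline stages is fixed, but a user picks which ones to run.
# Task names are placeholders, not the real pipeline's.

TASKS = {
    "label_detection": lambda video: f"labels for {video}",
    "text_detection": lambda video: f"text for {video}",
    "face_detection": lambda video: f"faces for {video}",
}

def run_pipeline(video, selected_tasks):
    """Run only the chosen subset of the pre-configured tasks."""
    unknown = set(selected_tasks) - TASKS.keys()
    if unknown:
        raise ValueError(f"not part of the pre-configured pipeline: {unknown}")
    return {name: TASKS[name](video) for name in selected_tasks}

# Run two of the three available tasks on one video:
results = run_pipeline("ad_001.mp4", ["label_detection", "text_detection"])
```

So users cannot add new stages, but they are not forced to pay for stages they do not need.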
Maybe, come up with a first approximation of an implementation for a solution (using method 1, it sounds like), @Meiqingx, to see how it goes?
> My thoughts at the moment:
>
> Maybe I can ask, @Meiqingx: is method 1 just an implementation task or a research task? If the former, it seems to me that we could optimize the time and resources we spend by using method 2. On the other hand, if method 1 is a research task and we wanted to publish a paper based on our own custom method, we could go with that.
I do think it would be ideal if we could turn the pipeline itself into a research project (for example, this project used the AWS Rekognition service in a specific subject domain and was published in Nature Methods).
However, I speculate Method 2 could potentially turn into a research task too, conditional on significant changes to the pipeline, which may or may not take more effort to build than Method 1 (this example is still from Amazon, but shows that (maybe?) we could potentially build our own CDK stack for political ads analysis from scratch).
That said, I'm open to prioritizing implementation at this stage using method 2, depending on our research priorities (which I'm still learning). We could build a working pipeline first and make further modifications for a research paper later.
> For method 2, would we just say to people "you need AWS CDK for running this code, and here is how you set it up" or would we provide the AWS? If the latter, we may have an issue with costs here.
Yes. My understanding is that for both Methods, the costs come from using the AWS Rekognition service, which is necessary for both of them. However, Method 2 (deploying a CDK) requires more data storage on AWS S3 buckets and interactions with those S3 buckets, which can all increase costs. There should be a free tier for using Rekognition and interacting with the S3 buckets (I don't know how the free-tier limit measures up to our data size. Perhaps Breeze @sheoftensaid knows more about the implications for billing?).
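For the billing question, a back-of-envelope estimate might help frame the discussion. All per-unit prices and the free-tier allowance below are placeholders, not quoted AWS rates; the current Rekognition and S3 pricing pages would need to be checked before trusting any number here:

```python
# Back-of-envelope cost sketch. Every price and free-tier number
# here is a PLACEHOLDER -- check the current AWS pricing pages
# for Rekognition and S3 before relying on any figure.

REKOGNITION_PER_VIDEO_MINUTE = 0.10  # placeholder USD/minute
S3_STORAGE_PER_GB_MONTH = 0.023      # placeholder USD/GB-month
FREE_TIER_VIDEO_MINUTES = 60         # placeholder minutes/month

def estimate_monthly_cost(video_minutes, storage_gb):
    """Rough monthly cost: Rekognition minutes past the free tier,
    plus S3 storage for inputs and raw results."""
    billable_minutes = max(0.0, video_minutes - FREE_TIER_VIDEO_MINUTES)
    rekognition = billable_minutes * REKOGNITION_PER_VIDEO_MINUTE
    storage = storage_gb * S3_STORAGE_PER_GB_MONTH
    return rekognition + storage

# e.g. 400 ads at ~30 seconds each, with 50 GB kept on S3:
print(round(estimate_monthly_cost(400 * 0.5, 50), 2))
```

Even as a sketch, it makes the point that per-minute Rekognition charges, not S3 storage, would likely dominate at our scale.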
One last question @Meiqingx: are there implications for us either way in terms of time or cost in creating modifications for 2022 specifically? I definitely would like us to think about improvements going forward, but we have strong incentives to finish the 2022 dataset in its entirety first. Pinging @sheoftensaid just in case she has anything else to say from a program management perspective (I know she'll want to know about costs in particular).
> Maybe, come up with a first approximation of an implementation for a solution (using method 1, it sounds like), @Meiqingx, to see how it goes?
I was still typing my response to your first comment when your second comment came up, but using Method 1 for now sounds good.
> One last question @Meiqingx: are there implications for us either way in terms of time or cost in creating modifications for 2022 specifically? I definitely would like us to think about improvements going forward, but we have strong incentives to finish the 2022 dataset in its entirety first. Pinging @sheoftensaid just in case she has anything else to say from a program management perspective (I know she'll want to know about costs in particular).
I was just commenting on these exact points while you were typing! :) Yeah, for the additional 2022 dataset I already used the Method 2 pipeline and got the raw results. For any analysis of this new batch of 400 videos that belong to Google 2022, I won't wait until we have the new pipeline (I'm just using a meld of both, whichever works, on my own notebook to get the final results --> unless you want to present not just the data but also the new pipeline? If only the results are sufficient, then we're good!). These questions would only apply to the unpacking of the existing repository for future use, unless there are other urgent tasks that require the application of the new pipeline. If so, let me know.
@Meiqingx For cost reasons, we'd want to avoid rerunning any 2022 data. It sounds like you are suggesting changes only for our 2024 (future) processing, though, so no problem there.
I envisioned this task as separating Jielu's repos into individual repos (as Markus did for his repos) but not changing the methods used. I think it makes sense to keep the 2022 methods as is, but it absolutely makes sense to create 2024 repos to explore other methods for future processing!
It was my understanding that we were trying to change:
| Repo  | Includes                   |
| ----- | -------------------------- |
| Repo1 | task1, task2, task3, task4 |
into
| Repo  | Includes |
| ----- | -------- |
| Repo1 | task1    |
| Repo2 | task2    |
| Repo3 | task3    |
| Repo4 | task4    |
without making changes to the methods used (for 2022).
> I was still typing my [response](https://github.com/Wesleyan-Media-Project/aws-rekognition-image-video-processing/issues/1#issuecomment-1724576633) to your first comment when your second comment came up, but using Method 1 for now sounds good.
Sounds good (and happy to go with the method everyone thinks we should be taking here).
Reconfigure these repositories with specific tasks: `google_2022`, `fb_2022`
High-level questions necessary to move on to the next steps. @SebastianZimmeck @efranklinfowler, your input would be greatly appreciated.
Two options for pipeline design: see the slide deck (comments mode).
To wrap up pros and cons based on my knowledge so far:
Let me know if you have follow-up questions!