Closed (by @Meiqingx, 12 months ago)
Thanks @Meiqingx!
I can see pros and cons to both, and so I'm genuinely interested to hear what @SebastianZimmeck thinks. Some specific questions that would help me in coming to a decision are as follows: (1) method 1 is more code and effort on the part of users, but also seems straightforward in how to use and modify -- do you agree? (2) method 2 is more hands off, but can anyone take the settings of the S3 and modify them if they want? Or are they locked into the pre-configuration? Finally, (3) can you say a little more about utilization of method 2 -- can anyone take the settings and use it such that they are responsible for paying for their own processing?
@SebastianZimmeck @efranklinfowler Your input would be greatly appreciated.
My thoughts at the moment:
Maybe I can ask, @Meiqingx: is method 1 just an implementation task or a research task? If the former, it seems to me that we could optimize the time and resources we spend by using method 2. On the other hand, if method 1 is a research task and we wanted to publish a paper based on our own custom method, we could go with that.
For method 2, would we just say to people "you need AWS CDK for running this code, and here is how you set it up" or would we provide the AWS? If the latter, we may have an issue with costs here.
A relevant issue: My thoughts on notebooks versus Object-Oriented-Programming: @SebastianZimmeck @efranklinfowler
I think the most user-friendly approach would be to build the basic modules as Python classes and, alongside them, provide a notebook for tutorials (example software developed by social scientists: a few object-oriented modules plus a tutorial to import and apply them in notebook format).
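To make the "classes plus tutorial notebook" idea concrete, here is a minimal sketch. The class name, method names, and label format are all hypothetical placeholders (loosely modeled on Rekognition-style label output), not taken from the actual repos:

```python
# Hypothetical sketch: a small class-based module that a tutorial
# notebook could import. All names here are illustrative only.

class VideoLabelExtractor:
    """Wraps one pipeline task so a notebook only needs a few lines."""

    def __init__(self, min_confidence: float = 80.0):
        # Labels below this confidence score are dropped.
        self.min_confidence = min_confidence

    def filter_labels(self, raw_labels):
        """Keep only label names at or above the confidence threshold."""
        return [
            lab["Name"]
            for lab in raw_labels
            if lab["Confidence"] >= self.min_confidence
        ]

# In a tutorial notebook, usage would then be just:
extractor = VideoLabelExtractor(min_confidence=90.0)
labels = extractor.filter_labels(
    [{"Name": "Person", "Confidence": 99.2},
     {"Name": "Car", "Confidence": 71.5}]
)
print(labels)  # ['Person']
```

The point is that the class hides the task's mechanics, while the notebook stays a short, readable walkthrough.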
Let me know if I should open a different issue for this
Unassigning @bella-tassone from this issue as she already has enough on her plate (though, feel free to chime in, @bella-tassone).
@efranklinfowler, on your questions:
> (1) method 1 is more code and effort on the part of users, but also seems straightforward in how to use and modify -- do you agree?
I would think that generally it is more code and effort. It may be that it can be simplified and broken down. It depends on the exact details.
> (2) method 2 is more hands off, but can anyone take the settings of the S3 and modify them if they want? Or are they locked into the pre-configuration?
If people would be using their own AWS as opposed to ours (which I assume), they could make their own configurations. I would think that all configuration options available on AWS would be open to them, minus the ones that our code assumes (if any).
> (3) can anyone take the settings and use it such that they are responsible for paying for their own processing?
Yes, I think that is the case.
> A relevant issue: My thoughts on notebooks versus Object-Oriented-Programming
I do not have any particular preferences. I would say we should pick whatever lends itself best for the task at hand and not shoehorn it into any technologies that are ill-fitting.
> I think the most user friendly approach would be to build the basic modules in python objects (classes), and meanwhile provide a notebook for tutorials
Generally, that may work. What I would want to avoid is to ask people to download x, set up y, and also get z, and so on. We should not overwhelm people with lots of different technologies and tasks to do. And whatever we pick should be easy to use and allow our target audience (academics who know a little bit about handling data and interested general audience members?) to make sense of and run the code.
> Let me know if I should open a different issue for this
We can discuss the plan here. Once we are more clear, please feel free to close this issue and open individual issues on the execution of the different steps of the plan.
> Thanks @Meiqingx!
>
> I can see pros and cons to both, and so I'm genuinely interested to hear what @SebastianZimmeck thinks. Some specific questions that would help me in coming to a decision are as follows: (1) method 1 is more code and effort on the part of users, but also seems straightforward in how to use and modify -- do you agree? (2) method 2 is more hands off, but can anyone take the settings of the S3 and modify them if they want? Or are they locked into the pre-configuration? Finally, (3) can you say a little more about utilization of method 2 -- can anyone take the settings and use it such that they are responsible for paying for their own processing?
1) Yeah, I also feel like method 1 is more intuitive for an average user and perhaps easier to modify. 2) My understanding of method 2 is that users are locked into their pre-configuration, but among the tasks available in the pre-defined pipeline, they can choose which task or combination of tasks to perform. They don't have to perform all of them. But both of these approaches would require users to set up their own AWS credentials and pay for their own processing.
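The "fixed pipeline, user-selected tasks" idea could look something like the sketch below. The task names and registry are hypothetical placeholders; in practice each entry would call into the pre-configured AWS pipeline rather than return a string:

```python
# Hypothetical sketch of Method 2's task selection: the set of
# pipeline stages is fixed, but a user picks which ones to run.
# Task names are placeholders, not the real pipeline's.

TASKS = {
    "label_detection": lambda video: f"labels for {video}",
    "text_detection": lambda video: f"text for {video}",
    "face_detection": lambda video: f"faces for {video}",
}

def run_pipeline(video, selected_tasks):
    """Run only the chosen subset of the pre-configured tasks."""
    unknown = set(selected_tasks) - TASKS.keys()
    if unknown:
        raise ValueError(f"not part of the pre-configured pipeline: {unknown}")
    return {name: TASKS[name](video) for name in selected_tasks}

# Run two of the three available tasks on one video:
results = run_pipeline("ad_001.mp4", ["label_detection", "text_detection"])
```

So users cannot add new stages, but they are not forced to pay for stages they do not need.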
Maybe, come up with a first approximation of an implementation for a solution (using method 1, it sounds like), @Meiqingx, to see how it goes?
> My thoughts at the moment:
>
> Maybe I can ask, @Meiqingx: is method 1 just an implementation task or a research task? If the former, it seems to me that we could optimize the time and resources we spend by using method 2. On the other hand, if method 1 is a research task and we wanted to publish a paper based on our own custom method, we could go with that.
I do think it would be ideal if we could turn the pipeline itself into a research project (for example, this project used the AWS Rekognition service in a specific subject domain and was published in Nature Methods).
However, I speculate Method 2 could potentially turn into a research task too, conditional on significant changes to the pipeline, which may or may not take more effort to build than Method 1 (this example is still from Amazon, but shows that (maybe?) we could potentially build our own CDK stack for political ads analysis from scratch).
That said, I'm open to prioritizing implementation at this stage using method 2, depending on our research priorities (which I'm still learning). We could build a working pipeline first and make further modifications for a research paper later.
> For method 2, would we just say to people "you need AWS CDK for running this code, and here is how you set it up" or would we provide the AWS? If the latter, we may have an issue with costs here.
Yes. My understanding is that for both Methods, the costs come from using the AWS Rekognition service, which is necessary for both of them. However, Method 2 (deploying a CDK) requires more data storage on AWS S3 buckets and interactions with those S3 buckets, which can all increase costs. There should be a free tier for using Rekognition and interacting with the S3 buckets (I don't know how the free-tier limit measures up to our data size. Perhaps Breeze @sheoftensaid knows more about the implications for billing?).
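For the billing question, a back-of-envelope estimate might help frame the discussion. All per-unit prices and the free-tier allowance below are placeholders, not quoted AWS rates; the current Rekognition and S3 pricing pages would need to be checked before trusting any number here:

```python
# Back-of-envelope cost sketch. Every price and free-tier number
# here is a PLACEHOLDER -- check the current AWS pricing pages
# for Rekognition and S3 before relying on any figure.

REKOGNITION_PER_VIDEO_MINUTE = 0.10  # placeholder USD/minute
S3_STORAGE_PER_GB_MONTH = 0.023      # placeholder USD/GB-month
FREE_TIER_VIDEO_MINUTES = 60         # placeholder minutes/month

def estimate_monthly_cost(video_minutes, storage_gb):
    """Rough monthly cost: Rekognition minutes past the free tier,
    plus S3 storage for inputs and raw results."""
    billable_minutes = max(0.0, video_minutes - FREE_TIER_VIDEO_MINUTES)
    rekognition = billable_minutes * REKOGNITION_PER_VIDEO_MINUTE
    storage = storage_gb * S3_STORAGE_PER_GB_MONTH
    return rekognition + storage

# e.g. 400 ads at ~30 seconds each, with 50 GB kept on S3:
print(round(estimate_monthly_cost(400 * 0.5, 50), 2))
```

Even as a sketch, it makes the point that per-minute Rekognition charges, not S3 storage, would likely dominate at our scale.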
One last question @Meiqingx: are there implications for us either way in terms of time or cost in creating modifications for 2022 specifically? I definitely would like us to think about improvements going forward, but we have strong incentives to finish the 2022 dataset in its entirety first. Pinging @sheoftensaid just in case she has anything else to say from a program management perspective (I know she'll want to know about costs in particular).
> Maybe, come up with a first approximation of an implementation for a solution (using method 1, it sounds like), @Meiqingx, to see how it goes?
I was still typing my response to your first comment when your second comment came up, but using Method 1 for now sounds good.
> One last question @Meiqingx: are there implications for us either way in terms of time or cost in creating modifications for 2022 specifically? I definitely would like us to think about improvements going forward, but we have strong incentives to finish the 2022 dataset in its entirety first. Pinging @sheoftensaid just in case she has anything else to say from a program management perspective (I know she'll want to know about costs in particular).
I was just commenting on these exact points while you were typing! :) Yeah, for the additional 2022 dataset I already used the Method 2 pipeline and got the raw results. For any analysis of this new batch of 400 videos that belong to Google 2022, I won't wait until we have the new pipeline (I'm just using a meld of both, whichever works, on my own notebook to get the final results --> unless you want to present not just the data but also the new pipeline? If only the results are sufficient, then we're good!). These questions would only apply to the unpacking of the existing repository for future use, unless there are other urgent tasks that require the application of the new pipeline. If so, let me know.
@Meiqingx For cost reasons, we'd want to avoid rerunning any 2022 data. It sounds like you are suggesting changes only for our 2024 (future) processing, though, so no problem there.
I envisioned this task as separating Jielu's repos into individual repos (as Markus did for his repos) but not changing the methods used. I think it makes sense to keep the 2022 methods as is, but it absolutely makes sense to create 2024 repos to explore other methods for future processing!
It was my understanding that we were trying to change:
| Repo  | Includes                   |
| ----- | -------------------------- |
| Repo1 | task1, task2, task3, task4 |
into
| Repo  | Includes |
| ----- | -------- |
| Repo1 | task1    |
| Repo2 | task2    |
| Repo3 | task3    |
| Repo4 | task4    |
without making changes to the methods used (for 2022).
> I was still typing my [response](https://github.com/Wesleyan-Media-Project/aws-rekognition-image-video-processing/issues/1#issuecomment-1724576633) to your first comment when your second comment came up, but using Method 1 for now sounds good.
Sounds good (and happy to go with the method everyone thinks we should be taking here).
Reconfigure these repositories with specific tasks: `google_2022`, `fb_2022`
High-level questions necessary to move on to the next steps. @SebastianZimmeck @efranklinfowler, your input would be greatly appreciated.
Two options for pipeline design: see the slide deck (comments mode).
To wrap up pros and cons based on my knowledge so far:
Let me know if you have follow-up questions!