gdg-x / hub

[DEPRECATED] API Data Hub for the Global GDG Community
https://hub.gdgx.io
Apache License 2.0

Write a backend microservice in GAE to handle ingestion from Meetup.com #101

Open Splaktar opened 7 years ago

Splaktar commented 7 years ago

This should use the service: backend annotation in GAE so that it can run in parallel with our frontend web service. It may make sense to make this a separate repo (hub-ingest) or something like that. It should use the same production MongoDB and GDG-X Cloud project, but in no other way should it be coupled to the existing Hub backend.
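For illustration, a minimal app.yaml sketch of that annotation (the service name is just an example, and the runtime/env lines assume the Node.js Flexible environment discussed later in this thread):

```yaml
# Hypothetical app.yaml for the ingestion service (hub-ingest)
service: backend
runtime: nodejs
env: flex
```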

Phases

  1. The first piece of this work is to get this running locally and ingesting a single chapter's Meetup.com events into your local MongoDB
  2. The second phase is to ingest and update all GDG chapters with data from Meetup.com into your local MongoDB
  3. The third phase is to ingest events for all chapters into your local MongoDB. We likely want to look at using PubSub and GAE Cron for this, but we should avoid our current pattern of making as many API calls as fast as possible.
  4. The fourth phase is to ingest other metadata from Meetup.com like organizers, etc. into your local MongoDB
  5. The final phase will be the last round of code reviews, testing, and then production deployment.
VikramTiwari commented 7 years ago

Why not a Cloud Dataflow job to do this? Is the aim to stay as close to GCP as possible, or to be platform independent?

Splaktar commented 7 years ago

Cloud Dataflow would be fine, but does it fit well with the pattern of ingesting via the Meetup.com REST APIs? I personally haven't used it yet. Would the code have to be in Java/Scala?

The project started out fairly GCP independent, as it needed to be hosted freely on Heroku and OpenShift. But now that Google DevRel is covering GCP costs, it is no problem to use GCP-specific tech and designs.

VikramTiwari commented 7 years ago

Yeah! Though Dataflow is not part of the free tier. I plan on writing it in Python using the Apache Beam SDK, so there shouldn't be any issue with platform dependence. We might have to run it as a cron script, though, until either GCP can sponsor gdg-x or Dataflow becomes part of the free tier.

It would be interesting to see Cloud Functions support Python; then we should be able to run it there too, which falls under the free tier.

Splaktar commented 7 years ago

@VikramTiwari It doesn't need to be free. It should of course be cost effective, but we aren't restricted to free as Google DevRel is covering our expenses.

The current ingestion (and entire project) is written in JavaScript (MEAN stack), so bringing in another language (Python) is not ideal. But if it is much easier to do in Python and only the Meetup.com API ingestion piece needs to be in Python, then that should be OK.

Cloud Functions are NodeJS only at this point. It would be possible to run it as a GCF using GAE Cron to call a service that sends a PubSub message. Then the GCF would listen to that topic and do ingestion when the message comes in. I don't really think that this is an ideal solution though.
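For reference, a rough sketch of what such a PubSub-triggered GCF could look like, assuming the Node.js background-function signature GCF used at the time (the topic name, payload shape, and ingest call are placeholders):

```js
// index.js for a hypothetical 'meetup-ingest' Cloud Function.
// GAE Cron calls an HTTP handler that publishes {chapterId: ...} to the
// 'meetup-ingest' topic; this function then runs once per published message.
exports.ingestChapter = (event, callback) => {
  const message = event.data; // PubSub message envelope
  const payload = JSON.parse(Buffer.from(message.data, 'base64').toString());

  console.log(`Ingesting Meetup.com data for chapter ${payload.chapterId}`);
  // ...call the Meetup.com API and upsert the results into MongoDB here...

  callback(); // signal completion
};
```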

What do Dataflow jobs look like? Do you deploy them to Dataflow or are they just GAE services that call Dataflow APIs? Can you detail some of what the Dataflow design would look like please?

VikramTiwari commented 7 years ago

I looked into Dataflow and it seems like Python support is very limited as of now. If the complete app's backend is NodeJS, then a Cloud Function with GAE Cron seems like a good design. Why do you think this is non-ideal? We can even make the function parameterized and HTTP-triggered so that database updates can also happen in real time. Let me know your thoughts.

Splaktar commented 7 years ago

@VikramTiwari We may want to understand the Meetup.com API quotas better before settling on a design.

It seems like we should ingest once per chapter (approx every 6 hours if possible), with that ingest updating the chapter's details, events, and organizers. We could chain these into separate steps via PubSub and GCF.

Current stats:

The second is the official number and it is expected to grow beyond 500 on May 1st. Will Meetup.com be OK with us hitting them with 1500-2500 API requests every 6 hours?

As for deciding on GCF vs GAE, what are the pros and cons?

GAE standard backend service

Pros

Cons

GAE Flex backend service

Pros

Cons

Google Cloud Functions (GCFs)

Pros

Cons

What else am I missing here?

pmoosh commented 7 years ago

For now - watching.

BrockMcKean commented 7 years ago

The hourly limits are included in the response as X-RateLimit headers, described at https://www.meetup.com/meetup_api/#limits. However, https://groups.google.com/forum/#!topic/meetup-api/7xpoMZF43CQ seems to suggest that the default is 200/hr (from 3+ years ago), which means the upper limit every 6 hours would be 1200, which is 300 below your projected minimum of 1500. Assuming this is correct, it will require 3 IP addresses based on @Splaktar's consumption projections.

I don't understand Cloud Endpoints' enforced quotas well enough to give a definitive analysis, but given that the limits are in the headers, it would seem best not to run this on a static cron, but on a callback based on the X-RateLimit headers. I'll take a look at integrating https://www.npmjs.com/package/meetup-api tomorrow.
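To make the callback-on-X-RateLimit idea concrete, here is a minimal sketch of header-driven throttling (the header names follow the Meetup API docs, but whether the reset value is seconds or a timestamp should be verified against real responses):

```js
// Throttle Meetup API calls based on the X-RateLimit response headers
// instead of a fixed cron rate.
const fetch = require('node-fetch');

async function meetupGet(url) {
  const res = await fetch(url);
  const remaining = Number(res.headers.get('X-RateLimit-Remaining'));
  const resetSeconds = Number(res.headers.get('X-RateLimit-Reset'));

  if (remaining === 0 && Number.isFinite(resetSeconds)) {
    // Quota exhausted for this window: wait it out before the next call.
    await new Promise(resolve => setTimeout(resolve, resetSeconds * 1000));
  }
  return res.json();
}
```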

Splaktar commented 7 years ago

@BrockMcKean great, thanks for looking into the Meetup.com API, those details are quite helpful.

As far as integration, the thought so far has been that this Meetup.com ingestion could take place in a separate GAE service. But I didn't really get a response to https://github.com/gdg-x/hub/issues/101#issuecomment-285963149 in order to decide how we should finalize the design.

I don't think that using Cloud Endpoints is really required for doing ingestion from the Meetup.com API.

Should we design the system to publish a task to PubSub for each update that is needed, then pull from PubSub (and process the ingest) at a rate that is acceptable to Meetup.com (i.e. 200/hr)? I'm not sure that Cloud Functions can be configured this way (I think they spin up every time a message is published on a topic).

Splaktar commented 7 years ago

@BrockMcKean Oh and if you are trying to test adding the Meetup.com ingest to the existing project, please use this branch since it contains some pretty significant changes over develop.

We could certainly keep the Meetup.com ingest as part of this same project, which will be deployed to GAE Flex, then just set up a cron job to call a task that publishes to PubSub all of the chapters/events that haven't been updated in x time (around 6-12 hours is what we've done in the past).
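A sketch of what that cron-triggered task could look like (the collection/field names, topic name, and 6-hour cutoff are illustrative, not the Hub's actual schema):

```js
// Handler behind a cron.yaml entry: find chapters that haven't been
// refreshed recently and publish one PubSub ingest task per chapter.
const {PubSub} = require('@google-cloud/pubsub');
const pubsub = new PubSub();

async function enqueueStaleChapters(db) {
  const cutoff = new Date(Date.now() - 6 * 60 * 60 * 1000);
  const stale = await db.collection('chapters')
    .find({ meetupUpdatedAt: { $lt: cutoff } })
    .project({ _id: 1 })
    .toArray();

  const topic = pubsub.topic('meetup-ingest');
  for (const chapter of stale) {
    await topic.publishMessage({
      data: Buffer.from(JSON.stringify({ chapterId: chapter._id })),
    });
  }
  return stale.length;
}
```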

VikramTiwari commented 7 years ago

As for this comment, I think GCF should be our choice, since I don't understand the need for Cloud Endpoints security there. If the requests to the Meetup API are being made from our side and the result comes back to us and gets stored in our database, where does Cloud Endpoints even come into the picture?

Regarding function invocation, a simple PubSub message should be more than enough to trigger it, and that should be really easy to integrate with a cron-based task on the Hub itself.

Splaktar commented 7 years ago

I agree that Cloud Endpoints aren't needed for kicking off the Meetup.com ingestion.

@VikramTiwari Are you envisioning only a single event being sent to PubSub to 'update all data from the Meetup.com API'? How do you envision things working after the GCF is triggered, especially with regard to a 200-requests-per-hour Meetup.com API quota?

VikramTiwari commented 7 years ago

I am not sure if the 200-requests-per-hour limit is still applicable. Most API teams have started using a rolling average to limit abusive usage of their APIs, based on various factors such as time, IP address, credentials, etc. If there is an upper limit, then we will need to tune our code. The way I envision the code is to run these ingestions based upon data classification. A very basic classification is:

This is just to demonstrate; I am not sure what the best breakdown or the best time duration is. But we should not use a blanket value for all the data.

Also, as @BrockMcKean mentioned, the X-RateLimit headers will be pretty useful. If we are approaching the limits, we can slow down our queries, though this might not hold in the case of async.parallel queries.

Also, I am pretty sure that if we ask Meetup support, they can provide a special higher quota for our application, but first let's make sure that we are optimized.

Splaktar commented 7 years ago

OK, sounds like we should start working on a GCF for doing Meetup.com API ingestion. We can mess with the PubSub triggering stuff after we get ingestion into Mongo working.

VikramTiwari commented 7 years ago

@Splaktar I can use my API key, but I will need a sample chapter ID to start getting the data and working through the rest of the process.

@BrockMcKean If you want to take the lead on this, go ahead. I won't have much time, but if no one is working on it actively, I will try to contribute as much as I can.

Splaktar commented 7 years ago

@VikramTiwari Sample chapter ID on the Hub: 103959793061819610212. Not sure what the Meetup.com chapter ID is other than gdgspacecoast.

BrockMcKean commented 7 years ago

I'll commit an integration of Meetup's Node implementation tomorrow. An ongoing personal/family issue has been taking up most of my spare time, but it's under control now.

@VikramTiwari's comment on different times for different endpoints/data is something that will need to be addressed. Application usage and refresh requests could be used to build a stack for the GCF to pull from so that the most-used data gets refreshed first, or extremely infrequently changed data could be refreshed manually by an admin faster than waiting for a cron. We'll just have to talk to Meetup about increasing the limit, but the header data should be used either way, so that should have no effect on the GCF.

In either case, API and data first... scheduling later.

Splaktar commented 7 years ago

Sounds good @BrockMcKean. Any luck on getting the API working?

BrockMcKean commented 7 years ago

@Splaktar I started with the wrong branch and my father's illness butted in again. Have a lot of help now. I'm working on something for Angular Attack, but I'll do a PR to the branch you specified on Sunday. Sorry for the delay.

Splaktar commented 7 years ago

@BrockMcKean OK, no rush to do anything this Sunday, just wanted to check in. Good luck with Angular Attack! I did it last year, but just didn't have time this year.

BrockMcKean commented 7 years ago

@Splaktar I didn't really have time either (this year). Just did it for fun. Didn't finish my submission in time to really have much of a chance. I'm working on this ingestion job now.

BrockMcKean commented 7 years ago

So, as far as I can tell, there are two ways of grabbing multiple associated groups at once:

  1. /:urlname, which grabs a single group based on its urlname, like gdgspacecoast (see the sketch after this list). The problem with this approach is simply that the urlnames do not have any particular relationship with the chapter name: lots of urlnames with and without -, capitalized and lowercase, with prefixes and suffixes that are not present in the name field retrieved from the devsite.

  2. Use /pro/:urlname/groups to grab all the groups from https://meetup.com/pro/gdg, but that requires a key with admin privileges for the GDG Pro community. So, if we could get a key with admin privileges for the Meetup Pro GDG community, we'd be able to grab all of the groups without knowing their urlnames upfront.
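As referenced in option 1, a minimal sketch of the /:urlname lookup (the key/sign query parameters follow the old API-key auth scheme and should be double-checked against the current docs):

```js
// Fetch a single Meetup group by its urlname.
const fetch = require('node-fetch');

async function fetchGroup(urlname, apiKey) {
  const url = `https://api.meetup.com/${urlname}?key=${apiKey}&sign=true`;
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`Meetup API returned ${res.status} for ${urlname}`);
  }
  return res.json(); // group name, link, member count, next_event, etc.
}

// Example: fetchGroup('gdgspacecoast', process.env.MEETUP_API_KEY).then(console.log);
```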

However, even if we did that, there is no relational data in common between the devsite Chapter and the Meetup Group, which means the organizers would have to link their Meetup group manually. The only other option would be to scrape https://meetup.com/pro/gdg, which we can do, but that's probably not within the ToS.

Therefore, instead of trying to ingest all of the data at once, it might make more sense to allow the organizers to link their accounts via OAuth and ingest the data at that time. We could hook into the G+ API to send automated messages about gdg-x/hub features as they are added, to try to encourage them to link their accounts and start using hub.gdg-x.io for their meetups.

I'm trying to get an OAuth Meetup client working in a separate Node project and will add it to clients/meetup.js and place the keys in config/keys.js as soon as it's firing properly.

BrockMcKean commented 7 years ago

Quick update: I have BrockMcKean/meetup-test doing an API-key call to get a group by urlname and an OAuth2 call to get the authenticated user's groups. I'm trying to determine which view should display an "integrate with Meetup" button for organizers, and then require (and perhaps revalidate?) Google OAuth before allowing the Meetup OAuth. But there's no separate "profile" area for authorized users... so I'm wondering if this is a good reason to start that view?

@Splaktar @VikramTiwari Does this seem like the correct implementation?

Splaktar commented 7 years ago

@BrockMcKean Authorized users on the Hub UI don't really work very well atm. We should probably tear out the Passport stuff and replace it with Firebase Auth, but that's out of scope for this work.

I can look into getting the Meetup Pro key for the GDG. We would have to make sure that it never got committed to Git.

BrockMcKean commented 7 years ago

Is the Hub supposed to be solely for unified GDG information across G+ and Meetup? If so, we should probably try to get a Meetup Pro community admin key. If it's to provide a superior event service to attendees and organizers, wouldn't users be the root dependency?

Splaktar commented 7 years ago

@BrockMcKean Yes, the Hub is a data hub. The UI is purely for debugging and development use, not for GDG organizers to use regularly. The idea is that projects like Firefly, Boomerang, and Frisbee will use the Hub data APIs to build rich, targeted applications.

Splaktar commented 7 years ago

I wrote up the request for the Meetup Pro Community API Key with a clear justification and it was rejected by Jennifer Kohl within 1 day.

She suggested that we follow the process being used by GDG Beijing member, Jackie Han, to build this reporting tool.

My notes about the process:

So you are scraping the GDG names from https://www.meetup.com/pro/gdg/, which does not give you all of the global GDGs, but at least gives you the set of global GDGs that most recently posted an event?

I reviewed the Meetup.com API License and the Meetup.com Terms and I can't find any text in there that prohibits web scraping (most sites prohibit this).

Then when you have the names, you are able to use the Meetup.com API to do queries for each chapter by name.

Update from Jackie:

https://www.meetup.com/pro/gdg/ contains all meetup groups in the gdg pro network, including name and link.
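A minimal sketch of that scrape-then-query approach (the link selector and urlname regex are guesses about the page markup and would need to be verified against the live page):

```js
// Scrape group urlnames from the GDG Pro network page, then query the
// public Meetup API for each group (e.g. with a fetchGroup(urlname) helper).
const fetch = require('node-fetch');
const cheerio = require('cheerio');

async function listGdgUrlnames() {
  const html = await (await fetch('https://www.meetup.com/pro/gdg/')).text();
  const $ = cheerio.load(html);
  const urlnames = new Set();

  // Assumption: each chapter card links to https://www.meetup.com/<urlname>/
  $('a[href*="meetup.com/"]').each((_, el) => {
    const match = ($(el).attr('href') || '').match(/meetup\.com\/([^/?#]+)\/?$/);
    if (match) {
      urlnames.add(match[1]);
    }
  });
  return [...urlnames];
}
```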

What do you guys think about this design? Should we combine scraping of the Meetup.com website with opt-in by chapters (providing their Meetup.com Chapter name)?

My personal feeling is that scraping the website is very fragile and could lead to both maintenance and policy problems down the road. I also feel that requiring opt-in from every global chapter is not ideal at all. That said, I've kicked off a new work project and won't be available to work on this for 4+ months, so it's really up to you guys and what you think you can get working.

pmoosh commented 7 years ago

I didn't realize how the data is generated until now. Editing the attendance on Meetup is a pain in the neck. So far we estimated the numbers; now we have to click up to 200 times per Meetup...

Splaktar commented 7 years ago

@pmoosh that's a bit off-topic, but you don't need to click 200 times. Just click on the 'Checked In' tab and enter a total attendee count. You don't actually need to check in all attendees one by one.

BrockMcKean commented 7 years ago

Scraping is about ingesting data that is immutable or antifragile for a relatively long period of time (6+ months). The primary reason for that is to avoid rewriting the scraper all the time. At least with a versioned API there is plenty of lead time to handle changes.

But scraping https://www.meetup.com/pro/gdg/ once and using the gathered data with the existing API is only one point of failure, and not much different from using https://developers.google.com/groups/directorygroups/ to gather group names. Perhaps the UI will change without notice, and perhaps that would cause the Hub to have outdated data while a fix was applied, but the changed UI would still be structured, repetitive HTML, so the fix would be relatively minor.

Personally, I don't see the problem with using OAuth if, at the end of the day, the data and the Hub will continue to be used to build other tools for GDGs. Why not just build one of the tools? Does it really need to be an Android app? Why not a PWA?

I've also been working on other things and don't plan on having any time for this in the near future. Perhaps towards the end of July.