DataKind-DC / capital-nature-ingest

Scripts for ingesting data for Capital Nature

BUG: Audubon Naturalist Society scraper not pulling all events #246

Closed akaahanui closed 3 years ago

akaahanui commented 3 years ago

Expected Behavior: The Audubon Naturalist Society scraper is not pulling all events. Please scrape their event calendar: https://anshome.org/events-calendar/

Current Behavior: Only a fraction of their events are being pulled by the existing scraper. We are current through September, so this would be a fix for October onward.

Possible Solution: Scrape the event calendar.

Context: This is an important event source for CN, so please prioritize it.

csmcallister commented 3 years ago

This event scraper will automatically attempt to go out 3 months from the current month to get events (e.g. if today is in September, it will try to scrape October, November and December). The issue with the missing events is that they don't have clearly listed venues, and the venue is a required field. The scrapers automatically kick out events missing required fields in order to prevent problems when y'all upload the data in CSV format. Two examples of events with hard-to-discern or missing venue info are this one (https://anshome.org/events/hoa-advocacy-101-nov2020/) and this one (https://anshome.org/events/thrive-2050-at-ans-2/).
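To make the two behaviors above concrete, here is a minimal sketch and not the repo's actual code. The function names and the field names in `REQUIRED_FIELDS` are hypothetical, chosen only for illustration; the real scraper may structure this differently.

```python
from datetime import date

# Hypothetical field names; the real required columns may differ.
REQUIRED_FIELDS = ("Event Name", "Event Start Date", "Event Venue Name")


def months_to_scrape(today, n_months=3):
    """Return (year, month) pairs for the n_months after today's month."""
    pairs = []
    year, month = today.year, today.month
    for _ in range(n_months):
        month += 1
        if month > 12:  # roll over into the next year
            month = 1
            year += 1
        pairs.append((year, month))
    return pairs


def drop_incomplete(events, required=REQUIRED_FIELDS):
    """Keep only events where every required field is present and non-blank."""
    return [
        event for event in events
        if all(str(event.get(field) or "").strip() for field in required)
    ]
```

Under these assumptions, `months_to_scrape(date(2020, 9, 8))` yields October, November and December 2020, and an event whose venue is blank or missing is dropped by `drop_incomplete`, which is why only a fraction of the calendar comes through.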

For other event sources such as Casey Trees, Arlington County and Fairfax County, we use a dummy value for the venue if we cannot find one while scraping: "See event website". @akaahanui would you like to use "See event website" for those cases, like the two linked above, where scraping the event venue isn't feasible?
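The fallback proposed here could look roughly like the sketch below. It is an illustration only: the `"Event Venue Name"` key and the `with_venue_fallback` helper are hypothetical names, and only the placeholder text "See event website" comes from this thread.

```python
DEFAULT_VENUE = "See event website"


def with_venue_fallback(event, default=DEFAULT_VENUE):
    """Return a copy of the event with a placeholder venue if none was scraped."""
    # "Event Venue Name" is an assumed key, not necessarily the real column name.
    venue = (event.get("Event Venue Name") or "").strip()
    return {**event, "Event Venue Name": venue or default}
```

Applying this before the required-field check would let events without a discernible venue pass through with the placeholder instead of being kicked out.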

akaahanui commented 3 years ago

Thanks Scott! I like your suggestion. I have forwarded to Stella and I'll get back to you tomorrow.

With Aloha, Ana




csmcallister commented 3 years ago

@akaahanui any word from Stella?

akaahanui commented 3 years ago

@csmcallister Hi Scott - I haven't heard, so let's proceed as you suggested. Thanks!