CouncilDataProject / cdp-scrapers

Scratchpad for scraper development and general utilities.
https://councildataproject.org/cdp-scrapers
Mozilla Public License 2.0

Washington, DC events scraper #75

Open dphoria opened 2 years ago

dphoria commented 2 years ago

Feature Description

Provide a module under cdp_scrapers/instances/, for example cdp_scrapers/instances/dc.py, with a function that returns Washington, DC city council meetings for a given period of time as List[EventIngestionModel], e.g.

get_events(begin: Optional[datetime] = None, end: Optional[datetime] = None) -> List[EventIngestionModel]

Use Case

This module and function would be used to deploy a CDP instance for Washington, DC.

dphoria commented 2 years ago

Here is a very good document prepared by @AZIZXlaouiti . https://docs.google.com/document/d/1vFXOAFsGK5AOCbfcvt-LwFR71IUiTkubwa8EcBwgbHQ

cc @Shak2000

dphoria commented 2 years ago

I have hit a roadblock. I can't quite come up with a clean way to get events for a given period of time. So far, the main candidate information sources are

  1. https://dccouncil.us/events/list/?tribe_event_display=past&tribe_paged=1
  2. http://dc.granicus.com/viewpublisher.php?view_id=2
  3. https://lims.dccouncil.us/api/help/index.html

I have yet to figure out how to query any of the above sources with a period of time as parameter(s). Sources 1 and 2 can be used to retrieve the most recent N meetings.

Source 3, an API, seemed like a good choice going in, but I am not so warm to it now. It is a good resource for information about specific bills, laws, etc. For meeting information such as agendas, though, it seems almost useless to me.

evamaxfield commented 2 years ago

What about this? https://dccouncil.us/events/2022-01/

You can fill in any year and month and then find the day in the calendar?

dphoria commented 2 years ago

> What about this? https://dccouncil.us/events/2022-01/
>
> You can fill in any year and month and then find the day in the calendar?

Oh man why didn't I think of this approach! I even saw that calendar before LOL. Yes I think this is, at least to me, the best route I've seen thus far. :+1: Awesome.
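
A minimal sketch of the calendar approach, assuming the month view keeps the https://dccouncil.us/events/YYYY-MM/ pattern from the example URL. The helper name month_urls is hypothetical; it only builds the URLs to visit for a period of time and does not fetch or parse anything.

```python
from datetime import date
from typing import Iterator

# Month-view URL pattern, assumed from the example https://dccouncil.us/events/2022-01/
CALENDAR_URL = "https://dccouncil.us/events/{year:04d}-{month:02d}/"


def month_urls(begin: date, end: date) -> Iterator[str]:
    """Yield one calendar URL per month, from begin's month through end's month."""
    year, month = begin.year, begin.month
    while (year, month) <= (end.year, end.month):
        yield CALENDAR_URL.format(year=year, month=month)
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


for url in month_urls(date(2021, 12, 15), date(2022, 2, 1)):
    print(url)
# https://dccouncil.us/events/2021-12/
# https://dccouncil.us/events/2022-01/
# https://dccouncil.us/events/2022-02/
```

Each month page would then still need to be filtered down to the days that fall inside the requested period.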

dphoria commented 2 years ago

Finally got around to making a first draft. It only gets the minimal event information for now. https://gist.github.com/dphoria/7bea514b1a201f33ade2cf8c8d9fa707 I made it a stand-alone file for now for easier development and testing.

import washington_dc
from datetime import datetime
washington_dc.get_events_on_date(datetime(2022, 2, 1))
[
    EventIngestionModel(
        body=Body(name='Committee of the Whole', is_active=True, start_datetime=None, description=None, end_datetime=None, external_source_id=None),
        sessions=[
            Session(
                session_datetime=datetime.datetime(2022, 2, 1, 12, 0),
                video_uri='http://archive-media.granicus.com:443/OnDemand/dc/dc_2bc5049c-4415-4cbe-a069-35623328a371.mp4',
                session_index=0,
                caption_uri='https://dc.granicus.com/TranscriptViewer.php?view_id=4&clip_id=7039',
                external_source_id=None,
            ),
        ],
        event_minutes_items=None,
        agenda_uri='https://dccouncil.us/wp-content/uploads/2022/01/2.1.22-COW-Agenda_ADDITIONAL-1.pdf',
        minutes_uri=None,
        static_thumbnail_uri=None,
        hover_thumbnail_uri=None,
        external_source_id=None,
    ),
    EventIngestionModel(
        body=Body(name='City Council', is_active=True, start_datetime=None, description=None, end_datetime=None, external_source_id=None),
        sessions=[
            Session(
                session_datetime=datetime.datetime(2022, 2, 1, 13, 0),
                video_uri='http://archive-media.granicus.com:443/OnDemand/dc/dc_dc26ab8b-ac05-48cd-968e-94ba67282a87.mp4',
                session_index=0,
                caption_uri='https://dc.granicus.com/TranscriptViewer.php?view_id=3&clip_id=7040',
                external_source_id=None,
            ),
        ],
        event_minutes_items=None,
        agenda_uri='https://dccouncil.us/wp-content/uploads/2021/12/February-1-2022-Legislative-Meeting-2.pdf',
        minutes_uri=None,
        static_thumbnail_uri=None,
        hover_thumbnail_uri=None,
        external_source_id=None,
    ),
]
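
One way the per-date function could be wrapped into the get_events(begin, end) API requested at the top of this issue is sketched below. This is not what the gist does. The name get_events_in_range and the default two-day window are assumptions, and the per-date function is passed in as a parameter so the sketch runs without the scraper module.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")  # stands in for EventIngestionModel


def get_events_in_range(
    get_events_on_date: Callable[[datetime], List[T]],
    begin: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> List[T]:
    """Collect events for every calendar day from begin through end, inclusive."""
    # Assumed defaults: end is now, begin is two days before end.
    if end is None:
        end = datetime.now()
    if begin is None:
        begin = end - timedelta(days=2)

    events: List[T] = []
    # Walk day by day, starting at midnight of the first day.
    day = datetime(begin.year, begin.month, begin.day)
    while day.date() <= end.date():
        events.extend(get_events_on_date(day))
        day += timedelta(days=1)
    return events
```

With the draft above, a call might look like get_events_in_range(washington_dc.get_events_on_date, datetime(2022, 2, 1), datetime(2022, 2, 3)).
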
dphoria commented 2 years ago

The foremost question in my head is the best way to get votes. I think it would be https://lims.dccouncil.us/ (API: https://lims.dccouncil.us/api/help/index.html), using information parsed from an event page like https://dccouncil.us/event/legislative-meeting-86/

dphoria commented 2 years ago

What is highly disappointing is that I thought DC used to list an event's minutes items in the lower-left table of their video player. That seems to be no longer the case?

e.g. on http://dc.granicus.com/ViewPublisher.php?view_id=3, click any "Video" link on the right. The popup is largely empty, with just the video. It used to have a lot of useful information that we could have used to get EventMinutesItem, etc.

AZIZXlaouiti commented 2 years ago

@dphoria I noticed that too, along with missing PDF documents, and sometimes captions aren't available either.

evamaxfield commented 2 years ago

Nice job!!

Can't comment on the PDF documents, but I wouldn't worry if captions are only sometimes available. Seattle has captions for roughly 95% of meetings. If captions aren't available, we will fall back to Google. No worries.

Excited to see this progress!!

dphoria commented 2 years ago

Any luck in adding to the scraper, @AZIZXlaouiti? I've been working on other issues recently and probably will be for another couple of weeks. After that I may be able to hop back on this if necessary. Anyway, just wanted to check in.

AZIZXlaouiti commented 2 years ago

@dphoria I had some busy weeks (family and interview related), so I wasn't as active as I wanted to be, but I will resume the work this week. My apologies.

AZIZXlaouiti commented 2 years ago

@dphoria I managed to get event_minutes_items added. I parsed the PDF from agenda_uri and extracted all the legislation numbers. Next I'll have to use the LIMS API to get the votes, vote status, and persons.
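
A rough sketch of the legislation number step, assuming the agenda PDF has already been converted to plain text by some PDF library. The function name and the prefix mapping are hypothetical. The mapping is inferred only from the two examples in this thread ("Bill 24-117" becomes "B24-0117", "CER 24-125" becomes "CER24-0125"), so other legislation types would need their own entries.

```python
import re
from typing import List, Tuple

# Label-to-prefix mapping inferred from the two examples in this thread only.
PREFIXES = {"Bill": "B", "CER": "CER"}

# Matches labels such as "Bill 24-117" or "CER 24-125" in agenda text.
PATTERN = re.compile(r"\b(Bill|CER)\s+(\d+)-(\d+)\b")


def find_legislation(text: str) -> List[Tuple[str, str]]:
    """Return (label as written, normalized number) pairs found in agenda text."""
    found = []
    for match in PATTERN.finditer(text):
        kind, period, number = match.groups()
        # Zero-pad the sequence number to four digits, e.g. 117 -> 0117.
        normalized = f"{PREFIXES[kind]}{period}-{int(number):04d}"
        found.append((match.group(0), normalized))
    return found


print(find_legislation("1. Bill 24-117, Armstead Barnett Way Designation Act of 2021"))
# [('Bill 24-117', 'B24-0117')]
```

The label could serve as MinutesItem.name and the normalized number as Matter.name, matching the output posted in the next comment.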

AZIZXlaouiti commented 2 years ago

https://gist.github.com/AZIZXlaouiti/b3b0ccab24a1fbd0586fb8756fc85c1c

[
    EventIngestionModel(
        body=Body(name='Committee of the Whole', is_active=True, start_datetime=None, description=None, end_datetime=None, external_source_id=None),
        sessions=[
            Session(
                session_datetime=datetime.datetime(2022, 2, 1, 12, 0),
                video_uri='http://archive-media.granicus.com:443/OnDemand/dc/dc_2bc5049c-4415-4cbe-a069-35623328a371.mp4',
                session_index=0,
                caption_uri='https://dc.granicus.com/TranscriptViewer.php?view_id=4&clip_id=7039',
                external_source_id=None,
            ),
        ],
        event_minutes_items=[
            EventMinutesItem(
                minutes_item=MinutesItem(name='Bill 24-117', description=None, external_source_id=None),
                index=None,
                matter=Matter(name='B24-0117', matter_type=None, title='Armstead Barnett Way Designation Act of 2021', result_status=None, sponsors=None, external_source_id=None),
                supporting_files=None,
                decision=None,
                votes=None,
            ),
        ],
        agenda_uri='https://dccouncil.us/wp-content/uploads/2022/01/2.1.22-COW-Agenda_ADDITIONAL-1.pdf',
        minutes_uri=None,
        static_thumbnail_uri=None,
        hover_thumbnail_uri=None,
        external_source_id=None,
    ),
    EventIngestionModel(
        body=Body(name='City Council', is_active=True, start_datetime=None, description=None, end_datetime=None, external_source_id=None),
        sessions=[
            Session(
                session_datetime=datetime.datetime(2022, 2, 1, 13, 0),
                video_uri='http://archive-media.granicus.com:443/OnDemand/dc/dc_dc26ab8b-ac05-48cd-968e-94ba67282a87.mp4',
                session_index=0,
                caption_uri='https://dc.granicus.com/TranscriptViewer.php?view_id=3&clip_id=7040',
                external_source_id=None,
            ),
        ],
        event_minutes_items=[
            EventMinutesItem(
                minutes_item=MinutesItem(name='CER 24-125', description=None, external_source_id=None),
                index=None,
                matter=Matter(name='CER24-0125', matter_type=None, title='Beverly Odoms-Johnson Posthumous Recognition Ceremonial Resolution of 2022', result_status=None, sponsors=None, external_source_id=None),
                supporting_files=None,
                decision=None,
                votes=None,
            ),
        ],
        agenda_uri='https://dccouncil.us/wp-content/uploads/2021/12/February-1-2022-Legislative-Meeting-2.pdf',
        minutes_uri=None,
        static_thumbnail_uri=None,
        hover_thumbnail_uri=None,
        external_source_id=None,
    ),
]
dphoria commented 2 years ago

> @dphoria i had some busy weeks (family / interview) related so i wasn't active as i wanted to be but i will resume the work this week . My apologies.

No, absolutely no need for any apologies. :smile: I was just curious.