georgetown-cset / parat

🦜 PARAT: CSET's Private-sector AI-Related Activity Tracker
https://parat.cset.tech

Consolidate company descriptions into consolidated base and integrate into data #275

Closed za158 closed 4 months ago

za158 commented 6 months ago

Move https://airtable.com/appwussU2z2a8jzmy/tblwblqKY0jyu82At/viw2QzkAGOFMKkyZl?blocks=hide into https://airtable.com/appqtfVHf5tIRsfkI/tbl12gdOy2YSf0whH/viwQlYk4DOCtDp0o1?blocks=hide, add data currently not in airtable (descriptions for PARAT v1 companies, I guess), figure out what's missing

jmelot commented 4 months ago

Ok I just reminded myself of the current state of things:

Looking at the above, what I understand is that we want to take the company descriptions (and no other information) from https://airtable.com/appwussU2z2a8jzmy/tblwblqKY0jyu82At/viw2QzkAGOFMKkyZl?blocks=hide and add it to the "Main" view in https://airtable.com/appqtfVHf5tIRsfkI/tbl12gdOy2YSf0whH/viwQlYk4DOCtDp0o1?blocks=hide .

To do this, unless I hear that something else would be better (open to other ideas, just making a suggestion), I will add four new columns to the consolidated base:

And populate these with metadata from Crunchbase if available, then Wikipedia, then the company sites. I will then update the pipeline and data preprocessing scripts to read the descriptions from Airtable/BigQuery. Does that work for you?

za158 commented 4 months ago

Seems basically fine, except for tidiness' sake I would prefer to start a new "Descriptions" table with those four columns plus one to identify the company. OK if I go ahead and set that table up, and then you can populate it?

jmelot commented 4 months ago

That's ok. I will say that I find separate tables for metadata that should have a 1-1 mapping to companies a bit weird (i.e. companies shouldn't have multiple descriptions), but maybe this makes your process easier somehow - in any case, I can work with it.

za158 commented 4 months ago

It's just slightly easier to work with in the Airtable web interface, honestly. Not a huge deal either way

jmelot commented 4 months ago

Ok let's do it your way then, nbd

jmelot commented 4 months ago

@za158 I populated the table with what data we have.

Notes to self - I imported the old annotator sheet from my local csv into tmp.supplemental_company_descriptions and the new data from https://airtable.com/appwussU2z2a8jzmy/tblwblqKY0jyu82At/viw2QzkAGOFMKkyZl?blocks=hide into tmp.supplemental_company_descriptions_new, then ran:

```sql
with old_desc as (
  select
    company_name,
    wikipedia_description,
    wikipedia_description_link,
    company_description,
    company_description_link,
    if(wikipedia_description is null, if(company_description is null, null, Description_retrieval_date__company_page_), Description_retrieval_date__Wikipedia_) as retrieval_date
  from
    tmp.supplemental_company_descriptions_new
  union all
  select
    company_name,
    wikipedia_description,
    wikipedia_description_link,
    company_description,
    company_description_link,
    retrieval_date
  from
    tmp.supplemental_company_descriptions
  where company_name not in (select company_name from tmp.supplemental_company_descriptions_new)
)
SELECT
  airtable_id as Organization,
  coalesce(short_description, wikipedia_description, company_description) as description,
  if(short_description is null,
    if(wikipedia_description is null,
      if(company_description is null, null, "company"),
    "wikipedia"),
  "crunchbase") as description_source,
  if(short_description is null,
    if(wikipedia_description is null,
      if(company_description is null, null, company_description_link),
    wikipedia_description_link),
  crunchbase.crunchbase_url) as description_link,
  if(short_description is null,
    if(wikipedia_description is null,
      if(company_description is null, null, retrieval_date),
    retrieval_date),
  "2024-05-16") as retrieval_date
FROM
  ai_companies_visualization.all_visualization_data
left join
  old_desc
on lower(name) = lower(company_name)
left join
  `gcp-cset-projects.staging_airtable_to_bq.000_main_raw`
using (cset_id)
```
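The nested IF expressions in the query encode a simple source-priority fallback (Crunchbase short description first, then Wikipedia, then the company page). A rough Python restatement for reference - field names mirror the query, but this is a sketch, not part of the pipeline:

```python
def pick_description(short_description, wikipedia_description, company_description):
    """Mirror of the SQL fallback: the Crunchbase short_description wins,
    then the Wikipedia description, then the company-page description."""
    if short_description:
        return short_description, "crunchbase"
    if wikipedia_description:
        return wikipedia_description, "wikipedia"
    if company_description:
        return company_description, "company"
    return None, None
```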

and downloaded the results locally. I then updated Airtable with this script:

```python
import argparse
import csv

import requests
from more_itertools import batched


def insert_into_airtable(data: list, token: str) -> None:
    """
    Inserts a list of data into airtable
    :param data: List of data (as dicts) to insert
    :param token: Airtable access token
    :return: None
    """
    headers = {"Authorization": f"Bearer {token}"}
    for elt in data:
        # Linked-record fields must be lists of record ids
        elt["Organization"] = [elt["Organization"]]
        # Empty CSV fields arrive as empty strings; send nulls instead
        for field in elt:
            if not elt[field]:
                elt[field] = None
    reformatted_data = {"records": [{"fields": elt} for elt in data]}
    result = requests.post(
        "https://api.airtable.com/v0/appqtfVHf5tIRsfkI/Description",
        json=reformatted_data,
        headers=headers,
    )
    if result.status_code != 200:
        print(result.text)
        raise ValueError(f"Unexpected status code: {result.status_code}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_file")
    args = parser.parse_args()

    with open(args.input_file) as f:
        # The Airtable create-records endpoint accepts at most 10 records per request
        for batch in batched(csv.DictReader(f), 10):
            insert_into_airtable(batch, <access token>)
```

I will now work on updating the preprocessing script to make use of the new Airtable table - in the meantime, your annotators should be good to fill in the gaps.
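For the preprocessing side, reading the new table back out of Airtable is just the list-records endpoint with offset-based pagination. A hedged sketch of what that read could look like - the `Description` table name matches the upload script above, everything else (function names, field handling) is illustrative:

```python
import requests

API_URL = "https://api.airtable.com/v0/appqtfVHf5tIRsfkI/Description"


def fetch_descriptions(token: str) -> dict:
    """Page through the Description table and return a mapping keyed by
    linked Organization record id."""
    headers = {"Authorization": f"Bearer {token}"}
    records, params = [], {}
    while True:
        resp = requests.get(API_URL, headers=headers, params=params)
        resp.raise_for_status()
        page = resp.json()
        records.extend(page["records"])
        # Airtable signals that more pages remain with an "offset" token
        if "offset" not in page:
            break
        params["offset"] = page["offset"]
    return index_by_organization(records)


def index_by_organization(records: list) -> dict:
    """Key each record's fields by the first linked Organization record id."""
    out = {}
    for rec in records:
        fields = rec.get("fields", {})
        orgs = fields.get("Organization") or []
        if orgs:
            out[orgs[0]] = fields
    return out
```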