Closed za158 closed 4 months ago
Ok I just reminded myself of the current state of things:
Looking at the above, what I understand is that we want to take the company descriptions (and no other information) from https://airtable.com/appwussU2z2a8jzmy/tblwblqKY0jyu82At/viw2QzkAGOFMKkyZl?blocks=hide and add it to the "Main" view in https://airtable.com/appqtfVHf5tIRsfkI/tbl12gdOy2YSf0whH/viwQlYk4DOCtDp0o1?blocks=hide .
To do this, unless I hear that something else would be better (open to other ideas, just making a suggestion), I will add four new columns to the consolidated base:
- description_source (one of "crunchbase", "wikipedia", "company")
- description_link
- description_retrieved (when we retrieved the description)
- description
And populate these with metadata from Crunchbase if available, then Wikipedia, then the company sites. I will then update the pipeline and data preprocessing scripts to read the descriptions from Airtable/BigQuery. Does that work for you?
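For concreteness, the fallback order amounts to something like this (a hypothetical helper, just to make the priority explicit, not code from the pipeline):

```python
def pick_description(crunchbase, wikipedia, company):
    """Return (description, source) using the proposed priority:
    Crunchbase first, then Wikipedia, then the company site."""
    for text, source in ((crunchbase, "crunchbase"),
                         (wikipedia, "wikipedia"),
                         (company, "company")):
        if text:
            return text, source
    # No description available from any source
    return None, None
```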
Seems basically fine, except for tidiness' sake I would prefer to start a new "Descriptions" table with those four columns plus one to identify the company. OK if I go ahead and set that table up, and then you can populate it?
That's ok. I will say that I find separate tables for metadata that should have a 1-1 mapping to companies a bit weird (i.e. companies shouldn't have multiple descriptions), but maybe this makes your process easier somehow - in any case, I can work with it.
It's just slightly easier to work with in the Airtable web interface, honestly. Not a huge deal either way
Ok let's do it your way then, nbd
@za158 I populated the table with what data we have.
Notes to self: I imported the old annotator sheet from my local CSV into tmp.supplemental_company_descriptions
and the new data from https://airtable.com/appwussU2z2a8jzmy/tblwblqKY0jyu82At/viw2QzkAGOFMKkyZl?blocks=hide into tmp.supplemental_company_descriptions_new, then ran:
```sql
with old_desc as (
  select
    company_name,
    wikipedia_description,
    wikipedia_description_link,
    company_description,
    company_description_link,
    if(wikipedia_description is null,
       if(company_description is null, null, Description_retrieval_date__company_page_),
       Description_retrieval_date__Wikipedia_) as retrieval_date
  from
    tmp.supplemental_company_descriptions_new
  union all
  select
    company_name,
    wikipedia_description,
    wikipedia_description_link,
    company_description,
    company_description_link,
    retrieval_date
  from
    tmp.supplemental_company_descriptions
  where company_name not in (select company_name from tmp.supplemental_company_descriptions_new)
)
select
  airtable_id as Organization,
  coalesce(short_description, wikipedia_description, company_description) as description,
  if(short_description is null,
     if(wikipedia_description is null,
        if(company_description is null, null, "company"),
        "wikipedia"),
     "crunchbase") as description_source,
  if(short_description is null,
     if(wikipedia_description is null,
        if(company_description is null, null, company_description_link),
        wikipedia_description_link),
     crunchbase.crunchbase_url) as description_link,
  if(short_description is null,
     if(wikipedia_description is null,
        if(company_description is null, null, retrieval_date),
        retrieval_date),
     "2024-05-16") as retrieval_date
from
  ai_companies_visualization.all_visualization_data
left join
  old_desc
  on lower(name) = lower(company_name)
left join
  `gcp-cset-projects.staging_airtable_to_bq.000_main_raw` as crunchbase
  using (cset_id)
```
and downloaded the results locally. I then updated Airtable using this script:
```python
import argparse
import csv

import requests
from more_itertools import batched


def insert_into_airtable(data: list, token: str) -> None:
    """
    Inserts a list of data into Airtable
    :param data: List of data (as dicts) to insert
    :param token: Airtable access token
    :return: None
    """
    headers = {"Authorization": f"Bearer {token}"}
    for elt in data:
        # Organization is a linked-record field, so it must be a list of record ids
        elt["Organization"] = [elt["Organization"]]
        # Replace empty strings from the CSV with nulls
        for field in elt:
            if not elt[field]:
                elt[field] = None
    reformatted_data = {"records": [{"fields": elt} for elt in data]}
    result = requests.post(
        "https://api.airtable.com/v0/appqtfVHf5tIRsfkI/Description",
        json=reformatted_data,
        headers=headers,
    )
    if result.status_code != 200:
        print(result.text)
        raise ValueError(f"Unexpected status code: {result.status_code}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_file")
    args = parser.parse_args()
    with open(args.input_file) as f:
        # Airtable's create-records endpoint accepts at most 10 records per request
        for batch in batched(csv.DictReader(f), 10):
            insert_into_airtable(list(batch), <access token>)
```
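For what it's worth, the payload reshaping in the script (wrapping the linked-record field in a list and nulling empty strings) can be pulled out into a pure function that's easy to unit-test. `to_airtable_records` is just an illustrative name, not part of the script above:

```python
def to_airtable_records(rows):
    """Reshape CSV rows into an Airtable create-records payload:
    wrap the linked-record Organization field in a list and turn
    empty strings into None so Airtable leaves those cells blank."""
    records = []
    for row in rows:
        # Empty CSV cells come through as "", which Airtable should see as null
        fields = {key: (value or None) for key, value in row.items()}
        if fields.get("Organization"):
            fields["Organization"] = [fields["Organization"]]
        records.append({"fields": fields})
    return {"records": records}
```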
I will now work on updating the preprocessing script to make use of the new Airtable table; in the meantime, your annotators should be good to fill in the gaps.
Move https://airtable.com/appwussU2z2a8jzmy/tblwblqKY0jyu82At/viw2QzkAGOFMKkyZl?blocks=hide into https://airtable.com/appqtfVHf5tIRsfkI/tbl12gdOy2YSf0whH/viwQlYk4DOCtDp0o1?blocks=hide, add data currently not in airtable (descriptions for PARAT v1 companies, I guess), figure out what's missing