georgetown-cset / parat

🦜 PARAT: CSET's Private-sector AI-Related Activity Tracker
https://parat.cset.tech

Add high-profile AI-focused companies #114

Closed za158 closed 8 months ago

za158 commented 1 year ago

There are maybe two dozen high-profile AI companies that may not be in the dataset already (e.g. because they are not publicly traded and were not on the scene when the initial PARAT lists were pulled together way back when). Anthropic is one example (I'm guessing). We need to add them. Zach can do the grunt work.

za158 commented 1 year ago

@jmelot and @rggelles can you help me figure out which of these companies are not covered yet in PARAT, and for those that are not, what information I should provide/in what format to make sure they can be incorporated?

OpenAI, Anthropic, Stability AI, Cohere, Inflection AI, Character.ai, Midjourney, Adept, Synthesia, Aleph Alpha, AI21 Labs, HuggingFace, Zhipu AI, Baichuan Intelligence, Xiaopeng Motors, BYD, Dahua Technology, Dark Side of the Moon, Kunlun Tech, Enflame, Shield AI, Mistral

rggelles commented 11 months ago

I believe only Hugging Face, Dahua Technology, and Shield AI are currently covered.

As for adding, what you'll need to do is add them into the Airtable; I suggest adding them into preannotation. You'll also need to give them a CSET_id if they don't already have one, or use their current one if they do -- you can use the following query to find the current one if it exists:

SELECT * FROM `gcp-cset-projects.ai_companies_draft_052020.core` where regexp_contains(name, r'(?i)anthropic')
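For checking the whole candidate list in one pass, here is a minimal local sketch of the same idea as the query above: a case-insensitive "contains" match on the name. It assumes the core table has already been exported into a list of rows. The `CSET_id` column name, the sample ids, and the helper names are hypothetical, and ignoring spacing and punctuation is an addition that the SQL query does not do (it lets "HuggingFace" match "Hugging Face").

```python
import re


def normalize(name: str) -> str:
    """Lowercase the name and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def find_existing(candidate: str, core_rows: list) -> list:
    """Return rows whose name contains the candidate, ignoring case and spacing."""
    needle = normalize(candidate)
    if not needle:
        # An empty needle would match every row, so treat it as "no match".
        return []
    return [row for row in core_rows if needle in normalize(row["name"])]


# Hypothetical rows standing in for an export of the core table; ids are made up.
core_rows = [
    {"CSET_id": 101, "name": "Hugging Face"},
    {"CSET_id": 102, "name": "Dahua Technology"},
    {"CSET_id": 103, "name": "Shield AI"},
]

for company in ["HuggingFace", "Anthropic", "Shield AI"]:
    ids = [row["CSET_id"] for row in find_existing(company, core_rows)]
    print(company, "->", ids if ids else "not found, needs a new CSET_id")
```

Substring matching can over-match short names (a name like "BYD" may appear inside unrelated company names), so any hits are worth checking by hand before reusing an id.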

They'll need to get added into each of the subtables in preannotation as well, except the github table, which obviously isn't in use. I'd add them to the bgov table (with a blank entry) even though that isn't in use. The one sticking point here is that we still have a grid table and I'm just converting everything to rors in BigQuery, but adding grids isn't really easy or logical anymore given we don't have a grid table.

Given this, there's basically three options:

  1. We add a ror table or something, and I add a second stage to the process of finding rors where I pull from that table and merge it with the rors found in the grid table, or
  2. You use the grid column in the ror table to find linked grids instead. The downside here is that after a year or two of divergence some rors may not have linked grids, plus this continues to use an outdated id that we no longer care about when we should probably update, or
  3. We send our new rors back to Airtable now that we have them and eliminate the grid table entirely, and you update that instead.

My general inclination is that which of these we do depends on how likely we are to keep updating this table long-term. That is, if we intend to regularly add companies here or update our current company data, we definitely want to do (1) or (3) because updating with old grids makes no sense, but (2) will save us a lot of time if we don't. (3) will probably take us slightly more time than (1) in the short term. Perhaps more importantly for you, it will require me to do something before you can get started on this part of the annotation, although you can leave adding rors for last and finish everything else for the companies first. It is likely better in the long term, though, if we want to do lots of updating, since it allows us to modify current company data rather than just adding new company data. I'm open to whichever one you think is best workflow-wise.
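As a rough illustration of the second stage described in option (1), the sketch below unions the rors derived from the grid table with rors entered directly in a new ror table, keyed by CSET_id. The data shapes, the function name, and the placeholder ror strings are all assumptions made for the example; the real pipeline does this in BigQuery and may look quite different.

```python
def merge_rors(grid_derived: dict, manual: dict) -> dict:
    """Union two {CSET_id: set of rors} mappings without mutating either input."""
    merged = {cset_id: set(rors) for cset_id, rors in grid_derived.items()}
    for cset_id, rors in manual.items():
        merged.setdefault(cset_id, set()).update(rors)
    return merged


# Placeholder values only; these are not real ror identifiers or CSET_ids.
grid_derived = {1: {"ror-a"}}                 # rors found via the existing grid table
manual = {1: {"ror-b"}, 2: {"ror-c"}}         # rors entered in the new ror table

merged = merge_rors(grid_derived, manual)
for cset_id in sorted(merged):
    print(cset_id, sorted(merged[cset_id]))
```

A newly added company that has no grid at all (CSET_id 2 here) still ends up with its ror, which is the case that options (1) and (3) handle and option (2) does not.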

za158 commented 11 months ago

(1) seems fine for now, I'll just do that. In terms of maintaining this over the longer term (i.e. switching to (3) or something like it), I think we should revisit along with other org ER discussions in the new year. The bigger issue is integrating the PARAT entity resolution system with all the other ER stuff going on in the merged corpus, etc., which could theoretically affect the approach taken here. cc @jmelot

za158 commented 8 months ago

Tracking elsewhere now