za158 closed 8 months ago
@jmelot and @rggelles can you help me figure out which of these companies are not covered yet in PARAT, and for those that are not, what information I should provide/in what format to make sure they can be incorporated?
OpenAI, Anthropic, Stability AI, Cohere, Inflection AI, Character.ai, Midjourney, Adept, Synthesia, Aleph Alpha, AI21 Labs, HuggingFace, Zhipu AI, Baichuan Intelligence, Xiaopeng Motors, BYD, Dahua Technology, Dark Side of the Moon, Kunlun Tech, Enflame, Shield AI, Mistral
I believe only Hugging Face, Dahua Technology, and Shield AI are currently covered.
As for adding, what you'll need to do is add them into the Airtable; I suggest adding them into preannotation. You'll also need to give them a CSET_id if they don't already have one, or use their current one if they do -- you can use the following query to find the current one if it exists:
SELECT * FROM `gcp-cset-projects.ai_companies_draft_052020.core` WHERE REGEXP_CONTAINS(name, r'(?i)anthropic')
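Since there are a couple dozen names to check, it may be easier to build one lookup for the whole list than to edit the query by hand each time. The following is only a sketch: the helper names `escape_re2` and `build_lookup_query` are made up for illustration, and it assumes the `name` column and table from the query above. It only builds the query string; it does not run anything against BigQuery.

```python
import re

# Table name copied from the query above.
CORE_TABLE = "gcp-cset-projects.ai_companies_draft_052020.core"


def escape_re2(name):
    """Backslash-escape regex metacharacters so that a name such as
    'Character.ai' is matched literally (hypothetical helper)."""
    return re.sub(r"([\\.+*?()|\[\]{}^$])", r"\\\1", name)


def build_lookup_query(names):
    """Build one case-insensitive lookup covering every name in `names`
    (hypothetical helper; assumes no name contains a single quote)."""
    alternation = "|".join(escape_re2(n) for n in names)
    return (
        f"SELECT * FROM `{CORE_TABLE}` "
        f"WHERE REGEXP_CONTAINS(name, r'(?i)({alternation})')"
    )


if __name__ == "__main__":
    print(build_lookup_query(["Anthropic", "Character.ai", "AI21 Labs"]))
```

Note that REGEXP_CONTAINS matches substrings, so short names such as "BYD" or "Adept" may also pull in unrelated rows; the results would still need a manual check before reusing a CSET_id.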
They'll need to get added into each of the subtables in preannotation as well, except the github table, which obviously isn't in use. I'd add them to the bgov table (with a blank entry) even though that isn't in use. The one sticking point here is that we still have a grid table and I'm just converting everything to rors in BigQuery, but adding grids isn't really easy or logical anymore given we don't have a grid table.
Given this, there are basically three options:
My general inclination is that which of these we do should depend on how likely we are to keep updating this table long-term. That is, if we intend to regularly add companies here or update our current company data, we definitely want to do (1) or (3), because updating with old grids makes no sense; if we don't, (2) will save us a lot of time. (3) will probably take slightly more time than (1) in the short term (and, perhaps more importantly for you, will require me to do something before you can get started on this part of the annotation, although you can leave adding rors for last and finish everything else for the companies first), but it is likely better in the long term if we want to do lots of updating, since it allows us to modify current company data rather than just add new company data. I'm open to whichever one you think is best workflow-wise.
(1) seems fine for now; I'll just do that. In terms of maintaining this over the longer term (i.e., switching to (3) or something like it), I think we should revisit along with other org ER discussions in the new year. The bigger issue is integrating the PARAT entity resolution system with all the other ER stuff going on in the merged corpus, etc., which could theoretically affect the approach taken here. cc @jmelot
Tracking elsewhere now
There are maybe two dozen high-profile AI companies that may not be in the dataset already (e.g., because they are not publicly traded and were not on the scene when the initial PARAT lists were pulled together way back when). Anthropic is one example (I'm guessing). We need to add them. Zach can do the grunt work.