georgetown-cset / parat

🦜 PARAT: CSET's Private-sector AI-Related Activity Tracker
https://parat.cset.tech
Other
5 stars 0 forks source link

Use new table of CSET_id to unnormalized org name linkages #315

Closed jmelot closed 2 months ago

jmelot commented 2 months ago

Closes #310

Still needed:

@rggelles this is the continuation of https://github.com/georgetown-cset/parat/pull/304. It still needs some polish, but I think this is a good time to start to get your input. The basic idea of what I'm doing here is replacing get_ai_counts.py and its callers with a series of sql queries that use the new org_name_matches table to link org names to CSET ids.

I want to make sure this method makes sense to both of us. If you're willing, I think a good place to start would be understanding this count discrepancy: https://github.com/georgetown-cset/parat/pull/315/files#r1587841664

If you could just take a look at that query and confirm that the method makes sense to you, I can run the pipeline up to the production copy stage and come back with a longer writeup on diffs between old and new counts.

github-actions[bot] commented 2 months ago

No need for rebasing :+1: behind_count is 0 ahead_count is 18

github-actions[bot] commented 2 months ago

β˜‚οΈ Python Coverage

current status: βœ…

Overall Coverage

Lines Covered Coverage Threshold Status
572 370 65% 0% 🟒

New Files

No new covered files...

Modified Files

No covered modified files...

updated for commit: 232791f by action🐍

github-actions[bot] commented 2 months ago

JavaScript Coverage

Summary

Lines Statements Branches Functions
Coverage: 67%
67.91% (398/586) 57.78% (193/334) 68.04% (132/194)
Modified Files β€’ (67%)
File% Stmts% Branch% Funcs% LinesUncovered Line #s
All files67.9157.7868.0467.57 
report-only-changed-files is enabled. No files were changed in this commit :)
jmelot commented 2 months ago

@rggelles I updated the regex matching to be case insensitive as we discussed last week and reran.

tl;dr, we either need to do regex search over more or all orgs or expand our ROR id coverage to all organizations. Currently over half of orgs do not have a ROR id (from select count(0) from high_resolution_entities.aggregated_organizations where array_length(ror_id) = 0 = 1179). Eventually all organizations should be linked to ROR (they're doing research, after all) but if that isn't feasible short-term we can add regex search over all orgs.

I'm not sure if the ROR coverage issue is something @za158's annotators are working on/due to some other data "vintage" issue, though (updating the pipeline to use the new airtable base is next on my to-do list). If it is and we expect ROR coverage to increase in the short term we may be able to keep the current method. We might also be able to get higher ROR coverage by joining on the ROR dataset using website or even just normalized name if we haven't already done that.

Anyway, in the below, staging_ai_companies_visualization_old refers to a copy of the old parat staging dataset before the new queries were run (if you look around you may see some new tables there due to me running some experiments before doing this copy...)

ai company papers counts

select count(0) from staging_ai_companies_visualization_old.ai_company_papers = 130,154 select count(0) from staging_ai_companies_visualization.ai_company_papers = 127,942

I used this query to investigate (the idea is to look at all affiliations for a given merged id, so we're looking to see if at least one affiliation matches - we expect there to be ones that don't match in the results):

select old_papers.*, name, org_name, affiliations.ror_id from staging_ai_companies_visualization_old.ai_company_papers as old_papers
left join
staging_ai_companies_visualization.ai_company_papers as new_papers
on 
(new_papers.CSET_id = old_papers.CSET_id) and (new_papers.merged_id = old_papers.merged_id)
left join 
staging_ai_companies_visualization_old.visualization_data 
on old_papers.CSET_id = visualization_data.CSET_id
left join
literature.affiliations
on old_papers.merged_id = affiliations.merged_id
where new_papers.CSET_id is null
order by merged_id 
limit 1000

Eyeballing these results, several of these omissions seem correct, but there's an obvious issue (highlighted)

Screenshot 2024-05-06 at 9 09 06 AM

This is due to this org not having a ror assigned in high_resolution_entities, but having a ror in affiliations, and so getting excluded from regex matching but also not getting matched using ror id.

select * from high_resolution_entities.aggregated_organizations where CSET_id = 845 - no ror

ai patent counts

These increased

select count(0) from staging_ai_companies_visualization_old.ai_company_patents = 151,423 select count(0) from staging_ai_companies_visualization.ai_company_patents = 174,773

select count(0) from staging_ai_companies_visualization_old.ai_company_patent_grants = 69,583 select count(0) from staging_ai_companies_visualization.ai_company_patent_grants = 80,818

There are some missing links, again apparently due to ROR ids missing, from eyeballing

select old_patents.*, name, assignee, assignees_normalized.ror_id from staging_ai_companies_visualization_old.ai_company_patents as old_patents
left join
staging_ai_companies_visualization.ai_company_patents as new_patents
on 
(new_patents.CSET_id = old_patents.CSET_id) and (new_patents.family_id = old_patents.family_id)
left join 
staging_ai_companies_visualization_old.visualization_data 
on old_patents.CSET_id = visualization_data.CSET_id
left join
unified_patents.assignees_normalized
on old_patents.family_id = assignees_normalized.family_id
where new_patents.CSET_id is null
order by family_id 
limit 1000
Screenshot 2024-05-06 at 9 18 31 AM Screenshot 2024-05-06 at 9 19 04 AM

But eyeballing some new affiliations, they look ok

select new_patents.*, name, assignee, assignees_normalized.ror_id from staging_ai_companies_visualization.ai_company_patents as new_patents
left join
staging_ai_companies_visualization_old.ai_company_patents as old_patents
on 
(new_patents.CSET_id = old_patents.CSET_id) and (new_patents.family_id = old_patents.family_id)
left join 
staging_ai_companies_visualization_old.visualization_data 
on new_patents.CSET_id = visualization_data.CSET_id
left join
unified_patents.assignees_normalized
on new_patents.family_id = assignees_normalized.family_id
where old_patents.CSET_id is null
order by family_id 
limit 1000
Screenshot 2024-05-06 at 9 20 28 AM

I also eyeballed the old and new versions of the final all_visualization_data for Google and Hugging Face. Both lost a lot of publications. HF has no ROR id in high_resolution_entities and this seems to be the source of the issue.

Next steps

Let me know what you want to do orgs we haven't assigned RORs to in PARAT. Once that's settled one way or another, I'll rerun and hand this over to you for further checking if that works?

rggelles commented 2 months ago

So, to clarify here: we fully annotated every single ROR organization in PARAT.

The gaps you are seeing here are because it has been 1.5 years since our annotation efforts completed, and ROR has added new organizations in that time. For example, Hugging Face did not have a ROR when we annotated; according to ROR it looks most likely it was created 2022-10-27, which is a month or two after we had wrapped up our annotation. I honestly have no idea what could have gone wrong with Synopsys, because it absolutely has GRIDs (4!) in the annotation table, and those GRIDs are linked to ROR in the ROR table, so I have no idea why it would not have ROR in high_resolution_entities; some error is happening here, but it's not an annotation problem.

I don't know if there's an easy way from your end to see the organizations whose paper totals dropped the most between the two versions, but if so, it seems like the easiest solution would be to see if there's a small set of those and if those are organizations that acquired ROR in the past 1.5 years (or had whatever weird linking problem Synopsys is having, which we should get fixed). Then we could add RORs for just those, and solve this. I doubt the number is that high? But I suppose I could be wrong.

Otherwise, we could just switch to running over everything, if this is too difficult or time-consuming.

jmelot commented 2 months ago

Interesting. I just did select * from high_resolution_entities.aggregated_organizations where array_length(ror_id) = 0 and three of the first 10 records that popped up for me (e.g https://ror.org/04g9yb321) had ROR ids circa 2021. The other 7 weren't obviously in ror (i.e. I didn't immediately see a matching org when I searched in the ROR website). We should submit all missing PARAT orgs to ROR (ETA: that have publications), maybe after PARAT v2 finally gets out the door. I'm familiar with the process for doing this in bulk from doing it with James, Daniel, and Zach on another dataset so happy to do it, or support if you'd like to do it.

Regardless, I think we should either commit to a process that matches each org to ROR ids in the near-term and going forward, or switch back to doing regular expression search over all orgs regardless of whether they have ROR ids.

I assume you're not able to support further ROR matching, so I will discuss that with Zach. In the meantime, would you be able to take a look at why we're failing to match some of these orgs that were apparently in ROR in 2021, just in case it's symptomatic of some other issue? Once I confirm that we're not going to have comprehensive ROR matches in the near term I'll switch the query over to matching all orgs and can ping you again to review once the data is updated.

za158 commented 2 months ago

Just noting that many PARAT orgs will have 0 research publications, so it makes sense that they lack ROR/GRID - and for the same reason I'm not sure ROR will take them...

Zachary Arnold Analytic Lead, Emerging Technology Observatory (web http://eto.tech | LinkedIn https://www.linkedin.com/company/emerging-technology-observatory | Substack http://etoblog.substack.com) Center for Security and Emerging Technology @.***

On Mon, May 6, 2024 at 3:35β€―PM Jennifer Melot @.***> wrote:

Interesting. I just did select * from high_resolution_entities.aggregated_organizations where array_length(ror_id) = 0 and three of the first 10 records that popped up for me (e.g https://ror.org/04g9yb321) had ROR ids circa 2021. The other 7 weren't obviously in ror (i.e. I didn't immediately see a matching org when I searched in the ROR website). We should submit all missing PARAT orgs to ROR, maybe after PARAT v2 finally gets out the door. I'm familiar with the process for doing this in bulk from doing it with James, Daniel, and Zach on another dataset so happy to do it, or support if you'd like to do it.

Regardless, I think we should either commit to a process that matches each org to ROR ids in the near-term and going forward, or switch back to doing regular expression search over all orgs regardless of whether they have ROR ids.

I assume you're not able to support further ROR matching, so I will discuss that with Zach. In the meantime, would you be able to take a look at why we're failing to match some of these orgs that were apparently in ROR in 2021, just in case it's symptomatic of some other issue? Once I confirm that we're not going to have comprehensive ROR matches in the near term I'll switch the query over to matching all orgs and can ping you again to review once the data is updated.

β€” Reply to this email directly, view it on GitHub https://github.com/georgetown-cset/parat/pull/315#issuecomment-2096766070, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOAZR6ZVNIBZFVUBA4KQPNLZA7LQTAVCNFSM6AAAAABHCRNX2SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJWG43DMMBXGA . You are receiving this because you were mentioned.Message ID: @.***>

jmelot commented 2 months ago

Yeah I should have been more specific and said "that have publications"

jmelot commented 2 months ago

Anyway, Zach's annotators haven't been adding RORs, so I think we should just fall back to doing regex search for all orgs if we don't have a way of keeping the ROR ids up to date right now. I do think it would still be a good thing to figure out why Synopsys etc don't have RORs though, mostly to make sure it isn't symptomatic of something else. But if you approve @rggelles I'll just switch over to doing regex search on everything

rggelles commented 2 months ago

There does appear to be some linkage issue, because West Pharmaceutical (the example you linked) again has GRID in our annotation table. Just like Synopsys. I'd guess this is a problem with the GRID-ROR linkage code; I can look into this. But there's no annotation issue here; all of these have been annotated fully, and ones that are lacking codes in our annotation table almost definitely either don't have codes or had codes added in the past 1.5 years.

As @za158 says, many many PARAT orgs won't have codes, and ROR won't take them. They don't have any paper output, so they're not ROR orgs. They're still interesting to us though; they have patents, AI workers, etc. That's one of the primary reasons we moved to include regex in the first place.

jmelot commented 2 months ago

Gotcha - but fyi Zach had the good idea to just count the number of orgs with pubs but no rors and it's 845 from SELECT count(0) FROM ai_companies_visualization.all_visualization_data where array_length(ror_id) = 0 and all_pubs > 0

jmelot commented 2 months ago

That's presumably a combination of the ROR circa 1.5 years ago coverage issue + the Synopsis issue + current ROR coverage issues in some unknown proportions. Thanks for looking at the Synopsis issue. In the meantime, I'll switch over to doing regex searches for all orgs

rggelles commented 2 months ago

Okay, to fix the Synopsys issue, what needs to be done is the following:

Go here and add a TRIM() around grid_joined.grid.

This will solve the fact that some of the GRID annotation fields apparently must have ended up with spaces in them πŸ€¦πŸ»β€β™€οΈ

jmelot commented 2 months ago

Ok thanks - I made that change and also run regex on everything now. select count(0) from staging_ai_companies_visualization.ai_company_papers = 142,992 which makes more sense since the number of papers increased since this was last run. @rggelles can you take the review from here?