Closed jmelot closed 2 months ago
No need for rebasing :+1: behind_count is 0 ahead_count is 18
current status: β
Overall Coverage
Lines Covered Coverage Threshold Status 572 370 65% 0% π’ New Files
No new covered files...
Modified Files
No covered modified files...
updated for commit:
232791f
by actionπ
@rggelles I updated the regex matching to be case insensitive as we discussed last week and reran.
tl;dr, we either need to do regex search over more or all orgs or expand our ROR id coverage to all organizations. Currently over half of orgs do not have a ROR id (from select count(0) from high_resolution_entities.aggregated_organizations where array_length(ror_id) = 0
= 1179). Eventually all organizations should be linked to ROR (they're doing research, after all) but if that isn't feasible short-term we can add regex search over all orgs.
I'm not sure if the ROR coverage issue is something @za158's annotators are working on/due to some other data "vintage" issue, though (updating the pipeline to use the new airtable base is next on my to-do list). If it is and we expect ROR coverage to increase in the short term we may be able to keep the current method. We might also be able to get higher ROR coverage by joining on the ROR dataset using website or even just normalized name if we haven't already done that.
Anyway, in the below, staging_ai_companies_visualization_old
refers to a copy of the old parat staging dataset before the new queries were run (if you look around you may see some new tables there due to me running some experiments before doing this copy...)
select count(0) from staging_ai_companies_visualization_old.ai_company_papers
= 130,154
select count(0) from staging_ai_companies_visualization.ai_company_papers
= 127,942
I used this query to investigate (the idea is to look at all affiliations for a given merged id, so we're looking to see if at least one affiliation matches - we expect there to be ones that don't match in the results):
select old_papers.*, name, org_name, affiliations.ror_id from staging_ai_companies_visualization_old.ai_company_papers as old_papers
left join
staging_ai_companies_visualization.ai_company_papers as new_papers
on
(new_papers.CSET_id = old_papers.CSET_id) and (new_papers.merged_id = old_papers.merged_id)
left join
staging_ai_companies_visualization_old.visualization_data
on old_papers.CSET_id = visualization_data.CSET_id
left join
literature.affiliations
on old_papers.merged_id = affiliations.merged_id
where new_papers.CSET_id is null
order by merged_id
limit 1000
Eyeballing these results, several of these omissions seem correct, but there's an obvious issue (highlighted)
This is due to this org not having a ror assigned in high_resolution_entities
, but having a ror in affiliations
, and so getting excluded from regex matching but also not getting matched using ror id.
select * from high_resolution_entities.aggregated_organizations where CSET_id = 845 - no ror
These increased
select count(0) from staging_ai_companies_visualization_old.ai_company_patents
= 151,423
select count(0) from staging_ai_companies_visualization.ai_company_patents
= 174,773
select count(0) from staging_ai_companies_visualization_old.ai_company_patent_grants
= 69,583
select count(0) from staging_ai_companies_visualization.ai_company_patent_grants
= 80,818
There are some missing links, again apparently due to ROR ids missing, from eyeballing
select old_patents.*, name, assignee, assignees_normalized.ror_id from staging_ai_companies_visualization_old.ai_company_patents as old_patents
left join
staging_ai_companies_visualization.ai_company_patents as new_patents
on
(new_patents.CSET_id = old_patents.CSET_id) and (new_patents.family_id = old_patents.family_id)
left join
staging_ai_companies_visualization_old.visualization_data
on old_patents.CSET_id = visualization_data.CSET_id
left join
unified_patents.assignees_normalized
on old_patents.family_id = assignees_normalized.family_id
where new_patents.CSET_id is null
order by family_id
limit 1000
But eyeballing some new affiliations, they look ok
select new_patents.*, name, assignee, assignees_normalized.ror_id from staging_ai_companies_visualization.ai_company_patents as new_patents
left join
staging_ai_companies_visualization_old.ai_company_patents as old_patents
on
(new_patents.CSET_id = old_patents.CSET_id) and (new_patents.family_id = old_patents.family_id)
left join
staging_ai_companies_visualization_old.visualization_data
on new_patents.CSET_id = visualization_data.CSET_id
left join
unified_patents.assignees_normalized
on new_patents.family_id = assignees_normalized.family_id
where old_patents.CSET_id is null
order by family_id
limit 1000
I also eyeballed the old and new versions of the final all_visualization_data
for Google and Hugging Face. Both lost a lot of publications. HF has no ROR id in high_resolution_entities
and this seems to be the source of the issue.
Let me know what you want to do orgs we haven't assigned RORs to in PARAT. Once that's settled one way or another, I'll rerun and hand this over to you for further checking if that works?
So, to clarify here: we fully annotated every single ROR organization in PARAT.
The gaps you are seeing here are because it has been 1.5 years since our annotation efforts completed, and ROR has added new organizations in that time. For example, Hugging Face did not have a ROR when we annotated; according to ROR it looks most likely it was created 2022-10-27, which is a month or two after we had wrapped up our annotation. I honestly have no idea what could have gone wrong with Synopsys, because it absolutely has GRIDs (4!) in the annotation table, and those GRIDs are linked to ROR in the ROR table, so I have no idea why it would not have ROR in high_resolution_entities; some error is happening here, but it's not an annotation problem.
I don't know if there's an easy way from your end to see the organizations whose paper totals dropped the most between the two versions, but if so, it seems like the easiest solution would be to see if there's a small set of those and if those are organizations that acquired ROR in the past 1.5 years (or had whatever weird linking problem Synopsys is having, which we should get fixed). Then we could add RORs for just those, and solve this. I doubt the number is that high? But I suppose I could be wrong.
Otherwise, we could just switch to running over everything, if this is too difficult or time-consuming.
Interesting. I just did select * from high_resolution_entities.aggregated_organizations where array_length(ror_id) = 0
and three of the first 10 records that popped up for me (e.g https://ror.org/04g9yb321) had ROR ids circa 2021. The other 7 weren't obviously in ror (i.e. I didn't immediately see a matching org when I searched in the ROR website). We should submit all missing PARAT orgs to ROR (ETA: that have publications), maybe after PARAT v2 finally gets out the door. I'm familiar with the process for doing this in bulk from doing it with James, Daniel, and Zach on another dataset so happy to do it, or support if you'd like to do it.
Regardless, I think we should either commit to a process that matches each org to ROR ids in the near-term and going forward, or switch back to doing regular expression search over all orgs regardless of whether they have ROR ids.
I assume you're not able to support further ROR matching, so I will discuss that with Zach. In the meantime, would you be able to take a look at why we're failing to match some of these orgs that were apparently in ROR in 2021, just in case it's symptomatic of some other issue? Once I confirm that we're not going to have comprehensive ROR matches in the near term I'll switch the query over to matching all orgs and can ping you again to review once the data is updated.
Zachary Arnold Analytic Lead, Emerging Technology Observatory (web http://eto.tech | LinkedIn https://www.linkedin.com/company/emerging-technology-observatory | Substack http://etoblog.substack.com) Center for Security and Emerging Technology @.***
On Mon, May 6, 2024 at 3:35β―PM Jennifer Melot @.***> wrote:
Interesting. I just did select * from high_resolution_entities.aggregated_organizations where array_length(ror_id) = 0 and three of the first 10 records that popped up for me (e.g https://ror.org/04g9yb321) had ROR ids circa 2021. The other 7 weren't obviously in ror (i.e. I didn't immediately see a matching org when I searched in the ROR website). We should submit all missing PARAT orgs to ROR, maybe after PARAT v2 finally gets out the door. I'm familiar with the process for doing this in bulk from doing it with James, Daniel, and Zach on another dataset so happy to do it, or support if you'd like to do it.
Regardless, I think we should either commit to a process that matches each org to ROR ids in the near-term and going forward, or switch back to doing regular expression search over all orgs regardless of whether they have ROR ids.
I assume you're not able to support further ROR matching, so I will discuss that with Zach. In the meantime, would you be able to take a look at why we're failing to match some of these orgs that were apparently in ROR in 2021, just in case it's symptomatic of some other issue? Once I confirm that we're not going to have comprehensive ROR matches in the near term I'll switch the query over to matching all orgs and can ping you again to review once the data is updated.
β Reply to this email directly, view it on GitHub https://github.com/georgetown-cset/parat/pull/315#issuecomment-2096766070, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOAZR6ZVNIBZFVUBA4KQPNLZA7LQTAVCNFSM6AAAAABHCRNX2SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJWG43DMMBXGA . You are receiving this because you were mentioned.Message ID: @.***>
Yeah I should have been more specific and said "that have publications"
Anyway, Zach's annotators haven't been adding RORs, so I think we should just fall back to doing regex search for all orgs if we don't have a way of keeping the ROR ids up to date right now. I do think it would still be a good thing to figure out why Synopsys etc don't have RORs though, mostly to make sure it isn't symptomatic of something else. But if you approve @rggelles I'll just switch over to doing regex search on everything
There does appear to be some linkage issue, because West Pharmaceutical (the example you linked) again has GRID in our annotation table. Just like Synopsys. I'd guess this is a problem with the GRID-ROR linkage code; I can look into this. But there's no annotation issue here; all of these have been annotated fully, and ones that are lacking codes in our annotation table almost definitely either don't have codes or had codes added in the past 1.5 years.
As @za158 says, many many PARAT orgs won't have codes, and ROR won't take them. They don't have any paper output, so they're not ROR orgs. They're still interesting to us though; they have patents, AI workers, etc. That's one of the primary reasons we moved to include regex in the first place.
Gotcha - but fyi Zach had the good idea to just count the number of orgs with pubs but no rors and it's 845 from SELECT count(0) FROM ai_companies_visualization.all_visualization_data where array_length(ror_id) = 0 and all_pubs > 0
That's presumably a combination of the ROR circa 1.5 years ago coverage issue + the Synopsis issue + current ROR coverage issues in some unknown proportions. Thanks for looking at the Synopsis issue. In the meantime, I'll switch over to doing regex searches for all orgs
Okay, to fix the Synopsys issue, what needs to be done is the following:
Go here and add a TRIM() around grid_joined.grid
.
This will solve the fact that some of the GRID annotation fields apparently must have ended up with spaces in them π€¦π»ββοΈ
Ok thanks - I made that change and also run regex on everything now. select count(0) from staging_ai_companies_visualization.ai_company_papers
= 142,992 which makes more sense since the number of papers increased since this was last run. @rggelles can you take the review from here?
Closes #310
Still needed:
@rggelles this is the continuation of https://github.com/georgetown-cset/parat/pull/304. It still needs some polish, but I think this is a good time to start to get your input. The basic idea of what I'm doing here is replacing
get_ai_counts.py
and its callers with a series of sql queries that use the neworg_name_matches
table to link org names to CSET ids.I want to make sure this method makes sense to both of us. If you're willing, I think a good place to start would be understanding this count discrepancy: https://github.com/georgetown-cset/parat/pull/315/files#r1587841664
If you could just take a look at that query and confirm that the method makes sense to you, I can run the pipeline up to the production copy stage and come back with a longer writeup on diffs between old and new counts.