Closed za158 closed 2 weeks ago
I checked a couple of these companies and the big drop was present on 3/6, before my changes to the PARAT pipelines and merged corpus were merged. Presumably this is related to https://github.com/georgetown-cset/cset_article_schema/issues/141 . I'll chat with James about it again when he gets back.
Yeah, as @jmelot says, I don't think this is a PARAT-specific issue; it's likely related to the remaining apparent drop in US AI publications that was still outstanding in https://github.com/georgetown-cset/cset_article_schema/issues/141.
Notes to self: I checked Alibaba, hackily.

All pubs have a 2021 peak:
```sql
select
  year,
  count(distinct(merged_id)) as n
from
  literature.affiliations
left join
  literature.papers
using (merged_id)
where lower(org_name) like "%alibaba%"
group by year
order by year desc
```
AI pubs have a 2021 peak:
```sql
select
  papers.year,
  count(distinct(merged_id)) as n
from
  literature.affiliations
left join
  literature.papers
using (merged_id)
left join
  article_classification.predictions
using (merged_id)
where (lower(org_name) like "%alibaba%") and (ai or cv or robotics or nlp)
group by papers.year
order by papers.year desc
```
arXiv metadata has hardly anything:
```sql
select
  extract(year from created) as year,
  count(distinct(id)) as n
from
  gcp_cset_arxiv_metadata.arxiv_metadata_latest
cross join unnest(authors.author) as an_author
cross join unnest(an_author.affiliation) as an_affiliation
where lower(an_affiliation) like "%alibaba%"
group by year
order by year desc
```
arXiv fulltext, though, shows growth in mentions (not necessarily affiliations, but it's certainly suggestive):
```sql
select
  extract(year from created) as year,
  count(distinct(id)) as n
from
  gcp_cset_arxiv_metadata.arxiv_metadata_latest
left join
  gcp_cset_arxiv_full_text.fulltext
using (id)
where lower(original_text) like "%alibaba%"
group by year
order by year desc
```
I'll check to see whether this holds for a few more companies. It could be that people are not publishing outside arXiv in AI as much as they once did, that there's more delay than we thought between research and publication in a formal venue, or something else.
It also suggests poor affiliation coverage in arXiv metadata -- no big surprise; it looks like only about 7.2% of pubs have affiliations in the metadata.
I see something similar for Google (although there the apparent peak in AI papers is way back in 2020 if you rely only on our affiliations + classifier). Here are the arXiv fulltext counts (although "google" is probably unusually likely to occur in other places in the text; compare to counts in the comment below).
Adding to the DS sync agenda @rggelles @jamesdunham
Btw, same thing for e.g. Adobe
Addendum: OA shows an overall publication peak for Google in 2021 and an AI peak (according to a hacky keyword-based definition) in 2020. Same for Adobe (both in 2020):
```sql
SELECT
  extract(year from publication_date) as year,
  count(distinct(works.id)) as n
FROM openalex.works
left join unnest(authorships) as authorship
left join unnest(authorship.institutions) as institution
where (lower(institution.display_name) like "%google%") or (lower(institution.display_name) like "%alphabet%")
group by year
order by year desc
```
```sql
SELECT
  extract(year from publication_date) as year,
  count(distinct(works.id)) as n
FROM openalex.works
left join unnest(authorships) as authorship
left join unnest(authorship.institutions) as institution
where ((lower(institution.display_name) like "%google%") or (lower(institution.display_name) like "%alphabet%"))
  and (abstract like "%AI%" or lower(abstract) like "%machine learning%" or lower(abstract) like "%artificial intelligence%" or lower(abstract) like "%neural network%" or abstract like "%LLM%")
group by year
order by year desc
```
Per our discussion at the DS Sync on 2024-06-05, @daniel-chou will explore whether more authors jointly publish AI papers (like the trend in High-Energy Physics). One way to start is examining trends in the number of authors per AI paper.
It is possible the "amount of AI research effort" increased in recent years despite the decrease in annual publication output.
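That starting point could be sketched as follows; a minimal illustration in Python, with made-up rows standing in for an export of `literature.authors` joined to `literature.papers`:

```python
from collections import defaultdict

def authors_per_paper_by_year(rows):
    """rows: iterable of (year, merged_id, author_name) tuples.
    Returns {year: mean number of distinct authors per paper}."""
    papers = defaultdict(set)  # (year, merged_id) -> set of author names
    for year, merged_id, author in rows:
        papers[(year, merged_id)].add(author)
    totals = defaultdict(lambda: [0, 0])  # year -> [author total, paper count]
    for (year, _), authors in papers.items():
        totals[year][0] += len(authors)
        totals[year][1] += 1
    return {year: s / n for year, (s, n) in totals.items()}

# made-up example: two 2021 papers (2 and 1 authors), one 2022 paper (3 authors)
rows = [
    (2021, "m1", "Hao Wang"), (2021, "m1", "Li Na"),
    (2021, "m2", "Jane Doe"),
    (2022, "m3", "A"), (2022, "m3", "B"), (2022, "m3", "C"),
]
print(authors_per_paper_by_year(rows))  # {2021: 1.5, 2022: 3.0}
```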
So, I wanted to experiment with a few companies that aren't quite so prolific. Here's Panasonic, using OpenAlex. As you can see, its peak is over 10 years ago.
And here it is in the full merged corpus:
Here's just AI. You'll notice the peak is now much later, either 2019 or 2022 depending on how you interpret it:
I've excluded arXiv metadata because it barely appears -- just 3 total mentions. But here's arXiv fulltext:
In this one we get a 2023 peak! So the pattern does persist with the somewhat less popular company.
I'll try one more, even less famous, below, and then do some more experiments.
Okay, now let's try with Schlumberger.
For OA, we're seeing a total paper peak all the way back in 2015:
For the full merged corpus, it's in 2015 as well:
Focusing on just AI, we see it later, in 2022:
arXiv metadata, again, has nearly no results to speak of, and what's there is largely from the late 1990s. arXiv fulltext, however, has a peak in 2019:
I'm not sure what this says. Even Panasonic runs the risk of being a "relevant" enough company to be mentioned in arXiv papers for reasons other than affiliations. Schlumberger probably doesn't. So this peak is probably natural, and it's much earlier than the other ones we've seen. As a next step, I'm going to take a look at the 2023 paper fulltext and see where the actual appearances of the affiliations are showing up.
Alright, some quick annotation work here: I annotated 118/125 of the Panasonic arXiv fulltext papers from 2023, and only 23/118, or 19%, actually had a true Panasonic affiliation. So I am pretty skeptical of using the arXiv fulltext method for any analysis whatsoever on what the expected results here should be.
When I run the following instead of the original arXiv fulltext query, I get numbers that are closer to accurate:
```sql
select
  extract(year from created) as year,
  count(distinct(id)) as n
from
  gcp_cset_arxiv_metadata.arxiv_metadata_latest
left join
  gcp_cset_arxiv_full_text.fulltext
using (id)
where lower(SUBSTRING(original_text, 0, 300)) like "%panasonic%"
group by year
order by year desc
```
They're not perfect, because in some cases affiliations are footnoted and show up at the bottom of the page instead of near the top, but they're quite close.
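The heuristic in that query, restricting the match to the start of the fulltext where author/affiliation blocks usually live, can be sketched in Python like so (function name and window size are illustrative):

```python
def mentions_in_header(fulltext: str, company: str, window: int = 300) -> bool:
    """Count a company mention as a likely affiliation only if it appears
    in the first `window` characters of the fulltext, where the author and
    affiliation block usually lives."""
    return company.lower() in fulltext[:window].lower()

# affiliation near the top: counted
paper = "Jane Doe (Panasonic Research)\nAbstract: ...\n" + "x" * 1000
print(mentions_in_header(paper, "Panasonic"))  # True

# mention buried deep in the body (e.g. a citation): not counted
body_only = "x" * 400 + " as shown by Panasonic et al."
print(mentions_in_header(body_only, "Panasonic"))  # False
```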
Trying this with alibaba, we get:
So we're still getting a 2023 peak, as with Panasonic. But it's not a particularly significant one. And counts actually go down (slightly) from 2021 to 2022.
What about google?
Again, we do see a peak in 2023! But before that we see a noticeable drop in 2021 and 2022, before finally returning to counts only barely above 2020 in 2023. So we're just not seeing consistency here, which isn't surprising given that we're looking at individual companies, across all their research (not just AI!) over a volatile period of time, in only one research source that has a very particular use.
Basically: I'm just not sure the arXiv numbers are meaningful for telling us about anything other than trends in arXiv.
Thanks for doing this analysis (and I admire your annotation stamina...). Checking the first segment of the FT for mentions was a good idea. I also annotated a few papers for alibaba and noticed that there were a bunch of false positives (mentions in references, research results). When I switched to an email-address-based search, 10/10 of the papers I checked had alibaba in the affiliations list.
I tried the same thing with google, and 5/5 of the ones I checked had google affiliations. If anything, these may be undercounts, but they should be much higher-precision.
Anyway, as you say, these are counts of all papers in arXiv. We could easily either search the fulltext for key terms, mentions of cs.AI, or join to the predictions table to filter to AI though. I think that's still worth doing
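A minimal sketch of that email-based filter (the domain list and function name here are assumptions for illustration, not a vetted mapping):

```python
import re

# assumed corporate email domains, for illustration only
COMPANY_DOMAINS = {
    "alibaba": ("alibaba-inc.com", "alibabagroup.com"),
    "google": ("google.com",),
}

EMAIL_RE = re.compile(r"[\w.+-]+@([\w.-]+)")

def header_email_match(fulltext: str, company: str, window: int = 300) -> bool:
    """True if an email address in the first `window` characters of the
    fulltext uses one of the company's domains -- higher precision than a
    bare company-name match."""
    for m in EMAIL_RE.finditer(fulltext[:window]):
        if m.group(1).lower().endswith(COMPANY_DOMAINS[company]):
            return True
    return False

print(header_email_match("Hao Wang <hao.w@alibaba-inc.com>", "alibaba"))   # True
print(header_email_match("we thank alibaba; contact: a@example.com", "alibaba"))  # False
```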
Okay, I took a look at arXiv fulltext AI, using your email method and joining to the predictions table.
Here's Alibaba:
and Google:
I tried Panasonic with this method and got zero papers; from a quick scan of the ones in the previous list, my guess is this is some combo of a lot of Panasonic's arXiv papers not being AI and a lot of Panasonic's authors not including their emails for whatever reason.
Still, this is interesting: Alibaba still has a peak in 2023, but Google doesn't (which is also true in your query above, even when we don't narrow to AI-only.) Instead, it peaks in 2020.
Hmmmm that is very interesting! I'll try this with a couple more of the top few companies in parat after the DT meeting...
I messed up the Panasonic query (my fault for rushing; I had a delivery person I was dealing with).

So that's here. Back to a 2023 peak, at...3 papers more than 2022:
And here's Nvidia, which has a more apparent peak in 2023:
Bosch also has a peak in 2023, but with very small counts, despite an apparently high number of total AI papers:
Okay, I think my next step is to look at what percentage of AI papers are on arXiv by year. @jmelot Do you know which of our sources incorporate arXiv in their ingest? I'd ideally also like to look at what percentage of AI papers are solely on arXiv by year, but if most of our sources include all of arXiv in them anyway that will be rather difficult.
So this is our arXiv percentage for the whole merged corpus:
I think the two notable things here are 1. that it's far, far higher for 2024 than for previous years. This tells us that yes, our 2024 numbers may very well be off -- it's likely papers simply haven't filled in yet from more normal venues, even half a year in. This isn't certain, since we're not looking at exclusive arXiv papers here, just papers that appear in arXiv at all, but given the extreme jump it seems likely. And 2. there has been a gradual increase in the arXiv percentage in the corpus. Across all of the literature, more of the papers we're seeing are being published at some point on arXiv. That doesn't mean they're not later published elsewhere, but they do show up in arXiv.
Let's check this with AI in particular.
Here, our progression isn't nearly as steady. arXiv is a greater percentage of the overall AI corpus in 2023 than it was in 2022 or 2021, or even 2020. But that percentage dropped from 2020 to 2021, and didn't rebound until 2023. In fact, this looks more like the Google progression we've seen in some of our previous tables.
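The share computation behind these tables can be sketched generically (a made-up row shape, assuming each merged paper carries the set of sources it appears in):

```python
from collections import defaultdict

def source_share_by_year(rows, source="arxiv"):
    """rows: iterable of (year, merged_id, sources) tuples, where sources is
    the set of datasets a merged paper appears in. Returns the share of each
    year's papers that appear in `source` at all (not exclusively)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for year, _merged_id, sources in rows:
        totals[year] += 1
        if source in sources:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# made-up example rows
rows = [
    (2023, "m1", {"arxiv", "wos"}),
    (2023, "m2", {"wos"}),
    (2024, "m3", {"arxiv"}),
]
print(source_share_by_year(rows))  # {2023: 0.5, 2024: 1.0}
```

The same function covers the WOS shares below by passing `source="wos"`.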
I wanted to take a look at these percentages for some datasets that aren't arXiv, to see how they fluctuate year-to-year and if that could be affecting things.
Here's WOS overall:
And WOS AI-only:
You'll notice in both cases a pretty steep drop in the percentage of our corpus that comes from WOS in 2023 vs. in 2021 and 2022. In the first case, the 2021 and 2022 numbers are big rises from 2020 and before, but for AI, they seem pretty standard, so the 2023 drop is a major outlier. This might be one reason our 2023 numbers look so different; maybe we're relying on WOS in some way (e.g. maybe they do better at filling in affiliations?) and their data takes longer to fill in for recent years?
Following up on this, let's look at the percentage of papers that have affiliations in WOS by year:
Let's compare this to the total merged corpus:
Okay, yes, WOS has much much higher affiliation percentages than the total merged corpus. So a steep drop in the percentage of our dataset coming from WOS in 2023 would make a big difference in how many affiliations we're getting for that year.
The question is: what's the cause of the WOS drop? Do their papers just have a longer delay in appearing? Are they providing fewer total papers overall? I think to track this down, we need to look at the most recent WOS data deliveries and see what proportion of their "new" papers are from each of the various publication years. That is, are they mostly providing us with 2024 papers, given that it's now 2024? Or are we still getting a lot of 2023, and maybe even some 2022, papers?
A delayed delivery is the easiest explanation. If it's not that, then we may be looking at just a higher proportion of our dataset coming from providers that are less good at adding affiliation data over time, which is just something we'd have to deal with.
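One way to check the delivery-lag hypothesis, sketched with made-up batch data:

```python
from collections import Counter

def delivery_year_profile(pub_years):
    """Distribution of publication years within one delivery batch.
    If WOS deliveries are lagged, recent batches should still carry a
    substantial tail of prior-year papers."""
    counts = Counter(pub_years)
    total = len(pub_years)
    return {year: counts[year] / total for year in sorted(counts)}

# made-up batch: mostly current-year papers, with a tail of older ones
batch = [2024] * 70 + [2023] * 25 + [2022] * 5
print(delivery_year_profile(batch))  # {2022: 0.05, 2023: 0.25, 2024: 0.7}
```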
First, thank you for looking into this! I really appreciate the help tracking this down.
I think you're onto something with the WOS. They are mostly providing us with 2024 papers -- in fact more so than, say, OA over a roughly equivalent time interval.
But WOS is also providing a higher fraction of 2023 papers still
"The WOS process is slower but results in more affiliations" seems like a reasonable guess about what's happening. If no one has anything further they want to look at here, we can push the interval we mark incomplete back a year in parat and anywhere else we're displaying graphs based on affiliations.
@jamesdunham ?
Showing prelim results with Alibaba AI output (paper is analysis unit) that peaked in 2021:
```sql
WITH
  ai_papers AS (
    SELECT DISTINCT merged_id
    FROM article_classification.predictions
    WHERE ai_filtered),
  alibaba_papers AS (
    SELECT DISTINCT merged_id
    FROM literature.affiliations
    WHERE LOWER(org_name) LIKE "%alibaba%"
      OR org_name LIKE "%阿里巴巴%"
      OR LOWER(org_name) LIKE "%alicloud%"
      OR org_name LIKE "%阿里云%"
      OR org_name LIKE "%DAMO%"
      OR org_name LIKE "%达摩%")
SELECT
  year,
  COUNT(DISTINCT merged_id) AS ai_papers
FROM literature.papers
INNER JOIN ai_papers USING (merged_id)
INNER JOIN alibaba_papers USING (merged_id)
WHERE year > 2010
  AND year < EXTRACT(YEAR FROM CURRENT_DATE())
  AND (doctype = "Journal" OR doctype = "Conference" OR doctype = "Preprint")
GROUP BY year
ORDER BY year DESC
```
Author-based prelim results with Alibaba AI output (author-paper is analysis unit) also peaked in 2021:
Note the standardization cleanup of author names.
```sql
WITH
  ai_papers AS (
    SELECT DISTINCT merged_id
    FROM article_classification.predictions
    WHERE ai_filtered),
  alibaba_papers AS (
    SELECT DISTINCT merged_id
    FROM literature.affiliations
    WHERE LOWER(org_name) LIKE "%alibaba%"
      OR org_name LIKE "%阿里巴巴%"
      OR LOWER(org_name) LIKE "%alicloud%"
      OR org_name LIKE "%阿里云%"
      OR org_name LIKE "%DAMO%"
      OR org_name LIKE "%达摩%"),
  alibaba_authors_ai AS (
    SELECT DISTINCT
      year,
      merged_id,
      -- In the merged corpus, the Chinese name "Wang, Hao" appears as "Wang, Hao", "Hao Wang", and "Hao WANG" for the same paper, so we need to deduplicate
      -- Extract the string before the first comma as the surname and after the first comma as the given name
      -- TODO: suffixes, and variants with and without middle initials in the same record
      INITCAP(LOWER(REPLACE(TRIM(CONCAT(TRIM(COALESCE(REGEXP_EXTRACT(author_name, "(.+?), "), author_name)), " ", TRIM(COALESCE(REGEXP_EXTRACT(author_name, ".+?, (.*)"), "")))), "  ", " "))) AS author_name
    FROM literature.papers
    INNER JOIN literature.authors USING (merged_id)
    INNER JOIN ai_papers USING (merged_id)
    INNER JOIN alibaba_papers USING (merged_id)
    WHERE year > 2010
      AND year < EXTRACT(YEAR FROM CURRENT_DATE())
      AND (doctype = "Journal" OR doctype = "Conference" OR doctype = "Preprint"))
SELECT
  -- DON'T use DISTINCT because we're counting author-paper pairs, not just papers
  year,
  COUNT(author_name) AS ai_author_papers
FROM alibaba_authors_ai
GROUP BY year
ORDER BY year DESC
```
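For spot-checking results, the name normalization in the query above can be mirrored in Python (a rough equivalent, not byte-for-byte identical to `INITCAP`):

```python
def normalize_author(name: str) -> str:
    """Rough Python mirror of the query's cleanup: reorder 'Surname, Given'
    to 'Surname Given', collapse whitespace, and title-case the result so
    casing variants like 'Hao WANG' and 'Hao Wang' collapse to one key."""
    if "," in name:
        surname, given = name.split(",", 1)
        name = f"{surname.strip()} {given.strip()}"
    # collapse runs of whitespace, then title-case like INITCAP(LOWER(...))
    return " ".join(name.lower().split()).title()

print(normalize_author("WANG, Hao"))  # Wang Hao
print(normalize_author("Hao  WANG"))  # Hao Wang
```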
Thanks again everyone. I'm going to open a PR pushing the incomplete-data interval back a year in parat and closing this issue. We can do a little "read out" in the next DS sync as well.
I can contribute a few more results here, which are consistent with the conclusions above.
Closing after discussion in DS sync
According to their detail views, every one of the "top ten" PARAT companies according to the default view (https://parat-dev.eto.tech/) posted a decline in AI publications from 2021 to 2022, often an extremely steep decline. This does not seem likely.
Examples
![image](https://github.com/georgetown-cset/parat/assets/58824955/4fe6174f-41e7-4658-8031-21eb5922bb94)
cc @rggelles and (given prior experience) @jamesdunham