georgetown-cset / parat

🦜 PARAT: CSET's Private-sector AI-Related Activity Tracker
https://parat.cset.tech

Steep drops in publication counts into 2022 #359

Closed za158 closed 2 weeks ago

za158 commented 1 month ago

According to their detail views, every one of the "top ten" PARAT companies in the default view (https://parat-dev.eto.tech/) posted a decline in AI publications from 2021 to 2022, often an extremely steep one. This does not seem likely.

Examples: [three screenshots omitted]

cc @rggelles and (given prior experience) @jamesdunham

jmelot commented 1 month ago

I checked a couple of these companies and the big drop was present on 3/6, before my changes to the PARAT pipelines and merged corpus were merged. Presumably this is related to https://github.com/georgetown-cset/cset_article_schema/issues/141. I'll chat with James about it again when he gets back.

rggelles commented 1 month ago

Yeah, as @jmelot says, I don't think this is a PARAT-specific issue; it's likely related to the remaining apparent drop in US AI publications that was still outstanding in https://github.com/georgetown-cset/cset_article_schema/issues/141.

jmelot commented 1 month ago

Notes to self: I checked Alibaba, hackily.

All pubs have a 2021 peak:

select
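  -- count distinct merged papers per year for Alibaba-matching affiliation strings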
  year,
  count(distinct(merged_id)) as n
from 
  literature.affiliations
left join
  literature.papers
using (merged_id)
where lower(org_name) like "%alibaba%"
group by year
order by year desc

[results chart omitted]

AI pubs have a 2021 peak:

select
  papers.year,
  count(distinct(merged_id)) as n
from 
  literature.affiliations
left join
  literature.papers
using (merged_id)
left join
  article_classification.predictions
using (merged_id)
where (lower(org_name) like "%alibaba%") and (ai or cv or robotics or nlp)
group by papers.year
order by papers.year desc

[results chart omitted]

arXiv metadata has hardly anything

select
  extract(year from created) as year,
  count(distinct(id)) as n
from 
  gcp_cset_arxiv_metadata.arxiv_metadata_latest
cross join unnest(authors.author) as an_author
cross join unnest(affiliation) as an_affiliation
where lower(an_affiliation) like "%alibaba%"
group by year
order by year desc

[results chart omitted]

arXiv fulltext, though, shows growth in mentions (not necessarily affiliations but it's certainly suggestive)

select
  extract(year from created) as year,
  count(distinct(id)) as n
from 
  gcp_cset_arxiv_metadata.arxiv_metadata_latest
left join
  gcp_cset_arxiv_full_text.fulltext
using(id)
where lower(original_text) like "%alibaba%"
group by year
order by year desc

[results chart omitted]

I'll check whether this holds for a few more companies. It could be that people are not publishing in AI outside arXiv as much as they once did, that there's more delay than we thought between research and publication in a formal venue, or something else.

It also suggests poor affiliation coverage in arXiv metadata -- no big surprise; it looks like only about 7.2% of pubs have any author affiliations in the metadata.

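A sketch of the kind of query behind that figure (a minimal version; it returns the overall share rather than a per-year breakdown, and assumes affiliations live in the repeated authors.author.affiliation field, as in the metadata query above):

select
  -- share of arXiv papers with at least one author affiliation in the metadata
  countif(exists(
    select 1
    from unnest(authors.author) as an_author
    where array_length(an_author.affiliation) > 0)) / count(*) as share_with_affiliations
from
  gcp_cset_arxiv_metadata.arxiv_metadata_latest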

jmelot commented 1 month ago

I see something similar for google (although there the apparent peak in AI papers is way back in 2020 if you rely only on our affiliations + classifier). Here are the arXiv fulltext counts; note that "google" is probably unusually likely to occur in other places in the text, so compare to the counts in the comment below.

[results screenshot omitted]

Adding to the DS sync agenda @rggelles @jamesdunham

Btw, same thing for e.g. Adobe

[three results charts omitted]

jmelot commented 1 month ago

Addendum: OA shows an overall publication peak for google in 2021 and an AI peak (according to a hacky keyword-based definition) in 2020. Same for Adobe (both peaks in 2020).

SELECT
  extract(year from publication_date) as year,
  count(distinct(works.id)) as n 
FROM openalex.works
left join unnest(authorships) as authorship
left join unnest(institutions) as institution
where (lower(institution.display_name) like "%google%") or (lower(institution.display_name) like "%alphabet%")
group by year
order by year desc

[results screenshot omitted]

SELECT
  extract(year from publication_date) as year,
  count(distinct(works.id)) as n 
FROM openalex.works
left join unnest(authorships) as authorship
left join unnest(institutions) as institution
where ((lower(institution.display_name) like "%google%") or (lower(institution.display_name) like "%alphabet%")) and (abstract like "%AI%" or lower(abstract) like "%machine learning%" or lower(abstract) like "%artificial intelligence%" or lower(abstract) like "%neural network%" or abstract like "%LLM%")
group by year
order by year desc

[results screenshot omitted]

daniel-chou commented 1 month ago

Per our discussion at the DS Sync on 2024-06-05, @daniel-chou will explore whether more authors now jointly publish AI papers (like the trend in High-Energy Physics). One way to start is to examine trends in the number of authors per AI paper (see the sketch below).

It is possible the "amount of AI research effort" increased in recent years despite the decrease in annual publication output.
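
A starting point might be a query like this (a sketch, assuming literature.authors has one row per author-paper and reusing the classifier's ai flag):

select
  year,
  -- average number of author rows per distinct AI paper
  count(*) / count(distinct merged_id) as avg_authors_per_paper
from
  literature.papers
inner join
  literature.authors
using (merged_id)
inner join
  article_classification.predictions
using (merged_id)
where ai
group by year
order by year desc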

rggelles commented 1 month ago

So, I wanted to experiment with a few companies that aren't quite so prolific. Here's Panasonic, using OpenAlex. As you can see, its peak is over 10 years ago.

[results chart omitted]

And here it is in the full merged corpus:

[results chart omitted]

Here's just AI. You'll notice the peak is now much later, either 2019 or 2022 depending on how you interpret it:

[results chart omitted]

I've excluded arXiv metadata because it barely appears -- just 3 total mentions. But here's arXiv fulltext:

[results chart omitted]

In this one we get a 2023 peak! So the pattern does persist for this somewhat less prolific company.

I'll try one more, even less famous, below, and then do some more experiments.

rggelles commented 1 month ago

Okay, now let's try with Schlumberger.

For OA, we're seeing a total paper peak all the way back in 2015:

[results chart omitted]

For the full merged corpus, it's in 2015 as well:

[results chart omitted]

Focusing on just AI, we see it later, in 2022:

[results chart omitted]

arXiv metadata, again, has nearly no results to speak of, and what's there is largely from the late 1990s. arXiv fulltext, however, has a peak in 2019:

[results chart omitted]

I'm not sure what this says. Even Panasonic runs the risk of being a "relevant" enough company to be mentioned in arXiv papers for reasons other than affiliations. Schlumberger probably doesn't. So this peak is probably genuine, and it's much earlier than the other ones we've seen. I think my next step is to take a look at the 2023 paper fulltext and see where the affiliation mentions actually appear.

rggelles commented 1 month ago

Alright, some quick annotation work here: I annotated 118 of the 125 Panasonic arXiv fulltext papers from 2023, and only 23/118, or about 19%, actually had a true Panasonic affiliation. So I'm pretty skeptical of using the arXiv fulltext method for any analysis of what the expected results here should be.

rggelles commented 1 month ago

When I run the following instead of the original arXiv fulltext query, I get numbers that are closer to accurate:

select
  extract(year from created) as year,
  count(distinct(id)) as n
from 
  gcp_cset_arxiv_metadata.arxiv_metadata_latest
left join
  gcp_cset_arxiv_full_text.fulltext
using(id)
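-- restrict the match to the first 300 characters, where affiliations usually appear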
where lower(SUBSTRING(original_text, 0, 300)) like "%panasonic%"
group by year
order by year desc

[results chart omitted]

They're not perfect, because in some cases affiliations are footnoted and show up at the bottom of the page instead of near the top, but they're quite close.

Trying this with alibaba, we get:

image

So we're still getting a 2023 peak, as with Panasonic. But it's not a particularly significant one. And counts actually go down (slightly) from 2021 to 2022.

What about google?

[results chart omitted]

Again, we do see a peak in 2023! But there's a noticeable drop in 2021 and 2022 before counts finally return, in 2023, to levels only barely above 2020. So we're just not seeing consistency here, which isn't surprising given that we're looking at individual companies, across all their research (not just AI!), over a volatile period of time, in only one research source that has a very particular use.

Basically: I'm just not sure the arXiv numbers are meaningful for telling us about anything other than trends in arXiv.

jmelot commented 1 month ago

Thanks for doing this analysis (and I admire your annotation stamina...). Checking the first segment of the fulltext for mentions was a good idea. I also annotated a few papers for alibaba and noticed a bunch of false positives (mentions in references, research results). When I switched to an email address-based search, 10/10 of the papers I checked had alibaba in the affiliations list.

[results screenshot omitted]
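
The email-based variant was along these lines (a sketch; the exact pattern, e.g. "%@alibaba%" vs. a full domain, is an assumption):

select
  extract(year from created) as year,
  count(distinct(id)) as n
from
  gcp_cset_arxiv_metadata.arxiv_metadata_latest
left join
  gcp_cset_arxiv_full_text.fulltext
using(id)
-- match author email domains rather than arbitrary text mentions
where lower(original_text) like "%@alibaba%"
group by year
order by year desc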

I tried the same thing with google, and 5/5 of the papers I checked had google affiliations. If anything, these may be undercounts, but they should be much higher-precision.

[results screenshot omitted]

Anyway, as you say, these are counts of all papers in arXiv. We could easily filter to AI, though: search the fulltext for key terms or mentions of cs.AI, or join to the predictions table. I think that's still worth doing.

rggelles commented 1 month ago

Okay, I took a look at arXiv fulltext AI counts, using your email method and joining to the predictions table.
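
Roughly, the combined query looks like this (a sketch: the email pattern is an assumption, and literature.sources as an arXiv-id-to-merged_id mapping is a purely hypothetical stand-in for whatever linking table we actually use):

select
  extract(year from created) as year,
  count(distinct(meta.id)) as n
from
  gcp_cset_arxiv_metadata.arxiv_metadata_latest meta
left join
  gcp_cset_arxiv_full_text.fulltext
using(id)
left join
  literature.sources src  -- hypothetical linking table from arXiv ids to merged_ids
on meta.id = src.orig_id
left join
  article_classification.predictions
using (merged_id)
where lower(original_text) like "%@alibaba%"
  and (ai or cv or robotics or nlp)
group by year
order by year desc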

Here's Alibaba:

[results chart omitted]

and Google:

[results chart omitted]

I tried Panasonic with this method and got zero papers; from a quick scan of the ones in the previous list, my guess is this is some combo of a lot of Panasonic's arXiv papers not being AI and a lot of Panasonic's authors not including their emails for whatever reason.

Still, this is interesting: Alibaba still has a peak in 2023, but Google doesn't (which is also true in your query above, even when we don't narrow to AI-only). Instead, it peaks in 2020.

jmelot commented 1 month ago

Hmmmm that is very interesting! I'll try this with a couple more of the top few companies in parat after the DT meeting...

rggelles commented 1 month ago

I messed up the Panasonic query (my fault for rushing; I had a delivery person I was dealing with).

So here it is again. Back to a 2023 peak, at...3 papers more than in 2022:

[results chart omitted]

And here's Nvidia, which has a more pronounced peak in 2023:

[results chart omitted]

Bosch also peaks in 2023, but with very small counts, despite an apparently high number of total AI papers:

[results chart omitted]

Okay, I think my next step is to look at what percentage of AI papers are on arXiv by year. @jmelot Do you know which of our sources incorporate arXiv in their ingest? I'd ideally also like to look at what percentage of AI papers are solely on arXiv by year, but if most of our sources include all of arXiv in them anyway that will be rather difficult.

rggelles commented 1 month ago

So this is our arXiv percentage for the whole merged corpus:

[results chart omitted]
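
For reference, a sketch of this computation (literature.sources with a dataset column is a hypothetical stand-in for our actual source-mapping table):

select
  p.year,
  -- share of merged papers that appear in arXiv at all
  count(distinct if(s.dataset = "arxiv", p.merged_id, null)) / count(distinct p.merged_id) as arxiv_share
from
  literature.papers p
left join
  literature.sources s  -- hypothetical merged_id -> source dataset mapping
on p.merged_id = s.merged_id
group by p.year
order by p.year desc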

I think the two notable things here are:

1. The percentage is far, far higher for 2024 than for previous years. This tells us that yes, our 2024 numbers may very well be off: papers likely just haven't filled in yet from more conventional venues, even half a year in. This isn't certain, since we're not looking at arXiv-exclusive papers here, just papers that appear in arXiv at all, but given the extreme jump it seems likely.
2. There has been a gradual increase in the arXiv percentage of the corpus. Across all of the literature, more of the papers we're seeing are published at some point on arXiv. That doesn't mean they're not later published elsewhere, but they do show up in arXiv.

Let's check this with AI in particular.

[results chart omitted]

Here, our progression isn't nearly as steady. arXiv is a greater percentage of the overall AI corpus in 2023 than it was in 2022 or 2021, or even 2020. But that percentage dropped from 2020 to 2021, and didn't rebound until 2023. In fact, this looks more like the Google progression we've seen in some of our previous tables.

rggelles commented 4 weeks ago

I wanted to take a look at these percentages for some datasets that aren't arXiv, to see how they fluctuate year-to-year and if that could be affecting things.

Here's WOS overall:

[results chart omitted]

And WOS AI-only:

[results chart omitted]

You'll notice in both cases a pretty steep drop in the percentage of our corpus that comes from WOS in 2023 vs. in 2021 and 2022. In the first case, the 2021 and 2022 numbers are big rises from 2020 and before, but for AI, they seem pretty standard, so the 2023 drop is a major outlier. This might be one reason our 2023 numbers look so different; maybe we're relying on WOS in some way (e.g. maybe they do better at filling in affiliations?) and their data takes longer to fill in for recent years?

Following up on this, let's look at the percentage of papers that have affiliations in WOS by year:

[results chart omitted]
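
Same sketch idea for this one, with the same hypothetical literature.sources stand-in:

select
  p.year,
  -- share of WOS papers with at least one affiliation row
  count(distinct a.merged_id) / count(distinct p.merged_id) as share_with_affiliations
from
  literature.papers p
inner join
  literature.sources s  -- hypothetical merged_id -> source dataset mapping
on p.merged_id = s.merged_id
left join
  literature.affiliations a
on p.merged_id = a.merged_id
where s.dataset = "wos"
group by p.year
order by p.year desc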

Let's compare this to the total merged corpus:

[results chart omitted]

Okay, yes: WOS has much, much higher affiliation percentages than the total merged corpus. So a steep drop in the percentage of our dataset coming from WOS in 2023 would make a big difference in how many affiliations we're getting for that year.

The question is: what's the cause of the WOS drop? Do their papers just have a longer delay in appearing? Are they providing fewer total papers overall? I think to track this down, we need to look at the most recent WOS data deliveries and see what proportion of the "new" papers they provide are from each of the various publication years. That is, are they mostly providing us with 2024 papers, given that it's now 2024? Or are we still getting a lot of 2023, and maybe even some 2022, papers?

A delayed delivery is the easiest explanation. If it's not that, then we may simply be looking at a higher proportion of our dataset coming from providers that are less good at adding affiliation data over time, which is something we'd have to deal with.

jmelot commented 3 weeks ago

First, thank you for looking into this! I really appreciate the help tracking this down.

I think you're onto something with WOS. They are mostly providing us with 2024 papers - in fact more so than, say, OA over a roughly equivalent time interval:

[two screenshots omitted]

But WOS is also still providing a higher fraction of 2023 papers:

[two screenshots omitted]
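
The shape of the query behind these counts, as a sketch (wos.raw_papers and its ingest_date column are purely hypothetical stand-ins for the actual delivery tables):

select
  publication_year,
  count(*) as n_new_papers
from
  wos.raw_papers  -- hypothetical delivery table
where ingest_date >= "2024-05-01"  -- hypothetical window for "recent" deliveries
group by publication_year
order by publication_year desc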

"The WOS process is slower but results in more affiliations" seems like a reasonable guess about what's happening. If no one has anything further they want to look at here, we can push the interval we mark incomplete back a year in parat and anywhere else we're displaying graphs based on affiliations.

@jamesdunham ?

daniel-chou commented 3 weeks ago

Showing prelim results for Alibaba AI output (paper as the unit of analysis), which peaked in 2021:

[results screenshot omitted]

WITH
  ai_papers AS (
  SELECT
    DISTINCT merged_id
  FROM
    article_classification.predictions
  WHERE
    ai_filtered),
  alibaba_papers AS (
  SELECT
    DISTINCT merged_id
  FROM
    literature.affiliations
  WHERE
    LOWER(org_name) LIKE "%alibaba%"
    OR org_name LIKE "%阿里巴巴%"
    OR LOWER(org_name) LIKE "%alicloud%"
    OR org_name LIKE "%阿里云%"
    OR org_name LIKE "%DAMO%"
    OR org_name LIKE "%达摩%" )
SELECT
  year,
  COUNT(DISTINCT merged_id) AS ai_papers
FROM
  literature.papers
INNER JOIN
  ai_papers
USING
  (merged_id)
INNER JOIN
  alibaba_papers
USING
  (merged_id)
WHERE
  year > 2010
  AND year < EXTRACT(YEAR
  FROM
    CURRENT_DATE())
  AND (doctype = "Journal"
    OR doctype = "Conference"
    OR doctype = "Preprint")
GROUP BY
  year
ORDER BY
  year DESC

daniel-chou commented 3 weeks ago

Author-based prelim results for Alibaba AI output (author-paper as the unit of analysis) also peaked in 2021:

[results screenshot omitted]

Note the standardization/cleanup of author names in the query below.

WITH
  ai_papers AS (
  SELECT
    DISTINCT merged_id
  FROM
    article_classification.predictions
  WHERE
    ai_filtered),
  alibaba_papers AS (
  SELECT
    DISTINCT merged_id
  FROM
    literature.affiliations
  WHERE
    LOWER(org_name) LIKE "%alibaba%"
    OR org_name LIKE "%阿里巴巴%"
    OR LOWER(org_name) LIKE "%alicloud%"
    OR org_name LIKE "%阿里云%"
    OR org_name LIKE "%DAMO%"
    OR org_name LIKE "%达摩%" ),
  alibaba_authors_ai AS (
  SELECT
    DISTINCT year,
    merged_id,
    -- In Merged Corpus, the Chinese name "Wang, Hao" appears as "Wang, Hao", "Hao Wang", and "Hao WANG" for the same paper, so we need to deduplicate
    -- Extract string before the first comma as the surname and after the first comma as the given name
    -- TODO: suffixes and variants with and without with middle initials in the same record
    INITCAP(LOWER(REPLACE(TRIM(CONCAT(TRIM(COALESCE(REGEXP_EXTRACT(author_name, "(.+?), "), author_name)), " ", TRIM(COALESCE(REGEXP_EXTRACT(author_name, ".+?, (.*)"), "")))), "  ", " "))) AS author_name
  FROM
    literature.papers
  INNER JOIN
    literature.authors
  USING
    (merged_id)
  INNER JOIN
    ai_papers
  USING
    (merged_id)
  INNER JOIN
    alibaba_papers
  USING
    (merged_id)
  WHERE
    year > 2010
    AND year < EXTRACT(YEAR
    FROM
      CURRENT_DATE())
    AND (doctype = "Journal"
      OR doctype = "Conference"
      OR doctype = "Preprint") )
SELECT
  -- DON'T use DISTINCT because we're counting author-papers, not just papers
  year,
  COUNT(author_name) AS ai_author_papers
FROM
  alibaba_authors_ai
GROUP BY
  year
ORDER BY
  year DESC

jmelot commented 3 weeks ago

Thanks again, everyone. I'm going to open a PR pushing the incomplete data interval back a year in parat and closing this issue. We can do a little "read out" in the next DS sync as well.

jamesdunham commented 2 weeks ago

I can contribute a few more results here, which are consistent with the conclusions above.

https://docs.google.com/spreadsheets/d/1WNTkXizJ575trk0ek-9DW-fxpKM5Vawg_ZK8jd8blDc/edit?gid=665584497#gid=665584497

jmelot commented 2 weeks ago

Closing after discussion in DS sync