Armand1 / Women-in-the-BMJ


women's topics over time #4

Open Armand1 opened 4 years ago

Armand1 commented 4 years ago

I first looked at the percentage of each paper that is about all "women's topics", by year.

[Figure: per-paper percentage devoted to women's topics by year, with beta-regression fit]

We find that in 1948 the average paper allocates 2.7% to women's topics; in 2018, the average paper allocates about 4.7%. The fit shown here is a beta regression, which is suitable for proportions. Here, the increase is 0.008% per year. Of course, a simple model like this does not capture some of the ups and downs. (Note --- re-run with latest topic classification.)
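For reference, a minimal sketch of a beta regression of this kind, assuming a data frame `papers` with a per-paper proportion `prop_women` and a numeric `year` column (these names are assumptions, not the actual script):

```r
library(betareg)

# betareg() needs responses strictly inside (0, 1), so squeeze exact 0s and 1s
# using the usual (y * (n - 1) + 0.5) / n transformation
n <- nrow(papers)
papers$prop_sq <- (papers$prop_women * (n - 1) + 0.5) / n

m_beta <- betareg(prop_sq ~ year, data = papers)
summary(m_beta)   # the year coefficient is the trend on the logit scale
```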

This is a bit more volatile than the incidence of the word "woman/en", as shown here.

Armand1 commented 4 years ago

We can look at the topic data another way: by discretizing the topics. Here, I assumed that a paper was "about women" if the summed probabilities of the women's topics exceeded 0.05. This, too, shows an increase over time. So in 1948 about 10% of papers are about women by this criterion; in 2018 about 14% are. The general pattern is very similar to the plot above, which does not discretize the topics, though the relative increase appears to be smaller. Here, the glm coefficient (odds ratio) is 1.004: the factor by which women's topics increase per year. As the gam shows, however, there are times when the increase is fast (and there is a decrease too).

[Figure: incidence of all women's topics by pubdate]

The glm naturally tests whether women's topics are, in aggregate, increasing faster than all other topics. The p-value for the dec_pub_date estimate is 1.34e-10, which is very highly significant. So we can say with confidence that, taken in aggregate, women's topics are increasing.
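A minimal sketch of this discretized version, assuming a per-paper data frame with a summed women's-topic probability `women_topic_sum` and a decimal publication date `dec_pub_date` (column names are assumptions):

```r
library(dplyr)

papers <- papers %>%
  mutate(about_women = as.integer(women_topic_sum > 0.05))  # 0.05 cutoff from the text

m_glm <- glm(about_women ~ dec_pub_date, data = papers, family = binomial)

summary(m_glm)                     # p-value for the dec_pub_date estimate
exp(coef(m_glm)["dec_pub_date"])   # odds ratio per year (~1.004 in the text)
```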

Armand1 commented 4 years ago

So, we have estimated the annual representation of women in BMJ articles in two ways:

1) estimating the average percentage of times that the words "woman/en" appear among all words in the articles of a given year
2) estimating the average percentage of articles in a given year that are about women, using discretized topics.

Both trends show ups and downs. But if we ignore those and just fit regressions to the overall trend we find that

1) increases by a factor of 1.02 per year
2) increases by a factor of about 1.004 per year

So (2) is roughly an order of magnitude slower. That implies that mentions of women are increasing faster than topics devoted to women --- perhaps women are being mentioned more often in papers that contain no women's topics.

Armand1 commented 4 years ago

We can do the same thing with our topic_classes (i.e., counting topic classes by summing the probabilities of the topics within each topic_class and then discretizing them at 0.05). Here, I have excluded all women's topics from any given topic class. In effect, this treats women as a topic class.
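A minimal sketch of this roll-up, assuming a long data frame `topic_probs` (one row per paper x topic, with columns `paper_id`, `topic`, `topic_class`, `probability`) and a character vector `women_topics` listing the women's topics; all these names are assumptions:

```r
library(dplyr)

class_presence <- topic_probs %>%
  # pull the women's topics out of their original classes and
  # treat "women" as a topic class of its own
  mutate(topic_class = ifelse(topic %in% women_topics, "women", topic_class)) %>%
  group_by(paper_id, topic_class) %>%
  summarise(class_prob = sum(probability), .groups = "drop") %>%
  mutate(present = as.integer(class_prob > 0.05))   # discretize at 0.05
```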

[Figure: incidence of each topic_class by pubdate]

So, for some context: in 1948, about 1% of papers were about clinical_research_trial; in 2018 about 48% were. This represents an increase of 0.057% per year, a rate about an order of magnitude faster than the women's supertopic (0.007). In 1948 about 77% of papers were about clinical_research_case; in 2018 about 7% were. This represents a decrease of 0.06% per year.

Public health, physiotherapy, and addiction all increase; surgery, pathology, neurology, and dermatology all decrease; others show both increases and decreases. In general, the rate at which the women's supertopic increases is in the 72nd percentile.

Note that the count of woman/en increases much faster (99th percentile). That tells us that women are being mentioned ever more frequently in papers that aren't specifically about women's health issues.

The decreases in some of these topic_classes also tell us that women's health topics are not increasing simply because clinically relevant topics are increasing in general. We have successfully filtered out the rubbish papers and rubbish topics.

Armand1 commented 4 years ago

I divided the women's topics into 9 "subclasses". This was to combine, for example, the several breast cancer topics. I then asked, for each paper, whether each subclass was present at the 0.05 cutoff (as above); thus I discretized them. Then I plotted the percentage of papers that were about each subclass by year and fitted a GAM using the beta distribution.
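A minimal sketch of one of these fits, assuming a per-paper presence table `subclass_presence` with columns `year`, `subclass`, and `present` (0/1); the column names and the subclass name below are illustrative assumptions:

```r
library(dplyr)
library(mgcv)

# yearly percentage of papers in which each subclass is present
yearly_sub <- subclass_presence %>%
  group_by(year, subclass) %>%
  summarise(prop = mean(present), .groups = "drop")

# one subclass at a time; betar() needs proportions strictly inside (0, 1)
d <- filter(yearly_sub, subclass == "breast_cancer")
d$prop_sq <- (d$prop * (nrow(d) - 1) + 0.5) / nrow(d)

m_gam <- gam(prop_sq ~ s(year), data = d, family = betar(link = "logit"))
plot(m_gam, shade = TRUE)   # the wiggly trend for this subclass
```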

[Figure: percentage of papers about each women's subclass by year, with beta GAM fits]

This is the core of the paper I think. It shows how the topics increase and decrease over time as the particular concerns come and go --- and can be explained.

Note that we do not see the general increase in women's topics after 2005 that we found a year ago! That was, as we thought, an artifact! Though some subtopic classes do increase: is there still room for a Godlee effect?

Armand1 commented 4 years ago

If we look at monthly averages, rather than yearly, we do see an increase after 2005! It seems that, earlier on, month-to-month variation in the articles tends to flatten out the curves, so the signal is more evident in the yearly aggregation. Later, around 2005, the consistent upward trend shows up more clearly in the monthly aggregation. But the data are exactly the same.

[Figure: monthly incidence of women's topics]

This really does look like a Godlee effect. We need to look at the other topic_classes in the same way. But I have no doubt that many of them will decline or remain unchanged.

Armand1 commented 4 years ago

Here is the same analysis for all the other topic_classes.

[Figure: monthly incidence of the other topic_classes]

You will see that lots of topic_classes increase after 2005! Clinical practice method decreases. The puzzling thing is that this shows up much less clearly in the yearly aggregates than in the monthly ones. It's as if papers about a topic are concentrated in particular months.

I wonder if this is due to the way the data are aggregated (zeros are treated as NA rather than zero) --- and whether there has been a change in the way the papers are published.

Armand1 commented 4 years ago

Yes --- this is due to a monthly aggregation artifact. The BMJ switched at some point to putting out articles individually, or in small groups, rather than in issues. So in the 2000s there are suddenly lots of months that don't have any articles of a given type. In the above analyses these were (inadvertently) excluded. Here they are given zero. Unfortunately, fitting GAMs then does not work very well: the fits get dragged down to zero. I think we have to stick with yearly aggregates, and make sure that all years are represented for all topics.
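A minimal sketch of filling in the missing year x topic_class combinations as zeros rather than dropping them, assuming a per-paper presence table `presence` with columns `year`, `topic_class`, and `present` (names are assumptions):

```r
library(dplyr)
library(tidyr)

yearly_counts <- presence %>%
  group_by(year, topic_class) %>%
  summarise(n_present = sum(present), .groups = "drop") %>%
  # every year appears for every topic_class, with zero where nothing was published
  complete(year, topic_class, fill = list(n_present = 0))
```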

[Figure: full matrix of monthly incidences with zeros included]

Or maybe it can be modeled as binomial based on individual papers.

Armand1 commented 4 years ago

Now let's look at the individual topic classes. It is clear to me that, because of a change in the way papers were published, fitting models to aggregated data (especially by month, perhaps even by year) leads to apparently large increases in many topics in the early 2000s. But this is an artifact. It arises because the number of research papers published per month declines around then (they also get longer). This means that there are many months without a given topic being present at all, and aggregation (unless you're very careful) will score those months as NA rather than zero --- leaving you with spuriously high frequencies for a topic or topic class in the months in which a paper on that topic does appear.

The right way to model this, then, is not to aggregate but rather to do the following. For each paper, we determine whether a given topic class is present or not (1 or 0) at a given pubdate. To get N_present we count, for each pubdate, the number of papers in which a given topic is present. To get N_total we count, for each pubdate, the number of topics present across all papers (so not just the number of papers). Then N_absent = N_total - N_present. We then use a binomial GLM or GAM to estimate the probability that the topic will be present at a given publication date: gam(cbind(N_present, N_absent) ~ pubdate, family = binomial).

Binomial distributions are built for presence-absence data. We can then plot the predictions from these models over the frequency of a topic in a given year, F_topic, just as a sanity check that the predictions are good --- but we are not modelling the aggregated data.

We do GAMs (wiggly, solid lines) and GLMs (smooth, dotted lines). GAMs give us the most temporal detail; GLMs give us the big trend.
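A minimal sketch of this setup for one topic class, assuming a per-paper presence table `presence` with columns `pubdate` (decimal years), `topic_class`, and `present` (0/1); note that for simplicity the denominator here is the number of papers per pubdate rather than the N_total described above:

```r
library(dplyr)
library(mgcv)

counts <- presence %>%
  filter(topic_class == "pregnancy") %>%   # one class at a time; name illustrative
  group_by(pubdate) %>%
  summarise(N_present = sum(present),
            N_absent  = n() - sum(present),
            .groups = "drop")

# GAM for temporal detail (wiggly line), GLM for the overall trend (smooth line)
m_gam <- gam(cbind(N_present, N_absent) ~ s(pubdate), data = counts, family = binomial)
m_glm <- glm(cbind(N_present, N_absent) ~ pubdate,    data = counts, family = binomial)

exp(coef(m_glm)["pubdate"])   # annual factor (odds ratio per year)
```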

The result looks like this for the 49 topic_classes. The first 9, in pink, are women-related.

[Figure: incidence of the 49 topic_classes by pubdate, with GAM and GLM fits]

The wiggly GAMs look overfitted, but they're not: remember, they're based on 72k individual papers. Notice that the y-axes differ between plots, so some topic classes that are wiggling about a lot really aren't changing much: they're generally rare. Considering the women's topics, the big events --- the contraception and abortion spikes of the 1960s --- are there. HRT, osteoporosis and breast cancer clearly take off towards the end of the series. Pregnancy is a funny one: why does it jag up after 2005?

Considering the non-women topic classes, the big story is clearly the rise of clinical research trials, and the decline of clinical_research_cases (bottom right). But various others increase and decrease quite monotonically.

Is there a Godlee effect? Well, it's much less apparent here. It's true that, after about 2005, at least 6 of the 9 women's topic classes show a sharp increase; but they also tend to decline again. And a number of them seem to show a sort of cyclical behaviour, rising and falling --- almost in sync. We'll see if we can investigate this in detail later.

Armand1 commented 4 years ago

Are women's topic classes increasing, as a whole, faster than non-women's topic classes? We can look at the individual GAM coefficients (as we did for words).

Well, the tendency is for them to be somewhat faster. The average women's topic_class increases by a factor of 1.02 per year, whereas the average other topic increases by a factor of 1 (that is, it does not increase; of course, on average, all topics cannot increase together, since they necessarily sum to 1).

[Figure: annual increase factors for women's and other topic_classes]

To give some context: the fastest-increasing topic class, clinical_research_trial, increases by a factor of 1.06 annually; the next fastest is physiotherapy, also 1.06; and after that women_hormonal_therapy, also 1.06. Others: women_osteoporosis 1.05; women_breast_cancer 1.03, etc.
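A minimal sketch of how these annual factors might be collected across topic classes, using the smooth (GLM) trend from the binomial setup above; `counts_all` (one row per pubdate x topic_class, with N_present and N_absent) is an assumed input:

```r
annual_factor <- function(d) {
  m <- glm(cbind(N_present, N_absent) ~ pubdate, data = d, family = binomial)
  exp(coef(m)["pubdate"])   # factor by which the odds change per year
}

factors <- sapply(split(counts_all, counts_all$topic_class), annual_factor)
sort(factors, decreasing = TRUE)   # clinical_research_trial, physiotherapy, ... at the top
```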

Armand1 commented 4 years ago

The frequency of woman/en (as a word) is going up; the frequency of women's topics is going up too. But the latter is going up faster than the former. Why?

One possibility is that women are being mentioned more often in papers generally --- not just those that are about women's topics. So I classified papers, on the basis of the topics, into those "about women" and those "not about women". It turns out that, in both, the frequency of "woman/en" is going up --- though in the former not as steeply as in the latter. Note that the y-axes differ here.
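A minimal sketch of this split, assuming per-paper counts of the word (`n_woman_words`) and of all words (`n_words`) alongside the summed women's-topic probability; all column names are assumptions:

```r
library(dplyr)

word_by_group <- papers %>%
  mutate(group = ifelse(women_topic_sum > 0.05, "about women", "not about women")) %>%
  group_by(group, year) %>%
  summarise(woman_freq = sum(n_woman_words) / sum(n_words), .groups = "drop")
```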

[Figure: frequency of woman/en in papers about women vs. not about women]

But I am persistently puzzled by the drop that we see in (some) women's topics and women/en after around 2010. I wonder what this is about. Papers are becoming longer, and so perhaps there are more topics and more different kinds of words. Perhaps this needs to be examined using paper-length (total word count) as a covariate. It will make the models somewhat more complicated...

evanhamulyak commented 4 years ago

I am confused: the frequency of the words "woman/women" goes up by a factor of 1.023, as we saw in the incidence analyses, whereas the average women's topic increases by a factor of 1.004 per year, much like the average of any other topic (a factor of about 1). But the women's topic classes also increase by a factor of 1.02? Then how are the women's topics going up faster than the frequency of woman/women, as you wrote? We have now basically ruled out that this observed phenomenon is caused by women being mentioned more frequently in papers not just on women's topics, as in both the frequency goes up (maybe not as steeply, but it's not that different).

With regard to the observed drop after 2005: it may just be explained by interest in the field, scientific improvements, or big changes in treatment regimens. For instance:

Armand1 commented 4 years ago

> We have now basically ruled out that this observed phenomenon is caused by women being mentioned more frequently in papers not just on women's topics, as in both the frequency goes up (maybe not as steeply, but it's not that different).

I am not sure that's right --- surely, if woman/en appears more frequently in papers that don't have women's topics in them, that would contribute to the faster rate of increase of woman/en?

All your explanations for the post-2005 declines might well be true. But I think we want to look very carefully at this before invoking particular explanations.

Armand1 commented 4 years ago

Should we include paper length as a covariate? Now that we have gotten rid of the 1997 papers, I am not so sure.

This is a comparison of three models:

m1: publication date only

```
Coefficients:
                     Estimate  Std. Error   z value  Pr(>|z|)
(Intercept)        -1.9315258   0.0112680  -171.416   < 2e-16
dec_pub_date_stand  0.0042723   0.0006979     6.121  9.28e-10
```

m2: publication date + mean_paper_length_stand

```
Coefficients:
                          Estimate  Std. Error   z value  Pr(>|z|)
(Intercept)             -1.922e+00   1.818e-02  -105.716   < 2e-16
dec_pub_date_stand       4.240e-03   6.986e-04     6.068  1.29e-09
mean_paper_length_stand  1.525e-05   2.374e-05     0.642     0.521
```

m3: publication date * mean_paper_length_stand

```
Coefficients:
                                              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)                                 -2.004e+00   2.306e-02  -86.931   < 2e-16
dec_pub_date_stand                           6.875e-03   8.461e-04    8.125  4.46e-16
mean_paper_length_stand                     -8.946e-05   2.971e-05   -3.011    0.0026
dec_pub_date_stand:mean_paper_length_stand   7.916e-06   1.366e-06    5.796  6.78e-09
```

You can see that there is a highly significant effect of both mean_paper_length_stand and its interaction with publication date (see the P values above). That suggests that we should include it.

BUT --- if we look at the effect sizes for dec_pub_date_stand (exponentiated to give us odds ratios), we find that they are very similar: m1: 1.0042814, m2: 1.0042486, m3: 1.0068985.

AND --- if we look at the effect sizes in the third model for paper length and its interaction with pubdate, we see that they are tiny, orders of magnitude smaller than the main effect of pubdate:

mean_paper_length_stand: 0.9999105
dec_pub_date_stand:mean_paper_length_stand: 1.0000079

That tells us that, although there is an effect of paper length, it is significant only because we have so much data; in practice it makes almost no difference to the model's predictions. Given that it makes graphing and explaining the effect so much more difficult, I think we should forget about it. I have also examined this for woman/en and find exactly the same thing.
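For reference, a minimal sketch of the three nested models compared above, assuming the binomial presence/absence setup used earlier with standardised covariates dec_pub_date_stand and mean_paper_length_stand (the actual response and data layout may differ):

```r
m1 <- glm(cbind(N_present, N_absent) ~ dec_pub_date_stand,
          data = counts, family = binomial)
m2 <- glm(cbind(N_present, N_absent) ~ dec_pub_date_stand + mean_paper_length_stand,
          data = counts, family = binomial)
m3 <- glm(cbind(N_present, N_absent) ~ dec_pub_date_stand * mean_paper_length_stand,
          data = counts, family = binomial)

exp(coef(m1)["dec_pub_date_stand"])   # ~1.004, as reported above
exp(coef(m3))                         # all m3 effect sizes as odds ratios
```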