PathwayCommons / hyper-recent

Hyper-recent article feed

Stash (final) preprint link #40

Closed (jvwong closed 1 year ago)

jvwong commented 1 year ago

Description

Q: What is the name of the feature?

A: Improving redirects for preprint links

Q: What does this feature enable?

A: We want this feature to reduce the number of redirects the user faces. Currently, many of the bioRxiv and medRxiv preprint links on the dashboard go through several redirects before reaching the final paper or its latest version. This adds wait time for the user and makes for an unpleasant experience overall.

Q: What information must be provided for this feature?

A: For each paper, we'll need its DOI and the final link it resolves to. We should also measure how long a few papers take to redirect, and compare that time before and after the solution is implemented.
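One rough way to take that measurement is sketched below. `timeRedirect` and its injectable `fetchImpl` parameter are hypothetical names, not part of the project. The sketch assumes a runtime that provides `fetch` (a browser, or a recent Node.js).

```javascript
// Hypothetical helper: time how long a URL takes to resolve, redirects included.
// fetchImpl is injectable so the function can be exercised without network access.
async function timeRedirect(url, fetchImpl = fetch) {
  // Date.now() has millisecond resolution, which is enough for redirect latency.
  const start = Date.now();
  const response = await fetchImpl(url, { redirect: "follow" });
  const elapsedMs = Date.now() - start;
  return {
    requested: url,
    finalUrl: response.url,          // URL after all redirects were followed
    redirected: response.redirected, // true if at least one redirect happened
    elapsedMs,
  };
}
```

Running this over the same handful of DOI links before and after the change would give the comparison described above. A single run is noisy, so averaging a few runs per link is probably sensible.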

Q: What are the applicable constraints, e.g. compatibility or performance?

A: Parsing through all the papers to find the latest "final" link when displaying the dashboard UI might take as long as (or even longer than) waiting for the existing DOI links to redirect. Also, so far it seems that fetch() only tells us whether a redirect occurred, not how many. Because DOI links always involve at least one redirect, that information is not very helpful in our case. There is also a constraint on how we fetch the final URL: we don't want to send too many requests to the servers at once when looking up the final URL for all the topics. A sleep/wait can be added between groups of requests to lessen the load on the servers.
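The sleep-between-groups idea could look roughly like the sketch below. The names `sleep`, `inBatches` and `resolveFinalUrl` are hypothetical, as are any batch size and delay values; none of them come from the project. `resolveFinalUrl` relies only on the standard `Response.url` and `Response.redirected` properties, which report the final URL and whether any redirect happened, but not the redirect count.

```javascript
// Hypothetical helper: pause for ms milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run worker over items in groups of batchSize, waiting delayMs between groups.
async function inBatches(items, batchSize, delayMs, worker) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Requests within one batch run concurrently; results keep input order.
    const settled = await Promise.all(batch.map((item) => worker(item)));
    results.push(...settled);
    // Wait before the next batch, but not after the last one.
    if (i + batchSize < items.length) await sleep(delayMs);
  }
  return results;
}

// Resolve a DOI to the URL it finally lands on.
// fetchImpl is injectable so the function can be tested without the network.
async function resolveFinalUrl(doi, fetchImpl = fetch) {
  const response = await fetchImpl(`https://doi.org/${doi}`, { redirect: "follow" });
  return { doi, finalUrl: response.url, redirected: response.redirected };
}
```

A call such as `inBatches(dois, 5, 1000, (doi) => resolveFinalUrl(doi))` would then resolve five DOIs at a time with a one-second pause between groups. Both numbers are placeholders to tune against what the servers tolerate.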

Specification

Details

jvwong commented 1 year ago

Interesting edge case: the URL for a paper is not recognized by doi.org. Here the bioRxiv link is https://www.biorxiv.org/content/10.1101/2023.03.14.532571 (200 OK), while the doi.org one is https://doi.org/10.1101/2023.03.14.532571 (error: DOI NOT FOUND).

Data:

{
            id: "10.1101/2023.03.14.532571",
            score: 8.927840580391445,
            terms: ["alzheimer"],
            match: {
                alzheimer: ["abstract"]
            },
            title: "Single-cell expression predicts neuron specific proteinhomeostasis networks",
            authors: "Pechmann, S.",
            category: "genomics",
            abstract: "The protein homeostasis network keeps proteins in their correct shapes and avoids unwanted protein aggregation. In turn, the accumulation of aberrantly misfolded proteins has been directly associated with the onset of aging-associated neurodegenerative diseases such as Alzheimer's and Parkinson's. However, a detailed and rational understanding of how protein homeostasis is achieved in health, and how it can be targeted for therapeutic intervention in diseases remains missing. Here, large-scale single-cell expression data from the Allen Brain Map is analyzed to investigate the transcription regulation of the core protein homeostasis network across the human brain. Remarkably, distinct expression profiles suggest specialized protein homeostasis networks with systematic adaptations in excitatory neurons, inhibitory neurons, and non-neuronal cells. Moreover, several chaperones and Ubiquitin ligases are found transcriptionally coregulated with genes important for synapse formation and maintenance, thus linking protein homeostasis to the regulation of neuronal function. Finally, evolutionary analyses highlight the conservation of an elevated interaction density in the chaperone network, suggesting that one of the most exciting aspects of chaperone action may yet be discovered in their collective action at the systems level. More generally, our work highlights the power of computational analyses for breaking down complexity and gaining complementary insights into fundamental biological problems.",
            date: "2023-03-15T00:00:00.000Z",
            server: "biorxiv",
            doi: "10.1101/2023.03.14.532571"
}

jvwong commented 1 year ago

Update: the doi.org URL now resolves correctly, so there is a lag of a few hours between a bioRxiv record appearing and doi.org forwarding the URL request.

jvwong commented 1 year ago

Now I'm wondering whether you can just reconstruct the final URL string from (a) the DOI and (b) the version, following the pattern:

https://www.biorxiv.org/content/ + <doi> + v<version>

For instance, consider:

{
   "doi":"10.1101\/2022.05.04.490594",
   "title":"Population Genomics of Stone Age Eurasia",
   "authors":"Allentoft, M. E.; Sikora, M.; Refoyo-Martinez, A.; Irving-Pease, E. K.; Fischer, A.; Barrie, W.; Ingason, A.; Stenderup, J.; Sjögren, K.-G.; Pearson, A.; Sousa da Mota, B.; Paulsson, B. S.; Halgren, A. S.; Macleod, R.; Schjellerup Jorkov, M. L.; Demeter, F.; Novosolov, M.; Sorensen, L.; Nielsen, P. O.; Henriksen, R. A.; Vimala, T.; McColl, H.; Margaryan, A.; Ilardo, M.; Vaughn, A.; Mortensen, M. F.; Nielsen, A. B.; Hede, M. U.; Rasmussen, P.; Vinner, L.; Renaud, G.; Stern, A. J.; Trolle Jensen, T. Z.; Johannsen, N. N.; Scorrano, G.; Schroeder, H.; Lysdahl, P.; Ramsoe, A.; Skorobogatov, A.; Sc",
   "author_corresponding":"Morten E. Allentoft",
   "author_corresponding_institution":"Trace and Environmental DNA (TrEnD) Laboratory, School of Molecular and Life Science, Curtin University, Australia",
   "date":"2022-10-07",
   "version":"5",
   "type":"new results",
   "license":"cc_by_nc_nd",
   "category":"evolutionary biology",
   "jatsxml":"https:\/\/www.biorxiv.org\/content\/early\/2022\/10\/07\/2022.05.04.490594.source.xml",
   "abstract":"Several major migrations and population...",
   "published":"NA",
   "server":"biorxiv"
}

The final URL guess is: https://www.biorxiv.org/content/10.1101/2022.05.04.490594v5
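As a minimal sketch of that pattern: `buildPreprintUrl` and the `HOSTS` table are hypothetical names. The medRxiv host entry is an assumption that medRxiv follows the same template, and the input is assumed to be the already-parsed record, so the `\/` escapes in the raw JSON have become plain slashes.

```javascript
// Hosts per server. The medrxiv entry assumes the same URL template applies there.
const HOSTS = {
  biorxiv: "https://www.biorxiv.org",
  medrxiv: "https://www.medrxiv.org",
};

// Build the versioned URL: <host>/content/<doi>v<version>
function buildPreprintUrl({ doi, version, server }) {
  const host = HOSTS[String(server).toLowerCase()];
  if (!host) throw new Error(`Unknown server: ${server}`);
  if (!doi || !version) throw new Error("doi and version are required");
  return `${host}/content/${doi}v${version}`;
}

// The record above: doi 10.1101/2022.05.04.490594, version "5", server "biorxiv"
console.log(
  buildPreprintUrl({ doi: "10.1101/2022.05.04.490594", version: "5", server: "biorxiv" })
);
```

Throwing on a missing field or unknown server is a design choice for the sketch. Falling back to the plain doi.org link may be the friendlier behaviour on the dashboard.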

@Suhyma what do you think? Try a few out and see - any edge cases or gotchas?

jvwong commented 1 year ago

In fact:

https://www.biorxiv.org/content/about-biorxiv

[Screenshot taken 2023-03-21, 4:38 PM]

Suhyma commented 1 year ago

So far, this seems to work well. Since all the articles being used are from the past month, we don't need to worry about DOIs assigned before December 11, 2019 having a different URL template. I also double-checked the medRxiv documentation, and they use the same template as bioRxiv. We should be fine constructing the URL from this template as a string to solve the redirects issue; we just need to watch for changes to the template over time. (Discussed in today's meeting; adding it here for reference.)

Suhyma commented 1 year ago

Just found an edge case: papers that have multiple versions posted on the same day. For example, in the downloaded paper data, this paper's version is set to '1', so the constructed URL links to version 1. However, visiting that link brings up the "View current version of this article" button, which leads to version 2 of the article, posted on the same day (March 21, 2023). I'm not sure yet why the data hasn't been updated to the most recent version, as it's been a day since the articles were posted. I'll look into this more tomorrow and see if it's still a problem.

Suhyma commented 1 year ago

Upon testing with multiple categories, I've found another instance of this edge case and can define the issue better.

The downloaded data considers multiple versions of a paper as separate papers. As mentioned previously, there are some instances where two versions of a paper are posted on the same day (usually to make minor corrections in the previous version). When this happens with the most recent papers of the topic, we can actually see the duplicate titles in the UI (which currently displays no more than 20 papers at once), where each link points to a different version of the paper as illustrated below:

[Screenshot, 2023-03-24: dashboard UI showing duplicate titles, each linking to a different version of the paper]

The unique situation described in the last comment was that two versions of a paper were posted on the same day, and the first link (in the order that a user would scroll through the page) pointed to the older version while the second link pointed to the newer one.

After looking a little deeper into the data file for one month's worth of data, I realized that all versions of a paper that have been released within the given date range are recorded as separate papers. There are a few duplicate titles within the past month - this isn't always obvious from the UI if one of the versions is a bit older or not posted within the last few days.

To solve this, we should filter for duplicate titles and keep only the latest version of each paper before writing the data to data.json. We can then keep using the URL template method to limit redirects.
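A possible shape for that filter is sketched below. `keepLatestVersions` is a hypothetical name. The sketch groups on `doi` rather than title, on the assumption that every version of a preprint shares one DOI, and it assumes each record carries the `doi` and `version` fields shown in the sample data earlier in the thread.

```javascript
// Keep one record per DOI: the one with the highest version number.
function keepLatestVersions(papers) {
  const latest = new Map();
  for (const paper of papers) {
    const seen = latest.get(paper.doi);
    // version arrives as a string such as "5", so compare numerically.
    if (!seen || Number(paper.version) > Number(seen.version)) {
      latest.set(paper.doi, paper);
    }
  }
  // A Map keeps the order in which each DOI was first seen.
  return [...latest.values()];
}
```

Applied just before the data is written to data.json, this would leave each paper with a single entry whose `version` can feed the URL template. If titles turn out to be the only reliable key, swapping `paper.doi` for a normalized title should work the same way.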