Hi @suchenzang!
We ran a few ablations training with and without citations, and found that citations extracted as-is generally hurt the representations learned by the model. We are considering adding them back in with consistent formatting across papers in the next version.
If you need citations sooner, I would recommend requesting access to S2ORC, which peS2o is derived from, and extracting citations from there.
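For example, something along these lines could pull bibliography entries out of an S2ORC pdf_parse shard. This is a rough sketch: the shard path is hypothetical, and the field names (`paper_id`, `bib_entries`, `title`, `year`) assume the 2020 S2ORC pdf_parse schema; newer Semantic Scholar releases use a different layout, so adjust accordingly.

```python
import gzip
import json

def iter_bibliographies(pdf_parse_path):
    """Yield (paper_id, bib_entries) from one S2ORC pdf_parse shard (.jsonl.gz).

    Field names follow the 2020 S2ORC pdf_parse schema; newer releases
    differ, so treat these as assumptions to verify against your copy.
    """
    with gzip.open(pdf_parse_path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["paper_id"], record.get("bib_entries", {})

# Example: print a rough reference list for each paper in one shard
# (the shard filename here is illustrative).
for paper_id, bib in iter_bibliographies("pdf_parses_0.jsonl.gz"):
    for ref_id, entry in bib.items():
        print(paper_id, ref_id, entry.get("year"), entry.get("title"))
```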
Hope this helps!
Best, Luca
Thanks @soldni for the info! Could you clarify what you mean by "hurting representations learned by the model"? I'm currently interpreting that to mean hurting standard NLP benchmarks if citations are improperly included, but perhaps there were other probes into learned representations? What model size was used for these ablations?
Thanks for this work, btw. Open-sourcing processed datasets is sorely needed in the field!
Sure thing!
We ran ablations with 1B autoregressive models trained from scratch on different versions of peS2o, evaluated zero-shot on a suite of downstream tasks (OpenBookQA, COPA, RTE, SST-2, ARC-Easy, HellaSwag, SciQ, PIQA, WinoGrande) as well as perplexity on the held-out set from M2D2. We also monitored train loss. Broadly, we saw metrics decrease when citations were included as-is, with everything else held equal.
We plan to release all models and results in a future manuscript; we just have not had a chance to write it up yet.
Grepped for `s2orc` in `train-00019-of-00020.json`, and noticed that bibliographies are not included in papers (leaving citations hanging). Any chance these could be added back in?
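For reference, a rough sketch of the kind of check involved: scan a shard for citation markers that have no accompanying reference section. The record fields (`text`, `id`), the bracketed-number citation pattern, and the "References"/"Bibliography" heuristic are all assumptions about the dump format, not how peS2o was actually audited.

```python
import json
import re

# Assumed pattern for inline citation markers, e.g. [3] or [3, 17];
# papers using (Author, Year) style would need a different regex.
citation_marker = re.compile(r"\[\d+(?:,\s*\d+)*\]")

with open("train-00019-of-00020.json", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        text = record.get("text", "")
        markers = citation_marker.findall(text)
        has_bibliography = "References" in text or "Bibliography" in text
        if markers and not has_bibliography:
            # Citation markers present but no reference section found.
            print(record.get("id"), f"{len(markers)} markers, no bibliography")
```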