bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset s2orc_the_semantic_scholar_open_research_corpus #127

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago
yjernite commented 2 years ago

cc @kyleclo

j-chim commented 2 years ago

self-assign

j-chim commented 2 years ago

This dataset already exists on the hub. However, based on the data card, the current version is likely v20190928 rather than the latest one, which means the dataset is missing the updates detailed here.

albertvillanova commented 2 years ago

Thanks @j-chim.

Link: https://huggingface.co/datasets/s2orc

The canonical dataset on the hub uses the release dated 2020-12-01:


_ROOT_URL = "https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2020-12-01/"
albertvillanova commented 2 years ago

Sample:


{'id': '984774e366d3d4fcf0fce659f697478dccd4f93c',
 'title': 'Fully convolutional architecture vs sliding-window CNN for corneal endothelium cell segmentation',
 'paperAbstract': 'BackgroundCorneal endothelium (CE) images provide valuable clinical information regarding the health state of the cornea. Computation of the clinical morphometric parameters requires the segmentation of endothelial cell images. Current techniques to image the endothelium in vivo deliver low quality images, which makes automatic segmentation a complicated task. Here, we present two convolutional neural networks (CNN) to segment CE images: a global fully convolutional approach based on U-net, and a local sliding-window network (SW-net). We propose to use probabilistic labels instead of binary, we evaluate a preprocessing method to enhance the contrast of images, and we introduce a postprocessing method based on Fourier analysis and watershed to convert the CNN output images into the final cell segmentation. Both methods are applied to 50 images acquired with an SP-1P Topcon specular microscope. Estimates are compared against a manual delineation made by a trained observer.ResultsU-net (AUC=0.9938) yields slightly sharper, clearer images than SW-net (AUC=0.9921). After postprocessing, U-net obtains a DICE=0.981 and a MHD=0.22 (modified Hausdorff distance), whereas SW-net yields a DICE=0.978 and a MHD=0.30. U-net generates a wrong cell segmentation in only 0.48% of the cells, versus 0.92% for the SW-net. U-net achieves statistically significant better precision and accuracy than both, Topcon and SW-net, for the estimates of three clinical parameters: cell density (ECD), polymegethism (CV), and pleomorphism (HEX). The mean relative error in U-net for the parameters is 0.4% in ECD, 2.8% in CV, and 1.3% in HEX. The computation time to segment an image and estimate the parameters is barely a few seconds.ConclusionsBoth methods presented here provide a statistically significant improvement over the state of the art. U-net has reached the smallest error rate. 
We suggest a segmentation refinement based on our previous work to further improve the performance.',
 'entities': [],
 's2Url': 'https://semanticscholar.org/paper/984774e366d3d4fcf0fce659f697478dccd4f93c',
 'pdfUrls': ['https://bmcbiomedeng.biomedcentral.com/track/pdf/10.1186/s42490-019-0003-2'],
 's2PdfUrl': '',
 'authors': [{'name': 'Juan P. Vigueras-Guillén', 'ids': ['1404113169']},
  {'name': 'Busra  Sari', 'ids': ['113782037']},
  {'name': 'Stanley F. Goes', 'ids': ['1419382500']},
  {'name': 'Hans G. Lemij', 'ids': ['82955974']},
  {'name': 'Jeroen  van Rooij', 'ids': ['143638168']},
  {'name': 'Koenraad A. Vermeer', 'ids': ['1693033']},
  {'name': 'Lucas J. van Vliet', 'ids': ['134339884']}],
 'inCitations': ['c62f7bfce0aee11940925f4619d9f9be60c7b5ba',
  '0d1b8f40bfa29acd1ef6ced824bc89f74426422c',
  '8d4e0d120887cba389ad8897b660e8c663de6bb6',
  'ec3aad2e8c5ab6bdd3802cb7d147684541dc4631',
  'a38b350c48aa54b4905ce09877231949eb9ef96d',
  '61aa49a2cf81075bf7e2d081aa897a78f91b35c1',
  '178502d887d652758d33e453a85c85db3114cc1f',
  'e37385e5196dc2a884bea325bd939006f6eaf8ff',
  'f513c0e665f02c23b7cc04492ce161865c84785a',
  'bb8c06e6d41175417a951aaa9c59691b1a74cf59',
  'e999babf70ceb7382233fa17f5d838fc12080a25'],
 'outCitations': ['1366de5bb112746a555e9c0cd00de3ad8628aea8',
  '8ea4aac5f2995bbf8480018ecafa0b317d70177c',
  'b34f3795e3ef3167706e923024c40584d2aba0d9',
  'e225684e3172b2cbb41ed1b9d0791232043bf6eb',
  'f19284f6ab802c8a1fcde076fcb3fba195a71723',
  '168bb1336676ff7658425a8609b8ac5b8b10aaa2',
  'c3108c14216c0c80609743a20f5cea244f62620a',
  '4d376d6978dad0374edfa6709c9556b42d3594d3',
  '3c1e44605135ff2656cd122c8e158dfc7313cef8',
  '1cba60bd2d8bec7b0a02bbd6ddcbc2d59aa69da2',
  '6bc81f58f93b76b62042f510fb9f39ea5480802c',
  '2f754f83d9c92322924c4180dd549890d7e50352',
  '5fa91b23cef2dcc07cebe34a8b6e89e44a7e3e50',
  '0f84a81f431b18a78bd97f59ed4b9d8eda390970',
  'a6cb366736791bcccc5c8639de5a8f9636bf87e8',
  '29e03ddae4edf90d49defe4143151411e7822cd6',
  'fdd3d4c4c9c8a3866f860ff7c23393de39d098a1',
  '317aee7fc081f2b137a85c4f20129007fd8e717e',
  '5f8bc6319446f38d62fb154505312115e7f59baf',
  '6364fdaa0a0eccd823a779fcdd489173f938e91a',
  '3c09bb369efdcae72f4d2aee8c25efb473864162',
  'a5d36aa713a02851276aab629f2c4797aa0749a3',
  '876c4856b41efc07581ddc0a9c6c63d1598e43f2',
  '48378c5f5a472def51401bd14e2ee131141bc064',
  '44c481f3f17a8b1e9d1965ad37b1f88eca252779',
  'c02a962e75d570dde44ba3a2f0ea7616e8d1a78f',
  '49dbeef8655afd3a7abb0fcfb7a2e27da8c09d2d',
  'eafb457dde5c76a789071067a76864c283586e51',
  '5e83ab70d0cbc003471e87ec306d27d9c80ecb16',
  '5562a56da3a96dae82add7de705e2bd841eb00fc',
  '49b178278a6ade55938ce2d5398cd7a1f9b3e922',
  'fcf43325529c8b1cc26aeb52fd5d7e532abb0a40',
  '09193e19b59fc8f05bee9d6efbfb1607ca5b6501',
  '62766c143e0ca5974a3ddf6fdfa6472862af8970',
  '10c0f65c03f9d7fa120205d4da97977df988d0c8',
  '58ad20f90181c920b60d06fe80519f4551ad2030',
  '15227b55dd33c6a3285af680fb21194c039df3e8',
  '3dc1e5fbf7842c214554aac02343cfd1b44ea435',
  '23045299013e8738bc8eff73827ef8de256aef66',
  '34f25a8704614163c4095b3ee2fc969b60de4698',
  '5e1d7732f7e3fa229feb7248d0dc763b8abac0a5',
  '42cac845ecb43000186f1d8ad4aacc2e2d0ce3cc',
  'f51ac2bdcc2addfd2fe540ac483ae4d54016400e',
  '77d1b1faf44e7f0ce7b36fcc9e0c2d3296178dd9',
  '13db654a257b1f094a6eb07d1f6378ae34f54163',
  'e794ba3dca412f432a8e99f9a84b1f6514b42829',
  'd9dab7574d56ae81efe6c90c213c6509b36cf950',
  'abd1c342495432171beb7ca8fd9551ef13cbd0ff',
  '09ef0911c4579fe14ae1c1b7ba8156cc43bd440e',
  'a8e8f3c8d4418c8d62e306538c9c1292635e9d27',
  '4fb85c05adbaf101a781c0ccc78017be41d47d17',
  '6a22e8ddce1eee044f990e1ac3b7ec0fb77cca0e',
  '8f35c1b4c4cc06176350d827a555e06cc86d3f67',
  'b0d64e7135d043185aee8b89a89372fb79deed94',
  'f7237cc2f7fdcf462e8b0b4c2c6e645facdb0e0b',
  '9f265054bed707c49a11214126f975cc565d3279',
  '5c8d575d43f24c86698181e42e62065a37f021c4',
  'a4c0881ea9e08b674bdc81286c07d3fc8dc0cc78',
  'a53a49dde4bc6a7ee8b9cd4b8f60ac724a194fc0'],
 'fieldsOfStudy': ['Computer Science', 'Medicine'],
 'year': 2019,
 'venue': 'BMC biomedical engineering',
 'journalName': 'BMC Biomedical Engineering',
 'journalVolume': '1',
 'journalPages': '',
 'sources': ['Medline'],
 'doi': '10.1186/s42490-019-0003-2',
 'doiUrl': 'https://doi.org/10.1186/s42490-019-0003-2',
 'pmid': '32903308',
 'magId': '2943196448'}
lvwerra commented 2 years ago

self-assign

lvwerra commented 2 years ago

Two questions about integrating this for language modeling:

  1. Should the title and abstract be concatenated?
  2. The description on the hub and here states that the dataset also includes full text; however, I cannot see where it would be. Do you know?

cc @albertvillanova

lvwerra commented 2 years ago

Also note that the dataset contains other languages as well. Looking at a few examples in the dataset viewer on the hub, one can see Chinese and Japanese examples too.

lvwerra commented 2 years ago

Done: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_s2orc

Note: For now this only contains the title and abstract of each paper, separated by a newline. Documents without an abstract are filtered out (it seems this also helps a bit with the language contamination, as samples in other languages often only have a title). cc @albertvillanova @yjernite

Sample:


{'text': "Clinical or Industrial Pharmacy? Case Studies of Hospital Pharmacy Automation in Canada and France...",
 'meta': "{'id': '062e9c7579adc73129e1198671d05905f07d4ab5'}"}
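
For illustration, the conversion described in the note above could look roughly like the sketch below. It is not the script that was actually used. The field names `title`, `paperAbstract`, and `id` come from the raw sample earlier in this thread, and the output keys `text` and `meta` come from the sample just above. The function names, the whitespace stripping, and the tiny in-memory `records` list are assumptions made for the example.

```python
from typing import Iterable, Iterator, Optional


def to_lm_example(record: dict) -> Optional[dict]:
    """Turn one raw record into a {'text', 'meta'} example, or None if it has no abstract."""
    # Hypothetical cleanup step: treat missing or whitespace-only fields as empty.
    title = (record.get("title") or "").strip()
    abstract = (record.get("paperAbstract") or "").strip()
    if not abstract:
        # Documents without an abstract are dropped.
        return None
    return {
        # Title and abstract separated by a single newline.
        "text": f"{title}\n{abstract}",
        # The sample stores the metadata as a stringified dict.
        "meta": str({"id": record["id"]}),
    }


def convert(records: Iterable[dict]) -> Iterator[dict]:
    """Yield converted examples, skipping records that were filtered out."""
    for record in records:
        example = to_lm_example(record)
        if example is not None:
            yield example


# Made-up records, only to exercise the functions.
records = [
    {"id": "a1", "title": "A title", "paperAbstract": "An abstract."},
    {"id": "b2", "title": "Title only", "paperAbstract": ""},
]
examples = list(convert(records))
```

With the two made-up records, only the first one survives the filter, since the second has an empty abstract.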
albertvillanova commented 2 years ago

Thanks @lvwerra.

Should I keep this issue open until we know if we will also have the full text?