Open albertvillanova opened 2 years ago
cc @kyleclo
Exists in the hub. However, based on the data card, the current version is likely v20190928 rather than the latest one, so the dataset is missing the updates detailed here.
Thanks @j-chim.
Link: https://huggingface.co/datasets/s2orc
The version of the canonical dataset has date 2020-12-01:
_ROOT_URL = "https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2020-12-01/"
Sample:
{'id': '984774e366d3d4fcf0fce659f697478dccd4f93c',
'title': 'Fully convolutional architecture vs sliding-window CNN for corneal endothelium cell segmentation',
'paperAbstract': 'BackgroundCorneal endothelium (CE) images provide valuable clinical information regarding the health state of the cornea. Computation of the clinical morphometric parameters requires the segmentation of endothelial cell images. Current techniques to image the endothelium in vivo deliver low quality images, which makes automatic segmentation a complicated task. Here, we present two convolutional neural networks (CNN) to segment CE images: a global fully convolutional approach based on U-net, and a local sliding-window network (SW-net). We propose to use probabilistic labels instead of binary, we evaluate a preprocessing method to enhance the contrast of images, and we introduce a postprocessing method based on Fourier analysis and watershed to convert the CNN output images into the final cell segmentation. Both methods are applied to 50 images acquired with an SP-1P Topcon specular microscope. Estimates are compared against a manual delineation made by a trained observer.ResultsU-net (AUC=0.9938) yields slightly sharper, clearer images than SW-net (AUC=0.9921). After postprocessing, U-net obtains a DICE=0.981 and a MHD=0.22 (modified Hausdorff distance), whereas SW-net yields a DICE=0.978 and a MHD=0.30. U-net generates a wrong cell segmentation in only 0.48% of the cells, versus 0.92% for the SW-net. U-net achieves statistically significant better precision and accuracy than both, Topcon and SW-net, for the estimates of three clinical parameters: cell density (ECD), polymegethism (CV), and pleomorphism (HEX). The mean relative error in U-net for the parameters is 0.4% in ECD, 2.8% in CV, and 1.3% in HEX. The computation time to segment an image and estimate the parameters is barely a few seconds.ConclusionsBoth methods presented here provide a statistically significant improvement over the state of the art. U-net has reached the smallest error rate. 
We suggest a segmentation refinement based on our previous work to further improve the performance.',
'entities': [],
's2Url': 'https://semanticscholar.org/paper/984774e366d3d4fcf0fce659f697478dccd4f93c',
'pdfUrls': ['https://bmcbiomedeng.biomedcentral.com/track/pdf/10.1186/s42490-019-0003-2'],
's2PdfUrl': '',
'authors': [{'name': 'Juan P. Vigueras-Guillén', 'ids': ['1404113169']},
{'name': 'Busra Sari', 'ids': ['113782037']},
{'name': 'Stanley F. Goes', 'ids': ['1419382500']},
{'name': 'Hans G. Lemij', 'ids': ['82955974']},
{'name': 'Jeroen van Rooij', 'ids': ['143638168']},
{'name': 'Koenraad A. Vermeer', 'ids': ['1693033']},
{'name': 'Lucas J. van Vliet', 'ids': ['134339884']}],
'inCitations': ['c62f7bfce0aee11940925f4619d9f9be60c7b5ba',
'0d1b8f40bfa29acd1ef6ced824bc89f74426422c',
'8d4e0d120887cba389ad8897b660e8c663de6bb6',
'ec3aad2e8c5ab6bdd3802cb7d147684541dc4631',
'a38b350c48aa54b4905ce09877231949eb9ef96d',
'61aa49a2cf81075bf7e2d081aa897a78f91b35c1',
'178502d887d652758d33e453a85c85db3114cc1f',
'e37385e5196dc2a884bea325bd939006f6eaf8ff',
'f513c0e665f02c23b7cc04492ce161865c84785a',
'bb8c06e6d41175417a951aaa9c59691b1a74cf59',
'e999babf70ceb7382233fa17f5d838fc12080a25'],
'outCitations': ['1366de5bb112746a555e9c0cd00de3ad8628aea8',
'8ea4aac5f2995bbf8480018ecafa0b317d70177c',
'b34f3795e3ef3167706e923024c40584d2aba0d9',
'e225684e3172b2cbb41ed1b9d0791232043bf6eb',
'f19284f6ab802c8a1fcde076fcb3fba195a71723',
'168bb1336676ff7658425a8609b8ac5b8b10aaa2',
'c3108c14216c0c80609743a20f5cea244f62620a',
'4d376d6978dad0374edfa6709c9556b42d3594d3',
'3c1e44605135ff2656cd122c8e158dfc7313cef8',
'1cba60bd2d8bec7b0a02bbd6ddcbc2d59aa69da2',
'6bc81f58f93b76b62042f510fb9f39ea5480802c',
'2f754f83d9c92322924c4180dd549890d7e50352',
'5fa91b23cef2dcc07cebe34a8b6e89e44a7e3e50',
'0f84a81f431b18a78bd97f59ed4b9d8eda390970',
'a6cb366736791bcccc5c8639de5a8f9636bf87e8',
'29e03ddae4edf90d49defe4143151411e7822cd6',
'fdd3d4c4c9c8a3866f860ff7c23393de39d098a1',
'317aee7fc081f2b137a85c4f20129007fd8e717e',
'5f8bc6319446f38d62fb154505312115e7f59baf',
'6364fdaa0a0eccd823a779fcdd489173f938e91a',
'3c09bb369efdcae72f4d2aee8c25efb473864162',
'a5d36aa713a02851276aab629f2c4797aa0749a3',
'876c4856b41efc07581ddc0a9c6c63d1598e43f2',
'48378c5f5a472def51401bd14e2ee131141bc064',
'44c481f3f17a8b1e9d1965ad37b1f88eca252779',
'c02a962e75d570dde44ba3a2f0ea7616e8d1a78f',
'49dbeef8655afd3a7abb0fcfb7a2e27da8c09d2d',
'eafb457dde5c76a789071067a76864c283586e51',
'5e83ab70d0cbc003471e87ec306d27d9c80ecb16',
'5562a56da3a96dae82add7de705e2bd841eb00fc',
'49b178278a6ade55938ce2d5398cd7a1f9b3e922',
'fcf43325529c8b1cc26aeb52fd5d7e532abb0a40',
'09193e19b59fc8f05bee9d6efbfb1607ca5b6501',
'62766c143e0ca5974a3ddf6fdfa6472862af8970',
'10c0f65c03f9d7fa120205d4da97977df988d0c8',
'58ad20f90181c920b60d06fe80519f4551ad2030',
'15227b55dd33c6a3285af680fb21194c039df3e8',
'3dc1e5fbf7842c214554aac02343cfd1b44ea435',
'23045299013e8738bc8eff73827ef8de256aef66',
'34f25a8704614163c4095b3ee2fc969b60de4698',
'5e1d7732f7e3fa229feb7248d0dc763b8abac0a5',
'42cac845ecb43000186f1d8ad4aacc2e2d0ce3cc',
'f51ac2bdcc2addfd2fe540ac483ae4d54016400e',
'77d1b1faf44e7f0ce7b36fcc9e0c2d3296178dd9',
'13db654a257b1f094a6eb07d1f6378ae34f54163',
'e794ba3dca412f432a8e99f9a84b1f6514b42829',
'd9dab7574d56ae81efe6c90c213c6509b36cf950',
'abd1c342495432171beb7ca8fd9551ef13cbd0ff',
'09ef0911c4579fe14ae1c1b7ba8156cc43bd440e',
'a8e8f3c8d4418c8d62e306538c9c1292635e9d27',
'4fb85c05adbaf101a781c0ccc78017be41d47d17',
'6a22e8ddce1eee044f990e1ac3b7ec0fb77cca0e',
'8f35c1b4c4cc06176350d827a555e06cc86d3f67',
'b0d64e7135d043185aee8b89a89372fb79deed94',
'f7237cc2f7fdcf462e8b0b4c2c6e645facdb0e0b',
'9f265054bed707c49a11214126f975cc565d3279',
'5c8d575d43f24c86698181e42e62065a37f021c4',
'a4c0881ea9e08b674bdc81286c07d3fc8dc0cc78',
'a53a49dde4bc6a7ee8b9cd4b8f60ac724a194fc0'],
'fieldsOfStudy': ['Computer Science', 'Medicine'],
'year': 2019,
'venue': 'BMC biomedical engineering',
'journalName': 'BMC Biomedical Engineering',
'journalVolume': '1',
'journalPages': '',
'sources': ['Medline'],
'doi': '10.1186/s42490-019-0003-2',
'doiUrl': 'https://doi.org/10.1186/s42490-019-0003-2',
'pmid': '32903308',
'magId': '2943196448'}
Two questions about integrating this for language modeling:
cc @albertvillanova
Also note that the dataset contains other languages as well. Looking at a few examples in the dataset viewer on the hub, one can see Chinese and Japanese examples too.
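A quick way to spot such samples is to check for CJK characters in the title or abstract. The sketch below is only a rough heuristic, not something used in this thread: the function name `contains_cjk` is hypothetical, the Unicode ranges are a deliberately small subset, and it will not catch non-English text written in Latin script.

```python
def contains_cjk(text):
    """Return True if text has at least one character from a few CJK blocks.

    Hypothetical helper for illustration; the ranges cover only a subset
    of the CJK-related Unicode blocks.
    """
    for ch in text:
        cp = ord(ch)
        if (
            0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
            or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
        ):
            return True
    return False


# Accented Latin names such as the first author in the sample are not flagged.
print(contains_cjk("Juan P. Vigueras-Guillén"))
print(contains_cjk("角膜内皮細胞"))
```

A dedicated language-identification model would be more reliable than this check; the sketch is only meant for eyeballing how many records are affected.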
Done: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_s2orc
Note: For now this only contains the titles and abstracts, separated by a newline. Documents without an abstract are filtered out (it seems this also helps a bit with the language contamination, as samples in other languages often only have a title). cc @albertvillanova @yjernite
Sample:
{'text': "Clinical or Industrial Pharmacy? Case Studies of Hospital Pharmacy Automation in Canada and France...",
'meta': "{'id': '062e9c7579adc73129e1198671d05905f07d4ab5'}"}
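The conversion described in the note (title and abstract joined by a newline, records without an abstract dropped) could look roughly like the sketch below. This is an assumption about the shape of the processing, not the actual script: the function name `to_lm_record` is hypothetical, and only the field names `id`, `title`, and `paperAbstract` and the `text`/`meta` output keys come from the samples above.

```python
import ast


def to_lm_record(paper):
    """Map one S2ORC-style record to a {'text', 'meta'} record, or None.

    Hypothetical helper: returns None when the abstract is missing or empty,
    mirroring the filtering described in the note.
    """
    title = (paper.get("title") or "").strip()
    abstract = (paper.get("paperAbstract") or "").strip()
    if not abstract:
        return None
    return {
        "text": f"{title}\n{abstract}",
        # The sample stores meta as the string form of a dict.
        "meta": str({"id": paper["id"]}),
    }


record = to_lm_record(
    {"id": "abc", "title": "A title", "paperAbstract": "An abstract."}
)
# meta is a string, so it has to be parsed back before use.
paper_id = ast.literal_eval(record["meta"])["id"]
```

Since `meta` is a stringified dict with single quotes rather than JSON, `ast.literal_eval` is used to read it back; `json.loads` would reject it.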
Thanks @lvwerra.
Should I keep this issue open until we know if we will also have the full text?