Open Luab opened 1 year ago
thank you!
If you need the articles, I’ve already scraped and cleaned them up for training a BERT model. Happy to share.
@hashdaddyd with the image captions or without?
@sandorkonya no images, just text. I've been working on pre-processing to clean up the articles (poor structure, lots of missing punctuation) and augmented them with synonym substitution, giving us ~45k articles in total. We've also been categorizing/labelling entities (we're a couple of radiologists) to build an ontology and KG from Radiopaedia, and early results show that training on Radiopaedia articles works better than training on imaging reports (manuscript under review at the moment). It's a good corpus for pretraining.
With respect to cases/images, the Radiopaedia license expressly prohibits use of their images for training or scraping (it is vaguer about articles), and my reading of their license is that copyright on those is held by the original case submitter, unlike text, which is owned by Radiopaedia and released under a CC BY-NC-SA license. IANAL, so if someone is more aware of the legal nuances here, I can share image captions and scraped images from commoncrawl/webarchives (they don't seem to be blocking crawlers in robots.txt); however, we did not use these in our paper given the legal uncertainty.
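The synonym-substitution augmentation mentioned above is not described in detail in this thread. As a rough illustration only, here is a minimal sketch of dictionary-based substitution. The `SYNONYMS` table, the `augment` function, and the whole-word matching rule are assumptions made for the example, not the authors' actual pipeline.

```python
import random
import re
from typing import Dict, List

# Hypothetical synonym table (lower-case keys); a real one would come from
# a curated medical vocabulary rather than being hard-coded like this.
SYNONYMS: Dict[str, List[str]] = {
    "cardiomegaly": ["enlarged heart"],
    "x-ray": ["radiograph"],
}

def augment(text: str, synonyms: Dict[str, List[str]], rng: random.Random) -> str:
    """Replace each whole-word occurrence of a known term with a random synonym."""
    if not synonyms:
        return text
    # Longest terms first so multi-word terms win over their substrings.
    terms = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def replace(match: "re.Match[str]") -> str:
        # Lookup is case-insensitive; the replacement keeps the synonym's own casing.
        return rng.choice(synonyms[match.group(0).lower()])

    return pattern.sub(replace, text)

augmented = augment("Cardiomegaly on chest x-ray.", SYNONYMS, random.Random(0))
print(augmented)
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which helps when the augmented corpus has to be regenerated later.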
+ data biomed label?
@hashdaddyd wow, that would be great. If you could also provide the source code of the scraper, that would be absolutely perfect! Thanks!
No problem, we were already planning on sharing our code and the various datasets after peer review.
I’ve been waiting to clean up a couple of things before making the repo public and am in crunch mode for my upcoming board exam, but I can get to this in the next few weeks. If you need this sooner I can push the article dataset to huggingface in the meantime.
I think there is no rush. Please share the link to your repo whenever you are ready.
Hello, thank you for your work. Do you plan on sharing the methodology you used to clean that dataset?
@ilisparrow yup, it’s described in the paper (currently under review) and will be in the repo as well.
Hi @hashdaddyd and @Luab, checking on status of this dataset.
Apologies, I've been extremely busy lately; my schedule will free up at the end of March. I'll post the source for parsing and augmentation as soon as I can. There are just a few broken functions that I've worked around in a notebook but have to fix in the library before making the repo public.
In the interim I've posted the base cleaned dataset with synonyms, links and article descriptions: https://huggingface.co/datasets/hashdujaili/radiopaedia
Given the request for headings to generate Q-A pairs I've bracketed the relevant sections with
If there are missing fields needed in the immediate future, let me know and I'll try to fit them in.
Sounds really interesting! What's the current status of this? :)
@hashdaddyd and @ChantalMP - do you have recommendations for more data similar to above?
Dataset is on huggingface. Code updated but still in review process.
In typical fashion, reviewer 2 was unhappy and asked for some revisions. More specifically, the main issue is that the article categories affixed by Radiopaedia editors (e.g. disease, imaging test, imaging finding, etc.) are unreliable, either because they are missing or because they are incorrectly applied. As a result we don’t have an accurate breakdown of the article types, which we need in order to use the links (i.e. to say that an imaging sign mentioned in an article is a sign of the disease, we first need to know which articles are truly diseases).
To satisfy this, a colleague and I (both radiologists) have separately manually categorized all 15,000+ articles. We’ll be meeting this week to reconcile discrepancies and I will subsequently add the consensus labels to the dataset.
I think this should result in a relatively useful KG (at least for disease, differential diagnosis, imaging signs, imaging test) and let us generate training prompts.
Open to thoughts if anyone has anything different to suggest.
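The reconciliation step described above (two raters labelling every article independently, then resolving disagreements) could be sketched roughly as follows. The thread does not say which agreement statistic, if any, is being used; Cohen's kappa is an assumption here, as are the function names and the toy labels.

```python
from collections import Counter
from typing import Dict, List, Tuple

def find_discrepancies(
    rater_a: Dict[str, str], rater_b: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """List (article, label_a, label_b) for articles the two raters labelled differently."""
    shared = sorted(rater_a.keys() & rater_b.keys())
    return [(k, rater_a[k], rater_b[k]) for k in shared if rater_a[k] != rater_b[k]]

def cohen_kappa(rater_a: Dict[str, str], rater_b: Dict[str, str]) -> float:
    """Cohen's kappa over the articles labelled by both raters."""
    shared = sorted(rater_a.keys() & rater_b.keys())
    n = len(shared)
    if n == 0:
        raise ValueError("no articles were labelled by both raters")
    # Observed agreement: fraction of articles with identical labels.
    observed = sum(rater_a[k] == rater_b[k] for k in shared) / n
    # Chance agreement from each rater's label frequencies.
    count_a = Counter(rater_a[k] for k in shared)
    count_b = Counter(rater_b[k] for k in shared)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy labels, made up for illustration.
labels_a = {
    "cardiomegaly": "imaging finding",
    "pneumonia": "disease",
    "chest x-ray": "imaging test",
    "silhouette sign": "imaging sign",
}
labels_b = dict(labels_a, cardiomegaly="disease")

print(find_discrepancies(labels_a, labels_b))
print(round(cohen_kappa(labels_a, labels_b), 3))
```

The discrepancy list is what the two raters would review together; kappa only summarises how much agreement there was before that review.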
PathologyOutlines is also CC (or fair use), I believe, and has useful articles.
StatPearls is another great resource that can be framed as prompts and is definitely CC.
@hashdaddyd I cannot find your radiopaedia dataset on huggingface anymore. Is there any way to make it available again? :)
Hello @hashdaddyd, I'm currently working on a project and would like to work with a Radiopaedia dataset. Is there a way to access your dataset today? Thank you!
Take the pages from Radiopaedia, which contains about 16k articles under a CC licence. They could be used in a simple Q-A setting where the question is a rephrased section name and the answer is the paragraph from that section. For example: Cardiomegaly (https://radiopaedia.org/articles/cardiomegaly)
Q: What are the Radiographic features of Cardiomegaly?
A: In most cases, merely 'eye-balling' a chest x-ray will be sufficient in detecting cardiomegaly (as the heart is either clearly normal in size or clearly abnormally enlarged). In equivocal cases, the cardiothoracic ratio (CTR) can be easily calculated on a PA chest x-ray. The CTR measures the width of the cardiac silhouette and the thoracic cavity; a ratio greater than 0.5 is an abnormal finding.
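The section-to-question idea above could look something like the sketch below. It assumes each article has already been parsed into a mapping from section heading to paragraph text; the `make_qa_pairs` function and the single "What are the ... of ...?" template are hypothetical and only fit plural headings such as "Radiographic features".

```python
from typing import Dict, List, Tuple

def make_qa_pairs(title: str, sections: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn {section heading: paragraph} into (question, answer) pairs."""
    pairs = []
    for heading, paragraph in sections.items():
        answer = " ".join(paragraph.split())  # collapse stray whitespace and newlines
        if not answer:
            continue  # skip empty sections
        # Hypothetical template; other headings (e.g. "Pathology") need their own phrasing.
        question = f"What are the {heading.strip()} of {title.strip()}?"
        pairs.append((question, answer))
    return pairs

article_sections = {
    "Radiographic features": "In equivocal cases, the cardiothoracic ratio (CTR)\n"
                             "can be easily calculated on a PA chest x-ray.",
    "Empty section": "   ",
}

for question, answer in make_qa_pairs("Cardiomegaly", article_sections):
    print("Q:", question)
    print("A:", answer)
```

A fuller version would map each heading type to its own question template rather than reusing one pattern for every section.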