LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0
36.93k stars 3.22k forks

Radiology Q-A dataset #1044

Open Luab opened 1 year ago

Luab commented 1 year ago

Take the pages from Radiopaedia, which contains about 16k articles under a CC licence. They could be used in a simple Q-A setting where the question is a rephrased section name and the answer is the paragraph from that section.

For example: Cardiomegaly (https://radiopaedia.org/articles/cardiomegaly)

Q: What are the Radiographic features of Cardiomegaly?

A: In most cases, merely 'eye-balling' a chest x-ray will be sufficient in detecting cardiomegaly (as the heart is either clearly normal in size or clearly abnormally enlarged). In equivocal cases, the cardiothoracic ratio (CTR) can be easily calculated on a PA chest x-ray. The CTR measures the width of the cardiac silhouette and the thoracic cavity; a ratio greater than 0.5 is an abnormal finding.
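A minimal sketch of the proposal above, assuming each article has already been parsed into a title plus a mapping of section heading to paragraph text. The `QAPair` class, the `make_qa_pairs` function and the question template are illustrative names, not part of any existing scraper.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str


# Assumed template, modelled on the Cardiomegaly example. It reads well for
# plural headings ("Radiographic features") but would need variants for
# singular ones ("Pathology", "Epidemiology").
DEFAULT_TEMPLATE = "What are the {heading} of {title}?"


def make_qa_pairs(title, sections, template=DEFAULT_TEMPLATE):
    """Turn every non-empty section of one article into a Q-A pair.

    `sections` maps a section heading to the paragraph under that heading.
    """
    pairs = []
    for heading, paragraph in sections.items():
        answer = " ".join(paragraph.split())  # collapse stray whitespace
        if not answer:
            continue  # skip headings that have no text under them
        question = template.format(
            heading=heading.strip().lower(), title=title.strip()
        )
        pairs.append(QAPair(question=question, answer=answer))
    return pairs
```

Calling `make_qa_pairs("Cardiomegaly", {"Radiographic features": "..."})` would produce one pair whose question is "What are the radiographic features of Cardiomegaly?". A hand-written template per heading type would likely give more natural questions than a single pattern.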

huu4ontocord commented 1 year ago

thank you!

hashdaddyd commented 1 year ago

If you need the articles I’ve already scraped and cleaned them up for training a BERT model, happy to share.

sandorkonya commented 1 year ago

@hashdaddyd with the image captions or without?

hashdaddyd commented 1 year ago

@sandorkonya no images, just text. I've been working on pre-processing to clean up the articles (poor structure, lots of missing punctuation) and augmented them with synonym substitution, giving us ~45k articles in total. We've also been categorizing/labelling entities (we're a couple of radiologists) to build an ontology and KG from Radiopaedia, and early results show that training on Radiopaedia articles is better than training on imaging reports (manuscript under review at the moment). It's a good corpus for pretraining.
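For readers unfamiliar with synonym-substitution augmentation, a rough sketch follows. It is not the method from the manuscript, which is not public yet; the `SYNONYMS` table and the `augment` function are hypothetical, and a real table would come from a curated source.

```python
import re

# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "cardiomegaly": ["enlarged heart"],
    "chest x-ray": ["chest radiograph", "CXR"],
}


def augment(text, synonyms=SYNONYMS):
    """Return one variant of `text` per (term, synonym) pair found in it.

    Each variant replaces every whole-word, case-insensitive occurrence of a
    single term with a single synonym; the original text is not included.
    """
    variants = []
    for term, alternatives in synonyms.items():
        # Lookarounds stop "CTR" from matching inside a longer word.
        pattern = re.compile(
            rf"(?<!\w){re.escape(term)}(?!\w)", re.IGNORECASE
        )
        if not pattern.search(text):
            continue
        for alt in alternatives:
            # A function replacement keeps `alt` literal (no backslash escapes).
            variants.append(pattern.sub(lambda _match, alt=alt: alt, text))
    return variants
```

With the table above, `augment("Cardiomegaly on a chest x-ray.")` yields three variants, one per synonym. The sketch does not preserve the capitalisation of the replaced term.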

With respect to cases/images, the Radiopaedia license expressly prohibits scraping their images or using them for training (it is vaguer regarding articles), and my reading of their license is that copyright in the images is held by the original case submitter, unlike the text, which is owned by Radiopaedia and released under a CC BY-NC-SA license. IANAL, so if someone is more aware of the legal nuances here, I can share image captions and scraped images from Common Crawl/web archives (they don't seem to be blocking crawlers in robots.txt); however, we did not use these in our paper given the legal uncertainty.

ptschandl commented 1 year ago

+ data biomed label?

Luab commented 1 year ago

@hashdaddyd wow, that would be great. If you could also provide the source code of the scraper, that would be absolutely perfect! Thanks!

hashdaddyd commented 1 year ago

No problem, we were already planning on sharing our code and the various datasets after peer review.

I’ve been waiting to clean up a couple of things before making the repo public and am in crunch mode for my upcoming board exam, but I can get to this in the next few weeks. If you need this sooner I can push the article dataset to huggingface in the meantime.

Luab commented 1 year ago

I think there is no rush. Please share the link to your repo whenever you are ready.

ilisparrow commented 1 year ago

Hello, thank you for your work. Do you plan on sharing the methodology you used to clean that dataset?

hashdaddyd commented 1 year ago

@ilisparrow yup, it’s described in the paper (currently under review) and will be in the repo as well.

huu4ontocord commented 1 year ago

Hi @hashdaddyd and @Luab, checking on status of this dataset.

hashdaddyd commented 1 year ago

Apologies, I've been extremely busy lately; my schedule will free up at the end of March. I'll post the source for parsing and augmentation as soon as I can. There are just a few broken functions that I've worked around in a notebook but have to fix in the library before making the repo public.

In the interim I've posted the base cleaned dataset with synonyms, links and article descriptions: https://huggingface.co/datasets/hashdujaili/radiopaedia

Given the request for headings to generate Q-A pairs I've bracketed the relevant sections with in the cleaned text.

If there are missing fields needed in the immediate future, let me know and I'll try to fit them in.

ChantalMP commented 1 year ago

Sounds really interesting! What's the current status of this? :)

huu4ontocord commented 1 year ago

@hashdaddyd and @ChantalMP - do you have recommendations for more data similar to above?

hashdaddyd commented 1 year ago

Dataset is on huggingface. Code updated but still in review process.

In typical fashion, reviewer 2 was unhappy and has requested some revisions. More specifically, the main issue is that the article categories affixed by Radiopaedia editors (e.g. disease, imaging test, imaging finding, etc.) are unreliable because they are either missing or incorrectly applied, so we don't have an accurate breakdown of article types with which to interpret the links (i.e. we can't infer that an imaging sign mentioned in an article is a sign of that disease, as we do not necessarily know which articles are truly diseases).

To satisfy this, a colleague and I (both radiologists) have separately manually categorized all 15,000+ articles. We’ll be meeting this week to reconcile discrepancies and I will subsequently add the consensus labels to the dataset.
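One way the reconciliation step could be prepared, sketched under assumptions: each rater's labels are held in a dict keyed by article title, agreement is summarised with Cohen's kappa, and only the disagreements are listed for the consensus meeting. The function names are illustrative, and nothing here is taken from the actual workflow.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def find_discrepancies(rater_a, rater_b):
    """Map article -> (label_a, label_b) for articles the raters disagree on.

    Only articles labelled by both raters are compared.
    """
    shared = sorted(rater_a.keys() & rater_b.keys())
    return {
        article: (rater_a[article], rater_b[article])
        for article in shared
        if rater_a[article] != rater_b[article]
    }
```

Reporting kappa alongside the consensus labels would also give reviewers a measure of how reliable the manual categories are.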

I think this should yield a relatively useful KG (at least for disease, differential diagnosis, imaging signs, imaging test) and let us generate training prompts.
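As a rough illustration of how consensus categories could turn plain article links into typed KG edges, assuming `categories` maps article title to its label and `links` maps article title to the titles it links to. The relation names and the `build_edges` function are hypothetical, and the article names in the usage note are placeholders.

```python
# Hypothetical mapping from (source category, target category) to a relation.
# Pairs that are not listed produce no edge.
RELATIONS = {
    ("disease", "imaging sign"): "has_imaging_sign",
    ("disease", "imaging test"): "imaged_with",
}


def build_edges(categories, links, relations=RELATIONS):
    """Return sorted (source, relation, target) triples for typed links.

    A link becomes an edge only when both articles have a category and the
    category pair appears in `relations`.
    """
    edges = set()
    for source, targets in links.items():
        source_cat = categories.get(source)
        for target in targets:
            relation = relations.get((source_cat, categories.get(target)))
            if relation is not None:
                edges.add((source, relation, target))
    return sorted(edges)
```

For example, with `categories = {"Disease A": "disease", "Sign B": "imaging sign"}` and `links = {"Disease A": ["Sign B"]}`, the only edge is `("Disease A", "has_imaging_sign", "Sign B")`. Each triple could then be rendered into a prompt with a per-relation template.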

Open to thoughts if anyone has anything different to suggest.

hashdaddyd commented 1 year ago

> @hashdaddyd and @ChantalMP - do you have recommendations for more data similar to above?

Pathologyoutlines is also CC (or fair use) I believe and has useful articles.

StatPearls is another great resource that can be framed as prompts and is definitely CC.

ChantalMP commented 1 year ago

@hashdaddyd I cannot find your radiopaedia dataset on huggingface anymore. Is there any way to make it available again? :)