dionis / SpanishMedicaLLM

An open-source medical-context Large Language Model (LLM) for Q&A and prompting in Spanish, using fine-tuning techniques with QLoRA and EPFL on low compute resources. Inspired by Meditron, a suite of open-source medical Large Language Models (LLMs).
https://huggingface.co/epfl-llm
Apache License 2.0

Identify which Spanish language corpora can be used for training an LLM in Spanish for the medical context. #15

Open dionis opened 6 months ago

dionis commented 6 months ago

Taking as a guide the resources used for training and fine-tuning the model proposed in the article Meditron-70b: Scaling medical pretraining for large language models (Meditron), check whether other free resources are available from universities or research institutions in:

Resources to consider

https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es?tab=readme-ov-file

Resource built by another team member (Discord user @danielbrdz)

Spanish Biomedical Crawled Corpus (https://zenodo.org/records/5513237#.Yp7lU_exWV4)

Medical Lexicon for Spanish (MedLexSp) [DATASET] (Paper)

A Survey of Spanish Clinical Language Models (Paper)

Spanish Pre-Trained Language Models for HealthCare Industry (Paper)

Fine-Tuned Large Language Models for Symptom Recognition from Spanish Clinical Text (Paper)

Pretrained biomedical language models for clinical NLP in Spanish (Paper)

Pre-trained language models in Spanish for health insurance coverage (Paper)

Note: For each resource, check and test whether it offers an API for download or whether scraping will be necessary.
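As a rough illustration of the API check for one of the resources above, the sketch below turns the Zenodo record link of the Spanish Biomedical Crawled Corpus into a REST API URL and lists the downloadable files from the record metadata. The helper names (`zenodo_record_id`, `zenodo_api_url`, `list_files`, `fetch_record`) are hypothetical, and the JSON layout (a `files` list whose entries carry `key` and `links.self`) is an assumption about the Zenodo records API that should be verified against a live response before relying on it.

```python
import json
import re
from urllib.parse import urlparse
from urllib.request import urlopen

# Assumed URL pattern of the Zenodo REST API for a single record.
ZENODO_API = "https://zenodo.org/api/records/{record_id}"


def zenodo_record_id(url):
    """Extract the numeric record ID from a Zenodo record page URL."""
    path = urlparse(url).path  # a fragment such as '#.Yp7lU_exWV4' is not part of the path
    match = re.search(r"/records?/(\d+)", path)
    if match is None:
        raise ValueError(f"not a Zenodo record URL: {url}")
    return match.group(1)


def zenodo_api_url(url):
    """Build the API URL that corresponds to a Zenodo record page URL."""
    return ZENODO_API.format(record_id=zenodo_record_id(url))


def list_files(record):
    """Return (file name, download link) pairs from record metadata.

    Assumes each entry in record['files'] has 'key' and 'links' -> 'self'.
    Missing fields yield None instead of raising.
    """
    return [
        (entry.get("key"), entry.get("links", {}).get("self"))
        for entry in record.get("files", [])
    ]


def fetch_record(url, timeout=30):
    """Download the record metadata as a dict (requires network access)."""
    with urlopen(zenodo_api_url(url), timeout=timeout) as response:
        return json.load(response)


# Example (needs network access, so it is left commented out):
# record = fetch_record("https://zenodo.org/records/5513237#.Yp7lU_exWV4")
# for name, link in list_files(record):
#     print(name, link)
```

If the file list comes back empty or the request fails, that resource probably has to be downloaded manually or scraped instead.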

Expected Result:

IMPORTANT

From the YouTube talk NLP clínico en español con Jocelyn Dunstan | Hackathon Somos NLP 2023 and the references given by Jocelyn Dunstan, PhD, consider the following downloaded corpora:

MeSpEn_Parallel-Corpora, referenced in the paper Pre-trained Biomedical Language Models for Clinical NLP in Spanish

European Clinical Case Corpus, referenced in the paper The E3C Project: European Clinical Case Corpus

Files-digital-scribe, referenced in the paper A transcription and information extraction system to facilitate EHR documentation in Spanish

CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition)

Also check the Biomedical Spanish CBOW Word Embeddings in Floret resource, listed at https://github.com/PlanTL-GOB-ES/lm-spanish?tab=readme-ov-file.