An Open Source Medical Context Large Language Model (LLM) for Q&A and Prompt in Spanish Using Fine-Tuning Techniques with QLoRA and Epfl with Low Compute Resources. Inspired by Meditron, a suite of open-source medical Large Language Models (LLMs).
Following the corpus construction guidelines proposed in the article "Meditron-70b: Scaling medical pretraining for large language models", build a training corpus for the model from a variety of Spanish-language sources.
Resources to consider
https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es?tab=readme-ov-file
Resource built by another team member (Discord user @danielbrdz)
Spanish Biomedical Crawled Corpus (https://zenodo.org/records/5513237#.Yp7lU_exWV4)
Medical Lexicon for Spanish (MedLexSp) [DATASET] (Paper)
A Survey of Spanish Clinical Language Models (Paper)
Spanish Pre-Trained Language Models for HealthCare Industry (Paper)
Fine-Tuned Large Language Models for Symptom Recognition from Spanish Clinical Text (Paper)
Pretrained biomedical language models for clinical NLP in Spanish (Paper)
Pre-trained language models in Spanish for health insurance coverage (Paper)
and other approaches
Note: Document every source and always state whether it can be reproduced or used freely.
Expected results:
A volume of resources comparable to Meditron's. Include a comparative table of the resources and sources obtained.
A corpus that can be used as input for pre-training or fine-tuning an LLM (in the same way as the corpus proposed for the Meditron model).
A document describing the sources, the selection decisions, and the characteristics of each source used to build the corpus.
Note: See how this is done in the article "Meditron-70b: Scaling medical pretraining for large language models", its appendices, and the source code released with the model.
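The source documentation and comparative table requested above could be kept in a small machine-readable manifest. The following is a minimal sketch in Python, not something taken from the Meditron code: the `CorpusSource` record, its fields, and the `sources_to_csv` helper are hypothetical names introduced for illustration. License and reproducibility are left as "to be verified" placeholders, since they must be checked for each source.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class CorpusSource:
    """One documented source of the corpus (hypothetical schema)."""
    name: str
    url: str
    license: str = "to be verified"      # fill in only after checking the source
    reproducible: Optional[bool] = None  # None means not checked yet
    selection_reason: str = ""           # why the source was included


def sources_to_csv(sources):
    """Render the sources as CSV text, one row per source."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(CorpusSource)],
        lineterminator="\n",
    )
    writer.writeheader()
    labels = {True: "yes", False: "no", None: "to be verified"}
    for source in sources:
        row = asdict(source)
        row["reproducible"] = labels[source.reproducible]
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    # URLs are the ones listed under "Resources to consider".
    manifest = [
        CorpusSource(
            name="lm-biomedical-clinical-es",
            url="https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es?tab=readme-ov-file",
        ),
        CorpusSource(
            name="Spanish Biomedical Crawled Corpus",
            url="https://zenodo.org/records/5513237#.Yp7lU_exWV4",
        ),
    ]
    print(sources_to_csv(manifest))
```

Keeping the manifest as plain CSV means the same file can back both the comparative table and the written description of each source.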
IMPORTANT
From the YouTube talk "NLP clínico en español con Jocelyn Dunstan | Hackathon Somos NLP 2023" and the references provided by Jocelyn Dunstan, PhD, consider the following downloaded corpora:
MeSpEn_Parallel-Corpora, referenced in the paper "Pre-trained Biomedical Language Models for Clinical NLP in Spanish"
European Clinical Case Corpus, referenced in the paper "The E3C Project: European Clinical Case Corpus"
Files-digital-scribe, referenced in the paper "A transcription and information extraction system to facilitate EHR documentation in Spanish"
CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition)
Also check the Biomedical Spanish CBOW Word Embeddings in Floret, listed at https://github.com/PlanTL-GOB-ES/lm-spanish?tab=readme-ov-file.
Also use the following Hugging Face datasets:
-LenguajeNaturalAI/casos_clinicos_tratamiento
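Source: "Hemos colaborado con profesionales de diferentes sectores para desarrollar cuatro corpus de dominio específicos para evaluar LLMs en español" (in English: "We have collaborated with professionals from different sectors to develop four domain-specific corpora for evaluating LLMs in Spanish").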
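Once texts from the corpora and datasets listed above have been downloaded, they need to be merged into a single input for pre-training. The sketch below shows one possible way to do that, assuming each document is already available as a `(source_name, text)` pair. The JSON Lines layout with `source` and `text` fields, the exact-match deduplication, and the helper names `build_corpus` and `to_jsonl_lines` are illustrative assumptions, not the procedure used by Meditron.

```python
import hashlib
import json


def normalize(text):
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(text.split())


def build_corpus(documents):
    """Yield one record per unique, non-empty document.

    `documents` is an iterable of (source_name, text) pairs. Duplicates are
    detected by hashing the normalized, lower-cased text (exact match only;
    near-duplicate detection is out of scope for this sketch).
    """
    seen = set()
    for source, text in documents:
        cleaned = normalize(text)
        if not cleaned:
            continue  # skip empty documents
        digest = hashlib.sha256(cleaned.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # skip exact duplicates across sources
        seen.add(digest)
        yield {"source": source, "text": cleaned}


def to_jsonl_lines(records):
    """Serialize records as JSON Lines, keeping Spanish characters readable."""
    for record in records:
        yield json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    # Made-up sentences for illustration only, not taken from any corpus.
    sample = [
        ("corpus_a", "Paciente  con fiebre y tos."),
        ("corpus_b", "paciente con fiebre y tos."),
        ("corpus_b", "Se indica reposo e hidratación."),
    ]
    for line in to_jsonl_lines(build_corpus(sample)):
        print(line)
```

Keeping the `source` field on every record makes it possible to trace each document back to its origin when documenting the corpus.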