An Open Source Medical Context Large Language Model (LLM) for Q&A and Prompting in Spanish Using Fine-Tuning Techniques with QLoRA and EPFL with Low Compute Resources. Inspired by Meditron, a suite of open-source medical Large Language Models (LLMs).
Taking as a guide the resources used in the training and fine-tuning of the model proposed in the article "Meditron-70b: Scaling medical pretraining for large language models" (Meditron), check whether other free resources are available from universities or research institutions in:
https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es?tab=readme-ov-file
Resources compiled by another team member (Discord user @danielbrdz):
Spanish Biomedical Crawled Corpus (https://zenodo.org/records/5513237#.Yp7lU_exWV4)
Medical Lexicon for Spanish (MedLexSp) [DATASET] (Paper)
A Survey of Spanish Clinical Language Models (Paper)
Spanish Pre-Trained Language Models for HealthCare Industry (Paper)
Fine-Tuned Large Language Models for Symptom Recognition from Spanish Clinical Text (Paper)
Pretrained biomedical language models for clinical NLP in Spanish (Paper)
Pre-trained language models in Spanish for health insurance coverage (Paper)
Note:
For each resource, analyze and test whether it offers an API for downloading; if it does not, scraping (for example with Scrapy) will be necessary.
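The API check described in the note can be sketched for one of the listed resources, the Spanish Biomedical Crawled Corpus hosted on Zenodo. This is a minimal sketch, not a definitive implementation: it assumes the Zenodo REST endpoint `https://zenodo.org/api/records/<id>` and assumes the returned JSON has a `files` list whose entries carry `key` and `links.self`. The function names are hypothetical helpers introduced only for illustration.

```python
import json
import re
import urllib.request

# Assumed Zenodo REST endpoint pattern for a single record.
ZENODO_API = "https://zenodo.org/api/records/{record_id}"


def zenodo_record_id(url):
    """Extract the numeric record ID from a Zenodo record URL, or return None."""
    match = re.search(r"zenodo\.org/records?/(\d+)", url)
    return match.group(1) if match else None


def zenodo_api_url(url):
    """Build the API URL for a Zenodo record page, or return None if not Zenodo."""
    record_id = zenodo_record_id(url)
    return ZENODO_API.format(record_id=record_id) if record_id else None


def list_zenodo_files(url, timeout=30):
    """Return (file name, download link) pairs for a record.

    Performs a network request. The "files", "key" and "links" field names
    are assumptions about the response layout, so they are read defensively.
    """
    api_url = zenodo_api_url(url)
    if api_url is None:
        raise ValueError("Not a Zenodo record URL: " + url)
    with urllib.request.urlopen(api_url, timeout=timeout) as response:
        record = json.load(response)
    return [
        (entry.get("key"), entry.get("links", {}).get("self"))
        for entry in record.get("files", [])
    ]


# Example (requires network access):
# for name, link in list_zenodo_files("https://zenodo.org/records/5513237"):
#     print(name, link)
```

If `zenodo_api_url` returns `None`, the resource is not hosted on Zenodo and another access route (a different API or scraping) has to be evaluated for it.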
Expected Result:
List of resources that can be used to enrich the training of models in Spanish.
A README in Markdown, or another type of document, that records the details of the research process for subsequent decision-making.
IMPORTANT
From the YouTube conference talk "NLP clínico en español con Jocelyn Dunstan | Hackathon Somos NLP 2023" and the references provided by Jocelyn Dunstan, PhD, the following downloaded corpora are referenced:
MeSpEn_Parallel-Corpora, referenced in the paper "Pre-trained Biomedical Language Models for Clinical NLP in Spanish".
European Clinical Case Corpus, referenced in the paper "The E3C Project: European Clinical Case Corpus".
Files-digital-scribe, referenced in the paper "A transcription and information extraction system to facilitate EHR documentation in Spanish".
CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition).
Also check the corpus Biomedical Spanish CBOW Word Embeddings in Floret, found through the reference https://github.com/PlanTL-GOB-ES/lm-spanish?tab=readme-ov-file.