dionis / SpanishMedicaLLM

An Open Source Medical Context Large Language Model (LLM) for Q&A and Prompt in Spanish Using Fine-Tuning Techniques with QLoRA and Epfl with Low Compute Resources. Inspired by Meditron, a suite of open-source medical Large Language Models (LLMs).
https://huggingface.co/epfl-llm
Apache License 2.0

Build a corpus of medical documents in Spanish for the training of an LLM #6

Open dionis opened 6 months ago

dionis commented 6 months ago

Following the corpus construction guide proposed in the article "Meditron-70b: Scaling medical pretraining for large language models", and drawing on various Spanish-language sources, build a corpus for training the model.

Resources to consider

and other approaches

Note: Document every source and always indicate whether it can be reproduced or used freely.
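One possible way to keep track of each source and its reuse terms is a small manifest. This is a minimal sketch, not part of the project: the `CorpusSource` fields, the helper names, and the two example entries (placeholder names, URLs, and licenses) are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CorpusSource:
    """One entry in the source manifest (field names are illustrative)."""
    name: str
    url: str
    license: str            # use "unknown" until the license has been verified
    redistributable: bool   # True only if the terms allow free reuse
    notes: str = ""


def to_manifest(sources):
    """Serialize the sources to JSON so the manifest can live next to the corpus."""
    return json.dumps([asdict(s) for s in sources], ensure_ascii=False, indent=2)


def needs_review(sources):
    """Names of sources whose reuse terms are still unclear or restrictive."""
    return [s.name for s in sources
            if s.license == "unknown" or not s.redistributable]


# Placeholder entries, not real corpora.
sources = [
    CorpusSource("example-source-a", "https://example.org/a", "CC-BY-4.0", True),
    CorpusSource("example-source-b", "https://example.org/b", "unknown", False,
                 "license not yet checked"),
]

print(to_manifest(sources))
print(needs_review(sources))
```

Anything returned by `needs_review` would need its license checked before the source is added to the corpus.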

Expected results:

  1. A volume of resources comparable to Meditron. Show a comparative table of the resources and sources obtained.

  2. A corpus that can be used as input for pre-training or self-tuning of an LLM (the same kind as the one proposed for the Meditron model).

  3. A document that establishes the sources, the decisions for the selection and the characteristics of each of the sources used for the construction of the corpus.
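The expected results above could be prototyped roughly as follows. This is a sketch under stated assumptions, not the Meditron pipeline: the JSONL layout with `source` and `text` fields, the exact-match deduplication, and the whitespace word count (a rough proxy for tokens) are all choices made for illustration, and the sample documents are invented.

```python
import hashlib
import json
from collections import Counter


def normalize(text):
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(text.split())


def build_corpus(documents):
    """documents: iterable of (source_name, text). Returns (records, word counts)."""
    seen = set()
    records = []
    stats = Counter()
    for source, text in documents:
        clean = normalize(text)
        if not clean:
            continue  # skip empty documents
        digest = hashlib.sha256(clean.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        records.append({"source": source, "text": clean})
        stats[source] += len(clean.split())
    return records, stats


def write_jsonl(records, path):
    """Write one JSON object per line (assumed training input format)."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def stats_table(stats):
    """Plain-text comparison table of word counts per source."""
    lines = ["source | words", "--- | ---"]
    for source, words in sorted(stats.items()):
        lines.append(f"{source} | {words}")
    return "\n".join(lines)


# Invented sample documents; the second is a whitespace variant of the first.
docs = [
    ("source-a", "El paciente  presenta fiebre."),
    ("source-b", "El paciente presenta fiebre."),
    ("source-b", "Se indica reposo y control en 48 horas."),
]
records, stats = build_corpus(docs)
print(stats_table(stats))
```

The per-source table can then be set beside the figures reported for Meditron, once the real sources and a real tokenizer replace the placeholders.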

Note: See how this is done in the article "Meditron-70b: Scaling medical pretraining for large language models", its annexes, and the source code accompanying the presented model.

IMPORTANT

From the YouTube talk "NLP clínico en español con Jocelyn Dunstan | Hackathon Somos NLP 2023" and references from Jocelyn Dunstan, PhD:

-The paper "Pre-trained Biomedical Language Models for Clinical NLP in Spanish" references the downloaded corpus MeSpEn_Parallel-Corpora.

-The paper "The E3C Project: European Clinical Case Corpus" references the downloaded corpus European Clinical Case Corpus.

-The paper "A transcription and information extraction system to facilitate EHR documentation in Spanish" references the downloaded corpus Files-digital-scribe.

-CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition).

Also check the corpus Biomedical Spanish CBOW Word Embeddings in Floret, referenced at https://github.com/PlanTL-GOB-ES/lm-spanish?tab=readme-ov-file.

Also use the Hugging Face datasets:

-LenguajeNaturalAI/casos_clinicos_tratamiento

From "We have collaborated with professionals from different sectors to develop four domain-specific corpora for evaluating LLMs in Spanish"
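A rough sketch of how the Hugging Face dataset named above might be pulled in and flattened to plain text. It assumes the `datasets` library is installed and that the repository is publicly accessible. The column names `case` and `treatment` are hypothetical placeholders, so inspect the real columns before choosing fields.

```python
def load_clinical_cases(repo_id="LenguajeNaturalAI/casos_clinicos_tratamiento"):
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    from datasets import load_dataset  # imported lazily; pip install datasets
    return load_dataset(repo_id)


def record_to_text(record, fields):
    """Join the chosen fields of one record into a single text block.

    Missing or empty fields are skipped, so the same helper can be reused
    across datasets with different schemas.
    """
    parts = [str(record[f]).strip() for f in fields if record.get(f)]
    return "\n".join(parts)


# Invented record with hypothetical column names, to show the flattening step.
sample = {"case": "Paciente de 60 años con tos.",
          "treatment": " Antibiótico oral. ",
          "id": 7}
print(record_to_text(sample, ["case", "treatment", "missing"]))

# With network access, the real schema can be inspected first:
# ds = load_clinical_cases()
# print(ds)  # lists the available splits and their column names
```

Since these corpora were developed for evaluating LLMs, it may be worth deciding explicitly whether they belong in the training corpus or should be held out for evaluation.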