Build a corpus of medical documents in Spanish for the training of an LLM

Following the corpus construction approach proposed in the article "Meditron-70b: Scaling medical pretraining for large language models", build a corpus for training the model from a variety of Spanish-language sources.

Resources to consider

https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es?tab=readme-ov-file
Resource built by another team member (Discord user @danielbrdz)
Spanish Biomedical Crawled Corpus (https://zenodo.org/records/5513237#.Yp7lU_exWV4)
Medical Lexicon for Spanish (MedLexSp) [DATASET] (Paper)
A Survey of Spanish Clinical Language Models (Paper)
Spanish Pre-Trained Language Models for HealthCare Industry (Paper)
Fine-Tuned Large Language Models for Symptom Recognition from Spanish Clinical Text (Paper)
Pretrained biomedical language models for clinical NLP in Spanish (Paper)
Pre-trained language models in Spanish for health insurance coverage (Paper)

and other approaches.

Note: Document every source, and always state whether it can be reproduced or reused in a free format.
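
One way to follow this note is to keep a small machine-readable record per source. The sketch below is only an illustration: the SourceRecord fields, the can_use rule, and the example entry are all assumptions, not a format defined in this issue or in the Meditron article.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class SourceRecord:
    """One corpus source and its reuse status (field names are hypothetical)."""
    name: str
    url: str
    license: str        # license as stated by the source, or "unknown"
    reproducible: bool  # can others repeat the download and extraction?
    free_format: bool   # is the data available in an open, reusable format?


def can_use(record: SourceRecord) -> bool:
    """Assumed rule: keep a source only if its license is known and it is freely reusable."""
    return record.license.lower() != "unknown" and record.free_format


def to_jsonl(records: list[SourceRecord]) -> str:
    """Serialize the records as JSON Lines, one source per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


# Placeholder entry: every value here is hypothetical and must be replaced
# with what the real source page states.
records = [
    SourceRecord(
        name="example-source",
        url="https://example.org/corpus",
        license="unknown",
        reproducible=True,
        free_format=True,
    ),
]

print(to_jsonl(records))
```

Keeping the license as free text and deciding usability in a separate function means the rule can be tightened later without rewriting the records.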

Expected results:

An amount of resources comparable to Meditron. Show a comparative table of the resources and sources obtained.
A corpus that can be used as input for pre-training or self-tuning of an LLM (same as the one proposed for the Meditron model)
A document that establishes the sources, the decisions for the selection and the characteristics of each of the sources used for the construction of the corpus.
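
The comparative table mentioned in the expected results could be generated from simple per-source statistics. This is a minimal sketch under stated assumptions: the column names are hypothetical, the token count is a whitespace split (not the tokenizer of any particular model), and the sample texts are invented placeholders rather than real corpus data.

```python
def summarize(name: str, texts: list[str], license: str) -> dict:
    """Compute basic statistics for one source (whitespace tokens are an approximation)."""
    return {
        "source": name,
        "documents": len(texts),
        "tokens": sum(len(t.split()) for t in texts),
        "license": license,
    }


def format_table(rows: list[dict], columns: list[str]) -> str:
    """Render rows as a plain-text table with aligned columns."""
    widths = {c: max([len(c)] + [len(str(r[c])) for r in rows]) for c in columns}
    header = " | ".join(c.ljust(widths[c]) for c in columns)
    rule = "-+-".join("-" * widths[c] for c in columns)
    body = [" | ".join(str(r[c]).ljust(widths[c]) for c in columns) for r in rows]
    return "\n".join([header, rule, *body])


# Placeholder input: a hypothetical source with two invented example texts.
rows = [summarize("example-source", ["dolor de cabeza", "fiebre"], "unknown")]
print(format_table(rows, ["source", "documents", "tokens", "license"]))
```

For a comparison with Meditron, the same columns can be filled in from the figures reported in the article, which are deliberately not reproduced here.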

Note: See how this is done in the article "Meditron-70b: Scaling medical pretraining for large language models", its annexes, and the source code of the presented model.

IMPORTANT

From the YouTube talk "NLP clínico en español con Jocelyn Dunstan | Hackathon Somos NLP 2023" and the references provided by Jocelyn Dunstan, PhD, also consider the following downloaded corpora:

MeSpEn_Parallel-Corpora, referenced in the paper "Pre-trained Biomedical Language Models for Clinical NLP in Spanish"
European Clinical Case Corpus, referenced in the paper "The E3C Project: European Clinical Case Corpus"
Files-digital-scribe, referenced in the paper "A transcription and information extraction system to facilitate EHR documentation in Spanish"
CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition)

Also check the corpus Biomedical Spanish CBOW Word Embeddings in Floret, referenced in https://github.com/PlanTL-GOB-ES/lm-spanish?tab=readme-ov-file.

Also use the Hugging Face datasets:

-LenguajeNaturalAI/casos_clinicos_tratamiento

From "We have collaborated with professionals from different sectors to develop four domain-specific corpora for evaluating LLMs in Spanish"
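
A dataset listed above could be pulled into the corpus roughly as follows. This is a sketch, not a tested pipeline: it assumes the Hugging Face datasets library is installed, and the split name and text field in the commented example are guesses that must be checked against the dataset card. The helper names iter_texts and load_rows are hypothetical.

```python
from typing import Iterable, Iterator


def iter_texts(rows: Iterable[dict], text_field: str) -> Iterator[str]:
    """Yield non-empty, whitespace-normalized text from dataset rows."""
    for row in rows:
        text = " ".join(str(row.get(text_field) or "").split())
        if text:
            yield text


def load_rows(dataset_id: str, split: str):
    """Load one split of a Hugging Face dataset."""
    # Imported lazily so iter_texts can be used without the library installed.
    from datasets import load_dataset
    return load_dataset(dataset_id, split=split)


# Example call (needs network access; "train" and "text" are assumptions):
# rows = load_rows("LenguajeNaturalAI/casos_clinicos_tratamiento", split="train")
# texts = list(iter_texts(rows, text_field="text"))
```

Separating loading from text extraction lets the same iter_texts helper be reused for sources that are not hosted on Hugging Face.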

dionis / SpanishMedicaLLM

Build a corpus of medical documents in Spanish for the training of an LLM #6