Identify which Spanish language corpora can be used for training an LLM in Spanish for the medical context.

Taking as a guide the resources used for training and fine-tuning the model proposed in the article Meditron-70b: Scaling medical pretraining for large language models (Meditron), check whether other free resources are available from universities or research institutions in:

Mexico
Chile
Uruguay
Argentina
Cuba (https://www.sld.cu/)

Resources to consider

https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es?tab=readme-ov-file

Resource built by another team member (Discord user @danielbrdz)

Spanish Biomedical Crawled Corpus (https://zenodo.org/records/5513237#.Yp7lU_exWV4)

Medical Lexicon for Spanish (MedLexSp) [DATASET] (Paper)

A Survey of Spanish Clinical Language Models (Paper)

Spanish Pre-Trained Language Models for HealthCare Industry (Paper)

Fine-Tuned Large Language Models for Symptom Recognition from Spanish Clinical Text (Paper)

Pretrained biomedical language models for clinical NLP in Spanish (Paper)

Pre-trained language models in Spanish for health insurance coverage (Paper)

Note: For each resource, check whether it offers an API for downloading the data or whether it is necessary to scrape it.
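As a starting point for the check in the note above, the sketch below tests whether a resource link points to a Zenodo record and, if so, derives the matching REST API URL and lists the files in the record. It is only an illustration: the helper names (`zenodo_api_url`, `list_files`, `fetch_record`) are hypothetical, and the JSON layout (a `files` list whose entries carry `key` and `links.self`) is an assumption about the Zenodo API response that should be verified against a real record before relying on it.

```python
import json
import re
import urllib.request
from typing import List, Optional, Tuple

# Matches record pages such as https://zenodo.org/records/5513237
ZENODO_RECORD = re.compile(r"zenodo\.org/records?/(\d+)")


def zenodo_api_url(page_url: str) -> Optional[str]:
    """Return the REST API URL for a Zenodo record page, or None otherwise."""
    match = ZENODO_RECORD.search(page_url)
    if match is None:
        return None
    return f"https://zenodo.org/api/records/{match.group(1)}"


def list_files(record: dict) -> List[Tuple[str, str]]:
    """Extract (file name, download link) pairs from a decoded record.

    Assumes each entry in record["files"] has "key" and "links"["self"];
    missing fields are returned as empty strings instead of raising.
    """
    return [
        (entry.get("key", ""), entry.get("links", {}).get("self", ""))
        for entry in record.get("files", [])
    ]


def fetch_record(api_url: str, timeout: float = 30.0) -> dict:
    """Download and decode the record JSON (requires network access)."""
    with urllib.request.urlopen(api_url, timeout=timeout) as response:
        return json.load(response)
```

For the Spanish Biomedical Crawled Corpus link listed above, `zenodo_api_url` yields an API URL that can be passed to `fetch_record`. A link that returns `None` (for example https://www.sld.cu/) has no Zenodo API and needs a manual check to decide whether scraping is required.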

Expected Result:

List of resources that can be used to enrich the training of models in Spanish.
A README in Markdown, or another type of document, that records the details of the research process to support subsequent decision making.
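One possible way to produce the resource list as a Markdown document is sketched below. The `Resource` fields and the `to_markdown` helper are hypothetical names chosen for illustration, and the `access` value is a placeholder to be filled in once each resource has actually been checked.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resource:
    name: str
    url: str
    kind: str    # e.g. "corpus", "lexicon", "paper"
    access: str  # e.g. "direct download", "API", "scraping", "to be verified"


def to_markdown(resources: List[Resource]) -> str:
    """Render the inventory as a Markdown table, one row per resource."""
    lines = [
        "| Name | Type | Access | URL |",
        "| --- | --- | --- | --- |",
    ]
    for r in resources:
        lines.append(f"| {r.name} | {r.kind} | {r.access} | {r.url} |")
    return "\n".join(lines)


inventory = [
    Resource(
        name="Spanish Biomedical Crawled Corpus",
        url="https://zenodo.org/records/5513237",
        kind="corpus",
        access="to be verified",
    ),
]
readme_table = to_markdown(inventory)
```

Writing `readme_table` to a README file keeps the findings in one place, and extra columns (license, size, country of origin) can be added in the same way if they turn out to matter for the decision.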

IMPORTANT

From the YouTube talk NLP clínico en español con Jocelyn Dunstan | Hackathon Somos NLP 2023 and the references given by Jocelyn Dunstan, PhD, also consider:

The paper Pre-trained Biomedical Language Models for Clinical NLP in Spanish, which references the downloaded corpus MeSpEn_Parallel-Corpora.
The paper The E3C Project: European Clinical Case Corpus, which references the downloaded corpus European Clinical Case Corpus.
The paper A transcription and information extraction system to facilitate EHR documentation in Spanish, which references the downloaded corpus Files-digital-scribe.
The CANTEMIST corpus (CANcer TExt Mining Shared Task – tumor named entity recognition).

Also check the Biomedical Spanish CBOW Word Embeddings in Floret, listed at https://github.com/PlanTL-GOB-ES/lm-spanish?tab=readme-ov-file.
