Create a first version of a corpus of medical texts in Spanish for building an LLM

Using the corpora that were used to build Meditron as a reference, create a corpus with the same characteristics for training an LLM.

Sources to consult:

Expected results:

A medical corpus in Spanish that can be used as input for fine-tuning or training an LLM.

A document that lists the sources used to build the corpus, the decisions behind their selection, and the characteristics of each source.
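The two deliverables above could be kept in sync with a small helper. This is a minimal sketch, not the layout used by Meditron. It assumes a JSONL corpus file with one document per line, and every name in it (`SourceRecord`, `add_document`, the field names) is hypothetical and only meant to illustrate what the source documentation might record.

```python
import json
from dataclasses import dataclass


@dataclass
class SourceRecord:
    """Hypothetical per-source entry for the corpus documentation."""
    name: str              # human-readable source name
    url: str               # where the texts were obtained
    license: str           # usage terms, as stated by the source
    selection_reason: str  # why the source was included
    num_documents: int = 0
    num_words: int = 0


def add_document(path: str, source: SourceRecord, text: str) -> None:
    """Append one document to a JSONL corpus file and update the source counts."""
    record = {"source": source.name, "text": text}
    with open(path, "a", encoding="utf-8") as fh:
        # ensure_ascii=False keeps Spanish accented characters readable.
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    source.num_documents += 1
    source.num_words += len(text.split())  # whitespace word count, an approximation
```

Once all documents are added, the `SourceRecord` entries can be dumped into the documentation with `dataclasses.asdict`, so the reported counts always match the corpus file.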

Note: See how this is done in the article "Meditron-70b: Scaling medical pretraining for large language models", its appendices, and the source code released with the model.
